GVS: Genome Variation Server
An NHLBI Program for Genomic Applications  

GVS Frequently-Asked Questions

To submit a question for possible inclusion here, please use the "Contact Us" link on the home page.

Does my gene search include SNPs in the promoter region?

If you are searching the database, no promoter region is included; only the transcribed portion of the gene is selected. However, you can extend the region with "upstream" and "downstream" on the same form where you enter the gene name (or ID). "Upstream" is on the 5' end, and "downstream" is on the 3' end of the gene, so extending upstream can be used to include a promoter region. If you are searching by candidate gene, a few kb upstream and downstream are usually included, depending on the region selected in the original sequencing.

How can I find out if there are alternative transcripts for my gene?

Query the gene, and click the "display snp summary" button. Look at the "Function" column. If there are SNPs in the transcribed region of the gene, the function will be a link. Clicking on this link brings up the gene models for each transcript. The gene BRCA1 is an example of a gene having many alternative transcripts.

How can I find SNPs that are in linkage disequilibrium with a given SNP?

(1) select "dbSNP rsID" on the home page
(2) enter the rs ID of the SNP, and extend the chromosome region upstream and/or downstream to define the region of nearby SNPs (large enough to contain a number of other SNPs)
(3) set "Display SNPs By" to Custom-Text
(4) click on the "display linkage disequilibrium" button
(5) select the radio button "SNPs paired with ..."
(6) choose an r2 threshold and select annotation
(7) click the submit button

How does tag-SNP selection in GVS compare to that of the ldSelect Perl script at http://pga.gs.washington.edu?

The original Perl algorithm has been re-written in Java. The Java version has been augmented with the option to put SNPs that have many missing genotypes (NN) into separate bins, as missing genotypes skew the results. For the same genotypes, if the "Data Coverage for Tag SNPs" and "Data Coverage for Clustering" parameters are set to 0, and the "r2 Threshold" is set to the same value in both tools (the default is 0.64 for ldSelect and 0.80 for GVS), then the tag SNPs should be the same.

What are "other SNPs" in the tag-SNP table?

For each bin, the "tag SNPs" are the SNPs for which the pairwise-r2 values between them and all other SNPs in the bin are greater than or equal to the r2-threshold you set with the "r2 Threshold" parameter, while the "other SNPs" are SNPs for which the pairwise-r2 values to one or more SNPs in the bin are less than the r2-threshold. The "tag SNPs" are the better choice for genotyping medical samples, though "other SNPs" may be chosen if the "tag SNPs" cannot be used (for example, if there are technical genotyping problems). If there is more than one tag SNP for a bin, only one of them needs to be chosen for genotyping. If you have more information about some tag SNPs, such as their location in the promoter region, in coding regions, or in UTRs, or any other biological relevance, you can use that information to decide which of the tag SNPs in a bin to use.
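
The tag/other distinction described above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not the GVS source code; the function name split_bin, the SNP names, and the r2 values are all made up for the example, and pairs with no listed r2 are assumed to have r2 = 0.

```python
def split_bin(members, r2, threshold=0.8):
    """Split one bin into (tag SNPs, other SNPs).

    r2 maps frozenset({snp_a, snp_b}) to the pairwise r2 value.
    Assumption: a pair that is absent from r2 is treated as r2 = 0.
    """
    def pair_r2(a, b):
        return r2.get(frozenset((a, b)), 0.0)

    tags, others = [], []
    for snp in members:
        # A tag SNP reaches the threshold with every other SNP in the bin.
        if all(pair_r2(snp, other) >= threshold
               for other in members if other != snp):
            tags.append(snp)
        else:
            others.append(snp)
    return tags, others


# Hypothetical r2 values, for illustration only.
example_r2 = {
    frozenset(("s1", "s2")): 0.90,
    frozenset(("s1", "s3")): 0.85,
    frozenset(("s2", "s3")): 0.50,
}
print(split_bin(["s1", "s2", "s3"], example_r2))  # (['s1'], ['s2', 's3'])
```

Note that under this rule a SNP that is alone in its bin is trivially a tag SNP.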

How are the data coverage parameters used for selecting tag SNPs?

The r2 values are calculated for each SNP pair, and if either SNP is missing data for a particular individual, that individual's data is ignored for both SNPs. That is, N11, N12 etc. on the page http://gvs.gs.washington.edu/GVS/HelpLinkageDisequilibrium.jsp will not include those individuals.
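
As a minimal sketch of the pairwise exclusion just described (not the GVS implementation), the following Python keeps only the individuals that have a genotype at both SNPs. The function name complete_pairs and the genotype strings are hypothetical; "NN" is used for a missing genotype as elsewhere in this FAQ.

```python
MISSING = "NN"  # notation used in this FAQ for a missing genotype


def complete_pairs(genotypes_a, genotypes_b):
    """Return (genotype_a, genotype_b) for individuals genotyped at both SNPs.

    Both lists must describe the same individuals in the same order.
    """
    if len(genotypes_a) != len(genotypes_b):
        raise ValueError("both SNPs must list the same individuals")
    return [
        (a, b)
        for a, b in zip(genotypes_a, genotypes_b)
        if a != MISSING and b != MISSING
    ]


# Hypothetical genotypes for five individuals.
snp_a = ["AA", "AG", "NN", "GG", "AG"]
snp_b = ["CC", "NN", "CT", "TT", "CT"]
pairs = complete_pairs(snp_a, snp_b)
print(pairs)  # [('AA', 'CC'), ('GG', 'TT'), ('AG', 'CT')]
```

Only the three remaining individuals would contribute to the counts used for r2 for this SNP pair.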

When both genotype coverage parameters are set to zero, each SNP is looked at, and the number of other SNPs exceeding the r2 threshold is counted. The SNP with the maximum number is identified, and it plus all the SNPs with r2 above threshold are put in a bin. (Some will later be tag SNPs and some will be "other" SNPs.) This bin of SNPs is removed from the pool, and the process is repeated for the remainder of the SNPs, giving a second bin. Once all SNPs are binned, the process ends.
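
The binning procedure with both coverage parameters at zero can be sketched as follows. This is an illustrative reading of the paragraph above, not the GVS or ldSelect code. Assumptions: the function name greedy_bins is made up, "exceeding the threshold" is taken to mean greater than or equal to it, ties are broken by taking the first SNP in the list, and pairs with no listed r2 are treated as r2 = 0.

```python
def greedy_bins(snps, r2, threshold=0.8):
    """Greedy binning of SNPs by pairwise r2.

    r2 maps frozenset({snp_a, snp_b}) to the pairwise r2 value.
    """
    def linked(a, b):
        return r2.get(frozenset((a, b)), 0.0) >= threshold

    pool = list(snps)
    bins = []
    while pool:
        # Find the SNP with the most partners at or above the threshold.
        best = max(pool, key=lambda s: sum(linked(s, t) for t in pool if t != s))
        # That SNP plus all of its partners form one bin.
        members = [best] + [t for t in pool if t != best and linked(best, t)]
        bins.append(members)
        # Remove the bin from the pool and repeat with the remainder.
        pool = [s for s in pool if s not in members]
    return bins


# Hypothetical r2 values, for illustration only.
example_r2 = {
    frozenset(("s1", "s2")): 0.90,
    frozenset(("s1", "s3")): 0.85,
    frozenset(("s2", "s3")): 0.50,
    frozenset(("s4", "s5")): 0.95,
}
print(greedy_bins(["s1", "s2", "s3", "s4", "s5", "s6"], example_r2))
# [['s1', 's2', 's3'], ['s4', 's5'], ['s6']]
```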

When the genotype coverage parameters are used, each SNP that is above the tag-SNP-coverage fraction is looked at, and the number of other SNPs exceeding both the r2 threshold and the cluster-coverage fraction is counted. The SNP with the maximum number is identified, and it plus all the SNPs with r2 and coverage above thresholds are put in a bin. (Some will later be tag SNPs and some will be other SNPs. Only those with coverage above the tag-SNP-coverage fraction can be called tag SNPs, unless in a bin by themselves.) This bin of SNPs is removed from the pool, and the process is repeated for the remainder of the SNPs, giving a second bin. Once all SNPs are binned, the process ends. Any low-coverage SNPs that have not been binned are put into separate bins.
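
A hedged sketch of the coverage-aware variant follows. It is one possible reading of the paragraph above, not the GVS code. Assumptions: the function and parameter names are made up, coverage is the fraction of individuals with a measured genotype, "above" a coverage fraction is taken to mean greater than or equal to it, and the thresholds in the example (0.8 and 0.7) are arbitrary values chosen for illustration.

```python
def greedy_bins_with_coverage(snps, r2, coverage, threshold,
                              tag_coverage, cluster_coverage):
    """Greedy binning where only well-covered SNPs can seed or join a bin.

    r2 maps frozenset({snp_a, snp_b}) to the pairwise r2 value;
    coverage maps each SNP to its fraction of measured genotypes.
    """
    def linked(a, b):
        return r2.get(frozenset((a, b)), 0.0) >= threshold

    pool = list(snps)
    bins = []
    while True:
        # Only SNPs meeting the tag-SNP coverage fraction are examined.
        seeds = [s for s in pool if coverage[s] >= tag_coverage]
        if not seeds:
            break

        def partners(seed):
            # Partners must meet both the r2 and cluster-coverage thresholds.
            return [t for t in pool
                    if t != seed
                    and coverage[t] >= cluster_coverage
                    and linked(seed, t)]

        best = max(seeds, key=lambda s: len(partners(s)))
        members = [best] + partners(best)
        bins.append(members)
        pool = [s for s in pool if s not in members]

    # Low-coverage SNPs that were never binned each get a bin of their own.
    bins.extend([s] for s in pool)
    return bins


# Hypothetical data: s3 has a lot of missing genotypes.
example_r2 = {
    frozenset(("s1", "s2")): 0.90,
    frozenset(("s1", "s3")): 0.95,
    frozenset(("s1", "s4")): 0.85,
}
example_coverage = {"s1": 0.95, "s2": 0.90, "s3": 0.40, "s4": 0.75}
names = ["s1", "s2", "s3", "s4"]
print(greedy_bins_with_coverage(names, example_r2, example_coverage, 0.8, 0.8, 0.7))
# [['s1', 's2', 's4'], ['s3']]
print(greedy_bins_with_coverage(names, example_r2, example_coverage, 0.8, 0.0, 0.0))
# [['s1', 's2', 's3', 's4']]
```

With the coverage fractions set to zero, the poorly covered SNP joins the first bin instead of being placed in a bin by itself.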

As an example, if you choose the gene ACTB and select PDR90, then display tag SNPs, you'll see that there is a SNP at 5340837 that has a lot of missing data, and that it is in a bin by itself. If you set the coverage fractions to zero, that SNP will be put in the first bin. In this case, it's not selected as a tag SNP, but there have been other cases where that did happen.

The optimum choice of coverage thresholds depends on how much missing data there is in the data set chosen. The threshold should probably be high enough to weed out SNPs with coverage much lower than average. Looking at the patterns in the VG2 plot is useful. An unfortunate aspect of this is that SNPs with low coverage end up in a bin by themselves, and thus qualify as "tag SNPs", but in the tables these are identified by surrounding brackets. (See next FAQ.)

What do the square brackets mean in a tag-SNP table?

The brackets indicate a SNP that has a lot of missing data (NN for a large number of samples), specifically that the fraction of measured genotypes is below the "Data Coverage For Tag SNPs" threshold. Often these bracketed SNPs will be in separate bins (if also below the "Data Coverage For Clustering" threshold). They are considered less desirable choices for genotyping in association studies.
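
The bracket rule can be illustrated with a short Python sketch. The function names, the SNP identifier, and the 0.8 threshold are hypothetical; only the rule itself (bracket a SNP whose fraction of measured genotypes is below the tag-SNP coverage threshold) comes from the text above.

```python
def coverage_fraction(genotypes, missing="NN"):
    """Fraction of individuals with a measured (non-missing) genotype."""
    if not genotypes:
        raise ValueError("no genotypes given")
    measured = sum(1 for g in genotypes if g != missing)
    return measured / len(genotypes)


def table_label(snp_id, genotypes, tag_coverage):
    """Wrap the SNP identifier in brackets when its coverage is too low."""
    if coverage_fraction(genotypes) < tag_coverage:
        return f"[{snp_id}]"
    return snp_id


# Hypothetical SNP with two of four genotypes missing (coverage 0.5).
print(table_label("snpA", ["AA", "NN", "NN", "AG"], tag_coverage=0.8))  # [snpA]
print(table_label("snpB", ["AA", "AG", "GG", "AG"], tag_coverage=0.8))  # snpB
```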

How can I automate GVS searches?

The first choice is to use Batch GVS at http://gvsbatch.gs.washington.edu/GVSBatch/. If this is not adequate, it's possible to write a screen scraper. GVS is a Java Enterprise site using Java servlets and JavaServer Pages running on a JBoss server. The navigation between pages works with a session ID that is passed back and forth in cookies. A screen scraper could be written in any language; it must look at the HTTP headers and read cookies. This is not trivial. If you manage it, please put some pauses in the code so the server is not loaded down.
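
The two requirements above (keep the session cookie, pause between requests) can be sketched with the Python standard library. This is only a skeleton under stated assumptions: the helper names are made up, the 5-second default pause is an arbitrary choice, and no GVS page addresses or form parameters are shown because they are not documented here. You would need to work those out from the pages themselves.

```python
import time
import urllib.request
from http.cookiejar import CookieJar


def make_session_opener():
    """Build an opener that stores cookies from responses and sends them
    back on later requests, so a session ID set by the server is kept."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_with(opener):
    """Return a function that downloads one page as text using the opener."""
    def fetch(url):
        with opener.open(url, timeout=60) as response:
            return response.read().decode("utf-8", errors="replace")
    return fetch


def fetch_politely(urls, fetch, pause_seconds=5.0, sleep=time.sleep):
    """Fetch pages one at a time, pausing between consecutive requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(pause_seconds)  # give the server a rest between requests
        pages.append(fetch(url))
    return pages


# Offline demonstration with a stand-in fetch function (no network access).
demo_pages = fetch_politely(["page1", "page2"], fetch=str.upper, pause_seconds=0)
print(demo_pages)  # ['PAGE1', 'PAGE2']
```

For real use, you would pass fetch_with(opener) as the fetch argument, with opener taken from make_session_opener(), so that every request in the run shares the same cookie jar.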