GVS: Genome Variation Server 144

GVS Frequently-Asked Questions

To submit a question for possible inclusion here, please use the "Contact Us" link on the home page.

Does my gene search include SNPs in the promoter region?

If you are searching the database, no promoter region is included; only the transcribed portion of the gene is selected. However, you can extend the region using "upstream" and "downstream" in the same form where you enter the gene name (or ID). "Upstream" is on the 5' end and "downstream" is on the 3' end of the gene, so extending upstream can be used to include a promoter region. If you are searching by candidate gene, a few kb upstream and downstream are usually included, depending on the region selected in the original sequencing.

How can I find out if there are alternative transcripts for my gene?

Query the gene, and click the "display snp summary" button. Look at the "Function" column. If there are SNPs in the transcribed region of the gene, the function will be a link. Clicking on this link brings up the gene models for each transcript. The gene BRCA1 is an example of a gene having several alternative transcripts.

How can I find SNPs that are in linkage disequilibrium with a given SNP?

(1) select "dbSNP rsID" on the home page
(2) enter the rs ID of the SNP, and extend the chromosome region upstream and/or downstream to define the region of nearby SNPs (large enough to contain a number of other SNPs)
(3) set "Display SNPs By" to Custom-Text
(4) click on the "display linkage disequilibrium" button
(5) select the radio button "SNPs paired with ..."
(6) choose an r2 threshold and select annotation
(7) click the submit button

How does tag-SNP selection in GVS compare to that of the ldSelect Perl script at http://pga.gs.washington.edu?

The original Perl algorithm has been rewritten in Java. The Java version has been augmented with the option to put SNPs that have many missing genotypes (NN) into separate bins, as missing genotypes skew the results. For the same genotypes, if the "Data Coverage for Tag SNPs" and "Data Coverage for Clustering" parameters are set to 0, and the "r2 Threshold" is set to the same value in both programs (the default is 0.64 for ldSelect and 0.80 for GVS), then the tag SNPs should be the same.

What are "other SNPs" in the tag-SNP table?

For each bin, the "tag SNPs" are the SNPs for which the pairwise r2 values between them and all other SNPs in the bin are greater than or equal to the threshold you set with the "r2 Threshold" parameter, while the "other SNPs" are SNPs for which the pairwise r2 values to one or more SNPs in the bin are less than the threshold. The "tag SNPs" are the better choice for variations to genotype in a medical study, though "other SNPs" may be chosen if the "tag SNPs" cannot be used (for example, if there are technical genotyping problems). If there is more than one tag SNP for a bin, only one of them needs to be chosen for genotyping. If you have more information about some tag SNPs, such as their location in the promoter region, in coding regions, or in a UTR, or any other biological relevance, you can use that information to decide which of the tag SNPs in a bin to use.
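
The distinction can be illustrated with a small Python sketch. This is not GVS code: the function names, the SNP labels, and the r2 values are made up for illustration, and pairs missing from the table are assumed to have r2 = 0.

```python
def r2_lookup(r2, a, b):
    """Look up r2 for an unordered SNP pair; unknown pairs count as 0.0 (assumption)."""
    return r2.get((a, b), r2.get((b, a), 0.0))


def split_bin(bin_snps, r2, threshold=0.80):
    """Split one bin into tag SNPs and other SNPs.

    A SNP is a tag SNP if its r2 with every other SNP in the bin is
    greater than or equal to the threshold; otherwise it is an "other" SNP.
    """
    tags, others = [], []
    for snp in bin_snps:
        partners = [s for s in bin_snps if s != snp]
        if all(r2_lookup(r2, snp, p) >= threshold for p in partners):
            tags.append(snp)
        else:
            others.append(snp)
    return tags, others


# Made-up example bin: snp1 is well correlated with both of the others,
# but snp2 and snp3 are only weakly correlated with each other.
example_r2 = {
    ("snp1", "snp2"): 0.90,
    ("snp1", "snp3"): 0.85,
    ("snp2", "snp3"): 0.50,
}
tags, others = split_bin(["snp1", "snp2", "snp3"], example_r2)
print(tags, others)
```

In this made-up bin only snp1 qualifies as a tag SNP, so it would be the preferred SNP to genotype for the bin.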

How are the data coverage parameters used for selecting tag SNPs?

The r2 values are calculated for each SNP pair, and if either SNP is missing data for a particular individual, that individual's data is ignored for both SNPs. That is, N11, N12 etc. on the linkage-disequilibrium page will not include those individuals.
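
The following Python sketch shows the idea of dropping an individual from a pair when either genotype is missing. It is only an illustration: r2 is computed here as the squared correlation of allele counts, which is a common approximation and is an assumption, not necessarily the calculation GVS performs. The function name and the genotypes are made up.

```python
def pairwise_r2(snp_a, snp_b, missing="NN"):
    """Approximate r2 between two SNPs from per-individual genotypes.

    snp_a and snp_b are lists of two-letter genotypes (e.g. "AG") in the
    same individual order. Individuals with a missing genotype at either
    SNP are ignored for both SNPs.
    """
    pairs = [(a, b) for a, b in zip(snp_a, snp_b)
             if a != missing and b != missing]
    if len(pairs) < 2:
        return None  # not enough shared data to say anything

    # Count copies of an arbitrary reference allele (the first one seen).
    ref_a = pairs[0][0][0]
    ref_b = pairs[0][1][0]
    x = [a.count(ref_a) for a, _ in pairs]
    y = [b.count(ref_b) for _, b in pairs]

    n = len(pairs)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    syy = sum((v - mean_y) ** 2 for v in y)
    sxy = sum((vx - mean_x) * (vy - mean_y) for vx, vy in zip(x, y))
    if sxx == 0 or syy == 0:
        return None  # one SNP shows no variation among the shared individuals
    return (sxy * sxy) / (sxx * syy)


# Individuals 4 and 5 each lack one genotype, so only the first three are used.
snp_a = ["AA", "AG", "GG", "NN", "AA"]
snp_b = ["CC", "CT", "TT", "CC", "NN"]
r2_value = pairwise_r2(snp_a, snp_b)
print(r2_value)
```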

When both genotype coverage parameters are set to zero, each SNP is looked at, and the number of other SNPs exceeding the r2 threshold is counted. The SNP with the maximum number is identified, and it plus all the SNPs with r2 above threshold are put in a bin. (Some will later be tag SNPs and some will be "other" SNPs.) This bin of SNPs is removed from the pool, and the process is repeated for the remainder of the SNPs, giving a second bin. Once all SNPs are binned, the process ends.

When the genotype coverage parameters are used, each SNP that is above the tag-SNP-coverage fraction is looked at, and the number of other SNPs exceeding both the r2 threshold and the cluster-coverage fraction is counted. The SNP with the maximum number is identified, and it plus all the SNPs with r2 and coverage above thresholds are put in a bin. (Some will later be tag SNPs and some will be other SNPs. Only those with coverage above the tag-SNP-coverage fraction can be called tag SNPs, unless in a bin by themselves.) This bin of SNPs is removed from the pool, and the process is repeated for the remainder of the SNPs, giving a second bin. Once all SNPs are binned, the process ends. Any low-coverage SNPs that have not been binned are put into separate bins.
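
The binning procedure described above can be sketched in Python as follows. This is a simplified illustration, not the GVS or ldSelect implementation: the function and parameter names are hypothetical, "exceeding" a threshold is taken to mean greater than or equal to it, ties are broken by input order, and the example data are made up.

```python
def r2_of(r2, a, b):
    """r2 for an unordered pair; pairs not in the table count as 0.0 (assumption)."""
    return r2.get((a, b), r2.get((b, a), 0.0))


def greedy_bins(snps, r2, coverage, r2_threshold=0.80,
                tag_cov=0.0, cluster_cov=0.0):
    """Greedy binning as described in the text; the seed SNP is first in each bin."""
    pool = list(snps)
    bins = []

    def partners(seed):
        # SNPs still in the pool that pass both the r2 and clustering-coverage cuts.
        return [s for s in pool
                if s != seed
                and coverage[s] >= cluster_cov
                and r2_of(r2, seed, s) >= r2_threshold]

    while True:
        # Only SNPs with enough coverage may seed a bin.
        seeds = [s for s in pool if coverage[s] >= tag_cov]
        if not seeds:
            break
        best = max(seeds, key=lambda s: len(partners(s)))
        new_bin = [best] + partners(best)
        bins.append(new_bin)
        pool = [s for s in pool if s not in new_bin]

    # Any low-coverage SNPs left over go into bins by themselves.
    bins.extend([s] for s in pool)
    return bins


# Made-up data: s3 is well correlated with s1 but has poor coverage.
snps = ["s1", "s2", "s3", "s4"]
r2 = {("s1", "s2"): 0.90, ("s1", "s3"): 0.85, ("s2", "s3"): 0.50}
coverage = {"s1": 1.0, "s2": 1.0, "s3": 0.4, "s4": 1.0}

bins_no_cov = greedy_bins(snps, r2, coverage)
bins_with_cov = greedy_bins(snps, r2, coverage, tag_cov=0.8, cluster_cov=0.8)
print(bins_no_cov)
print(bins_with_cov)
```

With both coverage parameters at zero, the low-coverage SNP s3 joins the bin seeded by s1; with the coverage thresholds at 0.8 it ends up in a bin by itself.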

As an example, if you choose the gene CASP9 and select PDR90, then display tag SNPs, you'll see that there is a SNP rs4233535 that has a lot of missing data, and that it is in a bin by itself. If you set the coverage fractions to zero, that SNP will be put in a bin with several other SNPs.

The optimum choice of coverage thresholds depends on how much missing data there is in the data set chosen. The thresholds should probably be high enough to weed out SNPs with coverage much lower than average. Looking at the patterns in the VG2 plot is useful. An unfortunate side effect is that SNPs with low coverage end up in a bin by themselves, and thus qualify as "tag SNPs"; in the tables, however, these are identified by surrounding brackets. (See next FAQ.)

What do the square brackets mean in a tag-SNP table?

The brackets indicate a SNP that has a lot of missing data (NN for a large number of individuals), specifically that the fraction of measured genotypes is below the "Data Coverage For Tag SNPs" threshold. Often these bracketed SNPs will be in separate bins (if also below the "Data Coverage For Clustering" threshold). They are considered less desirable choices for genotyping in association studies.

How can I automate GVS searches?

The first choice is to use Batch GVS. If this is not adequate, it is possible to write a screen scraper. GVS is a Java Enterprise site using Java servlets and JavaServer Pages running on a JBoss server. The navigation between pages works with a session ID that is passed back and forth in cookies. A screen scraper could be written in any language, but it must inspect the HTTP headers and read the cookies. This is not trivial. If you manage it, please put some pauses in the code so that the server is not overloaded.
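
As a starting point only, here is a minimal Python sketch of the two ingredients mentioned above: keeping cookies between requests and pausing after each request. It uses only the Python standard library. The function names and the pause length are assumptions, and the sketch does not know anything about the actual GVS pages, form fields, or URLs, which you would have to work out yourself.

```python
import time
import urllib.request
from http.cookiejar import CookieJar


def make_session():
    """Build a URL opener that stores cookies and sends them back on later requests."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def polite_get(opener, url, pause_seconds=5.0):
    """Fetch one page with the shared cookie jar, then pause before returning.

    The pause keeps a loop of requests from loading down the server.
    The default of 5 seconds is an arbitrary choice for this sketch.
    """
    with opener.open(url, timeout=60) as response:
        body = response.read().decode("utf-8", errors="replace")
    time.sleep(pause_seconds)
    return body


# One opener (and therefore one cookie jar) is reused for the whole session,
# so a session cookie set by the first page is returned with later requests.
opener, jar = make_session()
```

Each page would then be fetched with polite_get(opener, url), using the same opener throughout so that the session ID is carried from page to page.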

Can I combine HapMap data and Seattle SNPs PGA data to get tag SNPs?

This is not recommended. In the Seattle SNPs PGA project, complete genes were resequenced (including some bases upstream and downstream, and including introns if the gene was small, but only a sampling of introns for large genes). The number of individuals was 47 for most of PGA. However, only a few hundred genes were covered by the project. The genes were in general chosen to be those for which variations might be expected to contribute to disease (especially heart disease). SNPs were about 200 bp apart. This was a resequencing project: the complete sequence was measured for a gene for a number of individuals, and all variation was picked out.

The goal of the HapMap project was to cover the entire genome for several population groups, but the spacing of the SNPs was larger, 1000 to 2000 bp apart. The HapMap group selected a few million SNPs to genotype, and especially looked for SNPs that others had genotyped or that had been seen twice. Few rare SNPs were chosen. This was largely a genotyping project: SNPs were picked, and genotypes were determined only at those positions.

When you select projects with differing coverage, GVS merges the two and makes a rectangular grid of position vs. individual that you see in the graphical display. If there is no measurement for one of the cells, it is filled in with NN. If you choose merge mode C, all genotypes are kept. If you choose merge mode B, only those positions are kept for which there are measurements by both projects. If you choose merge mode B for Seattle SNPs plus HapMap, much of the Seattle SNPs data will be thrown out. For most genes, there will be no Seattle SNPs data, and you will be forced to use only HapMap. (However, watch out for HapMap 3 data, where fewer SNPs were covered; you may not want to request HapMap 3 data.)
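
The effect of the two merge modes can be pictured with a small Python sketch. It is an illustration of the description above, not GVS code: the data layout (a dictionary of position to per-individual genotypes), the function name, and the example genotypes are all assumptions, and the two projects are assumed to have no individuals in common.

```python
def merge_projects(project_a, project_b, mode="C"):
    """Merge two projects into a position-by-individual grid.

    Each project maps position -> {individual: genotype}.
    Mode "C" keeps every position seen in either project;
    mode "B" keeps only positions measured by both projects.
    Cells with no measurement are filled with "NN".
    """
    individuals = sorted({ind
                          for project in (project_a, project_b)
                          for calls in project.values()
                          for ind in calls})
    if mode == "C":
        positions = sorted(set(project_a) | set(project_b))
    elif mode == "B":
        positions = sorted(set(project_a) & set(project_b))
    else:
        raise ValueError("mode must be 'B' or 'C'")

    grid = {}
    for pos in positions:
        calls = {}
        calls.update(project_a.get(pos, {}))
        calls.update(project_b.get(pos, {}))
        grid[pos] = [calls.get(ind, "NN") for ind in individuals]
    return individuals, grid


# Made-up data: a densely resequenced project and a sparsely genotyped one.
dense = {100: {"a1": "AA", "a2": "AG"}, 300: {"a1": "CC", "a2": "CT"}}
sparse = {100: {"b1": "AG"}, 1500: {"b1": "TT"}}

people, grid_c = merge_projects(dense, sparse, mode="C")
_, grid_b = merge_projects(dense, sparse, mode="B")
print(people)
print(grid_c)
print(grid_b)
```

In mode C the grid keeps all three positions but contains blocks of NN; in mode B only position 100 survives, and the dense project's data at position 300 is discarded.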

If you attempt to combine two data sets of differing coverage, the results may be distorted for non-zero coverage thresholds (Tag-SNPs Coverage and Clustering Coverage). The coverage thresholds were introduced to handle the case where data came from one project with almost uniform individual coverage, but with the occasional SNP that had a lot of missing data. They were not meant to handle large blocks of NN genotypes, such as those you get by combining data from two projects. When there are many SNPs with large numbers of NNs, the tag-SNP results will have large numbers of bracketed (low-coverage) SNPs in separate bins.

In the tag-SNP algorithm, r2 is calculated for each pair of SNPs. If both projects have data for all individuals for each SNP in the pair, r2 is calculated using individuals from both projects. If Seattle SNPs has data for the two SNPs, but HapMap does not have data for either, r2 will be calculated with Seattle SNPs data. If Seattle SNPs has data for both SNPs, and HapMap has data for one SNP but not the other, r2 will be calculated with Seattle SNPs data only, as there will be nothing to compare for HapMap. Next, the SNPs are grouped into bins according to the r2 values, using the coverage thresholds chosen. Note that some r2 values have contributions from both projects and some do not, so the r2 values are not all on an equal footing, though they are to a large degree valid, as only one population group is being considered. However, the use of non-zero coverage thresholds skews the binning in such a way as to put low-coverage SNPs in bins by themselves. If the coverage is low because one project did not even attempt the measurement, and that is true for some SNPs but not others, the results may not make sense.

You may be best off not trying to combine HapMap and Seattle SNPs: use Seattle SNPs if the data is available in a region (as more SNPs will have genotypes), and HapMap otherwise. Alternatively, you could try setting the coverage thresholds low enough that they won't kick in just because the targets are different; if Seattle SNPs has 47 individuals and HapMap has 60, missing HapMap data would result in a coverage of 47/107 (if no Seattle SNPs data is missing).
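
The arithmetic behind that figure is shown in the short Python sketch below, using the 47 and 60 individuals from the example above. The function name is made up for illustration.

```python
def coverage_fraction(genotyped, total):
    """Fraction of individuals in the merged data with a measured genotype."""
    return genotyped / total


seattle_individuals = 47
hapmap_individuals = 60

# A SNP measured only by Seattle SNPs, with no Seattle SNPs genotypes missing.
cov = coverage_fraction(seattle_individuals,
                        seattle_individuals + hapmap_individuals)
print(round(cov, 3))  # 0.439
```

Under these assumptions, a coverage threshold somewhat below 0.44 would not penalize such a SNP merely because HapMap did not attempt it.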

How you proceed will depend on whether you are interested in rare SNPs (Seattle SNPs has more of those when data is available), and on what population groups you're interested in (HapMap has more choice). It's important to examine the visual genotype graphics as a sanity check.