There are 4 steps to access GVS information:
1. select the search type
2. select the data source
3. set query and analysis parameters (optional)
4. choose the results to be displayed
Choose the search type on the home page. There are 3 categories: "search database" (the most common option), "search candidate genes",
and "input from file".
Within "search database", there are five different methods for querying variations:
A. chromosomal location (NCBI 38/hg38)
B. gene name (HUGO, upper or lower case; synonyms are ok but may be case-sensitive)
C. gene ID (from NCBI Entrez Gene)
D. dbSNP rs ID
E. browse (choose a chromosome section and select a gene from the map)
For options A through D, the next page presents a form for the chromosome region, gene, or rs ID. In the cases of B through D, you have the option to extend the chromosome region.
In cases B and C, "upstream" is on the 5' end, and "downstream" is on the 3' end of the gene. For case D, "upstream" and "downstream" are relative to the genome assembly (NCBI 38 or hg38).
In the "browse" case E, you can choose a 10-Mb section of a chromosome on the next page, optionally navigate on the resulting map to a region of interest, and select a gene.
When a search by gene name or gene ID is made, there are sometimes alternative transcripts. Preference is given to transcripts with
an accession ID beginning with NM_ (NCBI RefSeq). If there is at least one such NM_ transcript, the longest NM_ transcript is chosen.
Otherwise the longest transcript of any kind is chosen. If there is a tie in the number of transcribed bases, the transcript with the largest number of coding bases is selected.
The chosen transcript is displayed in the header information when the Text display option is chosen (see below). If you desire more control
over the genomic region, choose the chromosomal location search type.
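The transcript preference described above can be sketched in a few lines of Python. This is only an illustration, not the GVS implementation: the Transcript class, its field names, and the accession values are hypothetical, and the sketch assumes the coding-base tie-break applies in both the NM_ and the non-NM_ case.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    accession: str          # hypothetical example values, e.g. "NM_0001"
    transcribed_bases: int  # length of the transcript
    coding_bases: int       # number of coding bases


def choose_transcript(transcripts):
    """Pick one transcript: prefer NM_ accessions, then the longest,
    then the one with the most coding bases (assumed tie-break)."""
    if not transcripts:
        raise ValueError("no transcripts to choose from")
    refseq = [t for t in transcripts if t.accession.startswith("NM_")]
    pool = refseq if refseq else transcripts
    return max(pool, key=lambda t: (t.transcribed_bases, t.coding_bases))


candidates = [
    Transcript("XM_0001", 5000, 1200),  # longest overall, but not NM_
    Transcript("NM_0001", 3000, 900),
    Transcript("NM_0002", 3000, 1100),  # ties on length, more coding bases
]
chosen = choose_transcript(candidates)
```

Here the XM_ transcript is ignored even though it is the longest, because at least one NM_ transcript exists.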
For "search candidate genes" there are 2 choices:
PGA (SeattleSNPs Programs for Genomic Applications) and
EGP (Environmental Genome Project SNPs Program at the University of Washington). Most of this data is also available in our local database
(via dbSNP). Note that this data usually includes some genotypes upstream and downstream of the transcribed region.
If you select "input from file" you will be able to upload a file of genotypes for analysis.
Database queries give genotype search results
in a table of data sets categorized by the submitter and the population in which the
variations were identified, with the populations having the most genotyped polymorphisms
for your query appearing at the top of the list.
If HapMap phased genotypes are available, there will be a column indicating the number.
For database searches, select one or more Population/Submitter data sets from the top table.
For candidate-gene searches, select one of the genes that were resequenced.
For file input, select your genotype file ("Choose File"). The file must have
one line for each genotype, each with 4 white-space-separated values:
  (a) the position (or other identifying string) of the variation
  (b) the sample ID
  (c) the first allele
  (d) the second allele
An unknown allele should be indicated as "N". If either allele is "N", the genotype will be set to NN.
If a "?" character is submitted for either allele, the genotype will be set to NN.
If there are any header lines, there must be a "#" at the beginning of the line.
Here is an example. If you
have genotypes in an Excel spreadsheet with these 4 columns, and save it as "Text (Tab delimited)", it should work.
If you encounter a message that the file is of type application/octet-stream, try adding ".txt" to the end of the file name.
See also "Interpret SNP IDs as Chromosome Positions" below.
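As a rough way to check a genotype file before uploading it, the format rules above can be expressed as a small Python parser. This is a sketch under stated assumptions: the function name and the sample IDs are made up for illustration, skipping blank lines is an assumption, and GVS itself may validate the file differently.

```python
def parse_genotype_lines(lines):
    """Parse genotype lines of the form: variation_id sample_id allele1 allele2.

    Lines starting with "#" are treated as headers. Blank lines are skipped
    (an assumption; the format description does not mention them).
    """
    genotypes = []
    for line_number, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()  # splits on any white space, including tabs
        if len(fields) != 4:
            raise ValueError(
                f"line {line_number}: expected 4 values, got {len(fields)}"
            )
        variation_id, sample_id, allele1, allele2 = fields
        # An "N" or "?" in either allele makes the whole genotype NN.
        if "N" in (allele1, allele2) or "?" in (allele1, allele2):
            allele1 = allele2 = "N"
        genotypes.append((variation_id, sample_id, allele1, allele2))
    return genotypes


example = [
    "# position sample allele1 allele2",
    "5533400\tNA00001\tA\tG",   # tab-delimited, as saved from a spreadsheet
    "5533400 NA00002 N G",      # one unknown allele
    "5533512 NA00001 ? ?",      # "?" is also treated as unknown
]
parsed = parse_genotype_lines(example)
```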
All the parameters have defaults, so setting any of them is optional. More parameters can be found after clicking the
"Show More Parameters" button. (Additional ones have a green border.)
Merge Samples and Variations:
A - common samples with combined variations: genotypes will be output for the samples common to all selected data sets and the combined variations from all selected data sets.
B - combined samples with common variations: genotypes will be output for the variations common to all selected data sets and the combined samples from all selected data sets.
C - combined samples with combined variations: genotypes will be output for the combined variations and combined samples from all selected data sets.
See this link for details.
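The three merge options amount to choosing an intersection ("common") or a union ("combined") along each axis. The following Python sketch illustrates that reading; the function name and the representation of a data set as a (samples, variations) pair are assumptions made for the example, not part of GVS.

```python
def merge_data_sets(data_sets, option):
    """Return the (samples, variations) that would be output.

    data_sets: non-empty list of (samples, variations) pairs.
    option: "A", "B", or "C" as described above.
    """
    if not data_sets:
        raise ValueError("at least one data set is required")
    sample_sets = [set(samples) for samples, _ in data_sets]
    variation_sets = [set(variations) for _, variations in data_sets]

    common_samples = set.intersection(*sample_sets)
    combined_samples = set.union(*sample_sets)
    common_variations = set.intersection(*variation_sets)
    combined_variations = set.union(*variation_sets)

    if option == "A":
        return common_samples, combined_variations
    if option == "B":
        return combined_samples, common_variations
    if option == "C":
        return combined_samples, combined_variations
    raise ValueError(f"unknown merge option: {option!r}")


# Two hypothetical data sets sharing one sample and one variation.
first = ({"s1", "s2"}, {"v1", "v2"})
second = ({"s2", "s3"}, {"v2", "v3"})
merged_a = merge_data_sets([first, second], "A")
merged_b = merge_data_sets([first, second], "B")
merged_c = merge_data_sets([first, second], "C")
```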
Interpret SNP IDs as Chromosome Positions:
This checkbox appears only for file input (not database or candidate-gene searches).
If turned on, the identifying string of the variation in the input file must be of the form chr*:* (e.g. chr7:5533400)
designating the chromosome location in hg38 (NCBI build GRCh38) coordinates (1-based). The initial "chr" is optional.
If this option is chosen, the rs ID of a known SNP at that location will appear in the SNP summary output, as well as the SNP annotation.
If the SNP ID in the file is not of the expected format, the search will proceed, but the rs ID and any annotation will be set to "unknown".
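To check identifiers against the chr*:* form before uploading, a regular expression along these lines may help. It is a sketch only: the exact set of chromosome names GVS accepts is not stated here, so the pattern allows any non-empty name, and treating "chr" as case-sensitive is an assumption.

```python
import re

# Optional "chr" prefix, a chromosome name, a colon, and a numeric position.
POSITION_ID = re.compile(r"^(?:chr)?([^:\s]+):([0-9]+)$")


def parse_position_id(snp_id):
    """Return (chromosome, position) or None if the ID is not a position.

    Positions are 1-based, so 0 is rejected.
    """
    match = POSITION_ID.match(snp_id.strip())
    if match is None:
        return None
    chromosome, position = match.group(1), int(match.group(2))
    if position < 1:
        return None
    return chromosome, position


with_prefix = parse_position_id("chr7:5533400")
without_prefix = parse_position_id("7:5533400")
not_a_position = parse_position_id("rs12345")
```

An identifier for which the function returns None corresponds to the case where the rs ID and annotation would be reported as "unknown".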
Output SNPs By: type of identifier for the variation
rs ID or Position are the choices for data from the GVS database,
where rs ID is the dbSNP reference id for a SNP, and
Position is the chromosome location mapped to the human genome reference sequence based on NCBI build GRCh38 (December 2013).
Under Position there are two choices that affect only the visual genotype graph: "Position in graph" and "rs ID and Position in graph".
In the latter case, both the values are displayed in the graph. In all cases, rs ID or Position, the variations
in the graphs are shown in order of chromosome position (if not clustered by linkage disequilibrium or tag SNPs).
In the case of data from PGA or EGP, the choices are Local Position for the local reference position and Chromosome Position
for the chromosome location mapped to the build-38 sequence. SNP ID in File is the only choice if you are loading your own
genotypes from a file. The first column in your white-space-separated input file will be treated as
the variation identifier (though it need not be a position, just any unique identifier).
Output Individuals By: type of identifier for the individual
(for database searches only, available after "Show More Parameters" clicked)
dbSNP IDs or Submitter IDs are the choices for data from the GVS database,
where dbSNP IDs is the dbSNP population ID followed by a colon and the dbSNP individual ID, and
Submitter IDs is a comma-separated list of IDs by the submitters (for the population selected) to dbSNP.
For example, for HapMap data, the submitter IDs will be the Coriell identifiers beginning with "NA".
See the list of populations and their IDs
and the list of individuals and their IDs in our database. If Submitter IDs
and multiple populations are chosen, and there are overlapping individuals, the overlapped individual IDs will revert to the dbSNP ID
with a population of 0 (e.g. 0:5133).
Display SNPs By: a format for variation and genotype results
The Table/Image option prompts for a choice of table or graphical format.
The table provides a number of links to other sites.
The Text option will present space-delimited results.
The space-delimited output can be saved into an ASCII file, and is designed to be easily parsed for further computer analysis.
The Custom-Text option allows further choices of file format and annotation.
For genotype output, there is a choice of "prettybase", PHASE, or Haploview formats,
or download of a tarball containing all three. In the case of Haploview, two
files must be generated, one for the genotypes, the other for the marker information. In the marker information file, the first column is a SNP identification string,
and the second is the SNP position. In the case of database searches, the identification string is the rs ID. In the case of file input, the identification string is set to the position.
In the case of PGA or EGP searches, the identification string is normally the rs ID, but if the rs ID is unknown, it is set to "unknown" followed by the position.
Trialleles are included in "prettybase" or PHASE outputs, but are excluded in Haploview output (as Haploview does not allow trialleles). In that case, the least frequent
allele is determined, and the genotype for any individual having that allele is set to NN. In addition, in the Haploview genotype output, all X/X genotypes are replaced by N/N,
as Haploview interprets X as a third allele.
In the Haploview and PHASE cases, SNP alleles are the A, C, G, T bases, and indels are
1 for deletion, 2 for insertion. In the PHASE case, SNPs are listed by position rather than by rs ID, independent of the "Display SNPs By" setting,
as position is required by PHASE. If genotypes are submitted by file and Custom-Text is used for PHASE output, the SNP identifiers must be numerical base-pair positions,
and must represent the relative order and spacing of the SNPs.
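The Haploview adjustments described above (masking the least frequent allele of a triallelic variation and replacing X/X by N/N) can be illustrated with a short Python sketch. It is not the GVS code: the function name and data layout are hypothetical, ties for the least frequent allele are broken alphabetically as an assumption, and "N" and "X" are assumed not to count as alleles.

```python
from collections import Counter


def prepare_for_haploview(genotypes):
    """Adjust the genotypes of one variation for Haploview output.

    genotypes: dict mapping sample ID -> (allele1, allele2).
    """
    # Haploview would read X as a third allele, so X/X becomes N/N.
    cleaned = {
        sample: (("N", "N") if pair == ("X", "X") else pair)
        for sample, pair in genotypes.items()
    }
    # Count real alleles only (assumption: "N" and "X" are not alleles).
    counts = Counter(
        allele
        for pair in cleaned.values()
        for allele in pair
        if allele not in ("N", "X")
    )
    if len(counts) > 2:
        # Triallelic: mask every genotype carrying the rarest allele.
        rarest = min(counts, key=lambda allele: (counts[allele], allele))
        cleaned = {
            sample: (("N", "N") if rarest in pair else pair)
            for sample, pair in cleaned.items()
        }
    return cleaned


observed = {
    "s1": ("A", "A"),
    "s2": ("A", "G"),
    "s3": ("G", "T"),  # T is the least frequent allele
    "s4": ("X", "X"),
    "s5": ("G", "G"),
}
adjusted = prepare_for_haploview(observed)
```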
For tag SNP output, there is the choice of one bin per line or one SNP per line; for the latter, there is a choice of SNP annotation.
For linkage disequilibrium, the choices are one pair per line or a 2-dimensional matrix (symmetrical, but both halves output).
For this LD output, there is a third choice if the search type (on the home page) is rs ID,
and the search region has been expanded upstream and/or downstream. The output then lists the LD only for pairs of SNPs where one member of the pair is the search SNP.
By using the optional r2 cutoff, it is then possible to ask what nearby SNPs are in high linkage disequilibrium with a given SNP.
For SNP summary, the output format is the
same as that of the text display, but with a choice of annotation columns.
Allele Frequency Cutoff (%): cutoff for filtering variations by minor-allele frequency (in percent, range 0 through 50)
No Monomorphic Sites: if turned on, all monomorphic sites will be filtered from the output and analysis
No HapMap 3: if turned on, some HapMap 3 data is suppressed (for database searches only, available after "Show More Parameters" clicked)
For a given SNP, the minor-allele frequency is calculated and rounded to the nearest integer. When the integer is greater than or equal to the Allele Frequency Cutoff, the SNP is retained.
The actual frequency cutoff is thus 0.5% below the Allele Frequency Cutoff set in the form. For example, setting the cutoff to 5% results in an actual cutoff of 4.5%.
For genotypes, linkage disequilibrium, and SNP summary, if there are multiple population groups, the frequency and no-monomorphic filters are applied to the merged set of genotypes.
For tag SNPs, if there are multiple population groups, these filters are applied to the set of genotypes for each population
group (African, European, Asian, Amerindian, Hispanic, or Unknown). Tag SNPs are then selected for each population group and submitted to the
MultiPop-TagSelect algorithm. In the rare case that only one population group has genotypes left after the filtering, the calculation reverts to the single-group case, and the
MultiPop-TagSelect algorithm is not invoked.
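The rounding behavior of the Allele Frequency Cutoff described above can be checked with a small Python sketch. The function name is hypothetical, and the sketch assumes halves round up, which is what the 4.5% example implies. Python's built-in round() rounds halves to the nearest even integer and would turn 4.5 into 4, so the sketch uses floor(x + 0.5) instead.

```python
import math


def passes_frequency_cutoff(minor_allele_frequency_percent, cutoff_percent):
    """Return True if a SNP is retained under the rounding rule.

    The frequency (in percent) is rounded to the nearest integer, halves
    rounding up (assumption), and compared with the cutoff set in the form.
    """
    rounded = math.floor(minor_allele_frequency_percent + 0.5)
    return rounded >= cutoff_percent


kept_at_threshold = passes_frequency_cutoff(4.5, 5)     # rounds to 5, retained
dropped_just_below = passes_frequency_cutoff(4.49, 5)   # rounds to 4, filtered
```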
The HapMap-3-suppression checkbox is relevant only if one or more of the original 4 HapMap populations have been checked in the "Select Data Set" table: CEU, HCB, JPT, YRI. In these 4
cases, the HapMap 3 data has been assigned the same population ID in the GVS database, and will show up merged with the earlier HapMap data, if not suppressed. For the other 8 HapMap 3 populations, data can
be suppressed by not selecting the checkboxes in the "Select Data Set" table.
Cluster SNPs: if turned on, variations will be clustered based on the similarities of their genotype patterns (case of "display genotypes"), or their r2 LD (Linkage Disequilibrium) values (other output cases), in the graphical displays
Cluster Samples: if turned on, samples will be clustered based on the similarities of their genotype patterns in the graphical displays
Data Coverage (%) for Tag SNPs: minimal data coverage in percent for a variation to be considered as a potential tag SNP (range 0 through 100)
Data Coverage (%) for Clustering: minimal data coverage in percent for a variation to be clustered potentially with other variations (range 0 through 100)
r2 Threshold: minimal value for variations to belong to the same cluster (range 0.0 through 1.0)
In the LDSelect algorithm of Carlson et al. (see reference below), the r2 values are calculated for each pair of SNPs.
When the coverage thresholds are both 0, each SNP is then
examined, and the number of other SNPs exceeding the r2 threshold is counted. The SNP with the greatest number of other SNPs with r2 above threshold is identified.
That SNP and the others are put into a bin. This bin of SNPs is removed from the pool, and the process is repeated for the remaining SNPs, giving a second bin. Once all the SNPs are binned the algorithm ends.
When there are finite coverage thresholds, each SNP having coverage above the tag-SNP-coverage threshold is examined, and the number of other SNPs exceeding both the r2 threshold
and the cluster-coverage threshold is counted. The SNP with the greatest number above these thresholds is identified. That SNP and the others are put into a bin. This bin is removed from the pool and
the process is repeated until only low-coverage SNPs remain. These low-coverage SNPs are put into separate bins.
Following the binning, the SNPs in each bin are divided into "tag SNPs" and "other SNPs". Any SNPs for which the r2 value with any other SNP in the bin is below the r2 threshold
will be labeled an "other SNP". In addition, if the coverage for the SNP is below the tag-SNP-coverage threshold, it will be labeled an "other SNP".
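The binning and labeling steps described above can be sketched as a greedy loop in Python. This is an illustration of the description on this page, not the LDSelect or GVS source code. The function name and data layout are hypothetical, values equal to a threshold are treated as passing (an assumption, since the text only says "exceeding" and "below"), and ties for the SNP with the most partners go to the first SNP in input order.

```python
def ld_select(snps, r2, coverage, r2_threshold,
              tag_coverage=0.0, cluster_coverage=0.0):
    """Greedy binning sketch.

    snps: list of SNP IDs
    r2: dict mapping frozenset({a, b}) -> r2 value (missing pairs count as 0)
    coverage: dict mapping SNP ID -> data coverage in percent
    Returns a list of (tag_snps, other_snps) pairs, one per bin.
    """
    def pair_r2(a, b):
        return r2.get(frozenset((a, b)), 0.0)

    def partners(snp, pool):
        return [other for other in pool
                if other != snp
                and coverage[other] >= cluster_coverage
                and pair_r2(snp, other) >= r2_threshold]

    pool = list(snps)
    bins = []
    while True:
        candidates = [s for s in pool if coverage[s] >= tag_coverage]
        if not candidates:
            break
        # The SNP with the most partners above threshold seeds the next bin.
        best = max(candidates, key=lambda s: len(partners(s, pool)))
        members = [best] + partners(best, pool)
        bins.append(members)
        pool = [s for s in pool if s not in members]
    # Only low-coverage SNPs remain; each gets its own bin.
    bins.extend([s] for s in pool)

    labeled = []
    for members in bins:
        tags, others = [], []
        for snp in members:
            linked_to_all = all(pair_r2(snp, other) >= r2_threshold
                                for other in members if other != snp)
            if linked_to_all and coverage[snp] >= tag_coverage:
                tags.append(snp)
            else:
                others.append(snp)
        labeled.append((tags, others))
    return labeled


# Hypothetical data: a is linked to b and c, but b and c are not linked.
example_r2 = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.85,
    frozenset(("b", "c")): 0.5,
}
full_coverage = {"a": 100, "b": 100, "c": 100, "d": 100}
bins_default = ld_select(["a", "b", "c", "d"], example_r2, full_coverage, 0.8)
```

In this example a, b, and c fall into one bin with a as the only tag SNP, and d forms a bin of its own.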
(available after "Show More Parameters" clicked)
LD Minimum: the r2 value corresponding to the lower boundary of the color spectrum for LD plots (range 0.0 through 1.0)
LD Maximum: the r2 value corresponding to the upper boundary of the color spectrum for LD plots (range 0.0 through 1.0)
Color Scheme: choice of colors, grayscale, or black-and-white
Once the data sets are chosen and the parameters are set, you have a choice of 4 buttons to click (they can be clicked consecutively without re-starting the search).
The first is "display genotypes" for listing the genotypes for all samples and all variations in the data set. A visual genotype graph (if Table/Image) can be chosen to show color-coded genotypes.
If HapMap phased genotypes are available for the population(s) selected (for display Table/Image or Custom-Text), the button label will change to "observed/phased genotypes";
in the case of Table/Image, the window prompting for a choice of table or graphical format will then include links for phased genotypes as well as the usual (observed) genotypes; in the case of
Custom-Text, there will be an additional radio button with which to choose a text output of the phased genotypes. Note that in this last case, most of the radio buttons choose a format
for (unphased, observed) genotypes, but that the "HapMap Phased Genotypes" choice yields a different set of results.
The second button is "display tag snps", which shows variations binned by similar values of the r2 LD value (see references below), plus a visual genotype graph (if Table/Image) with the binned
variations clustered. This data is useful for the development of a minimal set of SNPs that can be used for large-scale genotyping of similar sample populations (by selecting
one variation from each bin).
The "Tag SNPs" are those for which the pairwise-r2 values between the SNP and every other SNP in the bin are greater than the "r2 Threshold" parameter chosen.
The "Other SNPs" are those for which the pairwise-r2 value between the SNP and at least one other SNP in the bin is less than the r2 threshold. (It is preferable
to choose a SNP from "Tag SNPs" rather than "Other SNPs" to represent the bin, if no other constraints exist.)
If you have chosen multiple populations having individuals from different population groups, and if you have chosen merge option B or C, GVS will
automatically run the MultiPop-TagSelect algorithm, and there will be additional sections in the displayed results.
On the table page, SNPs with coverage below the "data coverage for Tag SNPs" cutoff are displayed with brackets around the SNP ID
(in order to identify those possibly placed into separate bins just because of low coverage). In the case of a multipop section, brackets
indicate that at least one population has low coverage. Indels are included in the tables, though they may not be
suitable as genotyping candidates.
The third button "display linkage disequilibrium" gives the values of r2 for every pair of variations, plus a visual genotype graph (if Table/Image) with the
addition of a graph showing pairwise LD, color-coded according to the degree of correlation. If the search type (on the home page) is rs ID,
and the search region has been expanded upstream and/or downstream, and "Display SNPs By" is set to
Custom-Text, there is an additional option available. This text option lists the LD only for pairs of SNPs where one member of the pair is the search SNP, with choice of annotation
and an r2 cutoff. It is thus possible to ask what nearby SNPs are in high linkage disequilibrium with a given SNP (see the FAQ page or the build
notes for version 3.0.8 for the procedure).
There must be genotypes for the search SNP in the population(s) chosen.
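If you have saved the full pairwise LD output instead, the same question (which SNPs are in high LD with a given SNP) can be answered with a simple filter. The sketch below assumes the pairs have already been read into (SNP, SNP, r2) tuples; the function name, the rs IDs, and the sorting by descending r2 are illustrative choices, not part of the GVS output.

```python
def ld_with_search_snp(pairs, search_snp, r2_cutoff=0.0):
    """Keep only pairs involving search_snp with r2 at or above the cutoff.

    pairs: iterable of (snp_a, snp_b, r2) tuples.
    Returns a list of (partner_snp, r2), highest r2 first.
    """
    hits = []
    for snp_a, snp_b, r2 in pairs:
        if snp_a == snp_b or search_snp not in (snp_a, snp_b):
            continue
        if r2 < r2_cutoff:
            continue
        partner = snp_b if snp_a == search_snp else snp_a
        hits.append((partner, r2))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical pairwise values.
pairwise = [
    ("rs1", "rs2", 0.95),
    ("rs1", "rs3", 0.40),
    ("rs2", "rs3", 0.99),  # does not involve the search SNP
]
linked = ld_with_search_snp(pairwise, "rs1", r2_cutoff=0.8)
```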
The fourth button "display snp summary" presents a large number of calculated values and annotations for the variations and (for database queries) a map of the chromosome region.
The GVS page "SNP Summary Columns" details the quantities displayed.
If "Text" or "Custom-Text" has been chosen, it is possible from some browsers to save the output as a text file. If your browser does not have a save-as-text option (e.g. Mac Safari), you will
have to copy and paste. The fields will
be space-delimited. If you import the saved file to Excel, it will be necessary to choose "Data/Get External Data/Import Text File" and select "Delimited" and "Space". If the output
has columns that are comma-separated numbers, it will be necessary to force Excel to treat those columns as text.
Maps showing gene and variation locations are available at several locations on this site.
How we calculate linkage disequilibrium on this GVS server
For a discussion of linkage disequilibrium and tag SNPs see
Carlson CS, Eberle MA, Rieder MJ, Yi Q, Kruglyak L, Nickerson DA. Selecting a Maximally Informative Set of
Single-Nucleotide Polymorphisms for Association Analyses Using Linkage Disequilibrium, Am J Hum Genet (2004) 74:106-120.
For the MultiPop-TagSelect algorithm see
Howie BN, Carlson CS, Rieder MJ, Nickerson DA. Efficient Selection of Tagging Single-Nucleotide Polymorphisms in Multiple Populations, Hum Genet (2006) 120:58-68.