Allele frequencies and linkage disequilibrium patterns vary among population groups. The tag-SNP selection algorithm
used for GVS (that of the Carlson et al. paper)
is applicable only to a single population. If multiple populations
are selected, GVS will automatically divide the genotypes into population groups, perform the tag-SNP selection
for each group, and feed those tag SNPs to the MultiPop-TagSelect algorithm. (The "other SNPs" are not used.)
The output table or text then consists of three or more sections: first, a list of tag SNPs from the MultiPop-TagSelect algorithm
that covers all populations; then, the original binned tag SNPs for each population,
which served as the input to the MultiPop-TagSelect algorithm.
The algorithm attempts to select a near-minimal set of tag SNPs that account for all observed patterns
of linkage disequilibrium in multiple populations. Thus, in the multiple-population list of tag SNPs,
each bin for each population is represented.
The algorithm proceeds in two steps. First, all observed tag SNPs are assigned to mutually exclusive clusters, and one
"maximally informative" SNP (one that tags bins in the most populations) is chosen from each cluster. Second, the maximally
informative tag SNPs are assembled into a list and removed one at a time. If a SNP can be removed from the list without
causing any bin to lose representation, it is discarded; otherwise, it is returned to the list. The maximally informative
SNPs that cannot be discarded through this process form the final set of selected multiple-population SNPs.
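The two steps can be illustrated with a minimal Python sketch. It is not the GVS or multiPopTagSelect.pl code: the function names, the SNP names, the coverage data, and the cluster assignment are all hypothetical, and the clusters are taken as given input because the clustering rule itself is not described here.

```python
from typing import Dict, List, Set, Tuple

Bin = Tuple[str, int]  # (population name, bin number within that population)


def pick_maximally_informative(cluster: List[str],
                               coverage: Dict[str, Set[Bin]]) -> str:
    """Step 1: from one cluster, pick the SNP tagging bins in the most populations."""
    return max(cluster, key=lambda snp: len({pop for pop, _ in coverage[snp]}))


def prune(candidates: List[str], coverage: Dict[str, Set[Bin]]) -> List[str]:
    """Step 2: try removing each SNP in turn; keep it only if a bin would be lost."""
    required = set().union(*(coverage[s] for s in candidates))
    kept = list(candidates)
    for snp in candidates:
        trial = [s for s in kept if s != snp]
        still_covered = set().union(*(coverage[s] for s in trial))
        if required <= still_covered:
            kept = trial  # no bin lost representation, so the SNP is discarded
        # otherwise the SNP stays in the list (kept is left unchanged)
    return kept


# Hypothetical input: the (population, bin) pairs that each tag SNP represents.
coverage = {
    "snpA": {("African", 1), ("European", 1)},
    "snpB": {("African", 1)},
    "snpC": {("European", 1), ("Asian", 2)},
    "snpD": {("Asian", 2)},
    "snpE": {("European", 1)},
}
clusters = [["snpA", "snpB"], ["snpC", "snpD"], ["snpE"]]  # assumed, not computed

candidates = [pick_maximally_informative(c, coverage) for c in clusters]
selected = prune(candidates, coverage)
print(candidates)  # ['snpA', 'snpC', 'snpE']
print(selected)    # ['snpA', 'snpC']
```

In this toy input, snpE is discarded because European bin 1 is still represented by snpA and snpC, while snpA and snpC are each the only remaining representative of some bin.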
The MultiPop-TagSelect algorithm is described in detail in the following paper:
Howie BN, Carlson CS, Rieder MJ, Nickerson DA. Efficient selection of tagging single-nucleotide polymorphisms in multiple populations. Hum Genet (2006) 120: 58-68.
Each individual in the GVS database has been assigned to one of six population groups, based on dbSNP annotation:
African, European, Asian, Amerindian, Hispanic, or Unknown.
The original Perl program multiPopTagSelect.pl has
been translated to the Java programming language for use in GVS. For the GVS version, tag SNPs are ranked only according to whether
they are in a repeat region or not and according to SNP function
(in the order missense/nonsense/frameshift, splice-site, coding-synonymous or coding, mrna-utr, intron, intergenic).
SNPs in unique regions are always ranked higher than SNPs in repeat regions. (If there are
tag SNPs that are equivalent to each other in terms of how many bins are covered, the ranking is used to break the tie.
Sometimes there are still equivalents after ranking, and an arbitrary choice is made.) No SNPs are excluded or required.
Thus far, the option to seek a provably optimal solution is turned off. Note that frameshift variations are ranked highly
even though they are indels, which may not be suitable as genotyping candidates.
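The ranking described above can be sketched as a sort key in Python. This is only an illustration: the dictionary layout, the field names "in_repeat" and "function", and the exact function-class label strings are assumptions, and the GVS Java code may represent them differently.

```python
# Function classes in rank order, as listed in the text (0 = highest rank).
# The label strings are assumed for this sketch.
FUNCTION_RANK = {
    "missense": 0, "nonsense": 0, "frameshift": 0,
    "splice-site": 1,
    "coding-synonymous": 2, "coding": 2,
    "mrna-utr": 3,
    "intron": 4,
    "intergenic": 5,
}


def rank_key(snp: dict) -> tuple:
    """Smaller tuples rank higher: unique region first, then function class."""
    return (snp["in_repeat"], FUNCTION_RANK[snp["function"]])


def break_tie(equivalent_snps: list) -> dict:
    """Pick the best-ranked SNP among SNPs that cover equally many bins.

    If ranks still tie, min() returns the first one it encounters, which
    mirrors the arbitrary choice mentioned in the text.
    """
    return min(equivalent_snps, key=rank_key)


# Hypothetical SNPs that are equivalent in bin coverage.
equivalent = [
    {"id": "snp1", "in_repeat": True, "function": "missense"},
    {"id": "snp2", "in_repeat": False, "function": "intron"},
    {"id": "snp3", "in_repeat": False, "function": "splice-site"},
]
best = break_tie(equivalent)
ranked_ids = [s["id"] for s in sorted(equivalent, key=rank_key)]
print(best["id"])  # snp3
print(ranked_ids)  # ['snp3', 'snp2', 'snp1']
```

The missense SNP ranks last here because it lies in a repeat region, and repeat status takes precedence over function.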
If your search includes individuals of population group "Unknown", they will be grouped together and treated as though
they were a real population group. The results may be invalid, and a warning message will be posted in the result window (in the "Table" display only, not in the "Text" display).
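As a rough sketch of how individuals might be divided into population groups before per-population tag-SNP selection, the Python below groups (individual, population) pairs and flags the "Unknown" case. The function name, the input layout, and the wording of the warning are hypothetical; only the six group names come from the text.

```python
from collections import defaultdict

POPULATION_GROUPS = ("African", "European", "Asian",
                     "Amerindian", "Hispanic", "Unknown")


def group_by_population(individuals):
    """Split (individual_id, population) pairs into per-population lists."""
    groups = defaultdict(list)
    for individual_id, population in individuals:
        if population not in POPULATION_GROUPS:
            raise ValueError(f"unrecognized population group: {population}")
        groups[population].append(individual_id)

    warnings = []
    if "Unknown" in groups:
        # "Unknown" individuals are pooled and handled like any other group.
        warnings.append('Individuals of population group "Unknown" were '
                        'treated as one population; results may be invalid.')
    return dict(groups), warnings


# Hypothetical search result.
individuals = [("ind1", "African"), ("ind2", "European"),
               ("ind3", "Unknown"), ("ind4", "African")]
groups, warnings = group_by_population(individuals)
print(groups)
print(warnings)
```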
The results are displayed in the first table on the result page. The first column is the SNP location or identifier, the
second is the SNP function, and the third indicates whether the SNP is in a repeat region. Next comes one column for each
population, listing the bin that the SNP represents in that population (bins are numbered as in the separate-population tables that follow).
In the "Table" display, the last column displays the rs ID if SNP_position output was selected, or the chromosome position if RS_ID was chosen.
In the "Text" display there are two additional columns showing the flanking sequences. The remaining tables, one for
each population class, show the single-population binned tag SNPs.
Below the single chosen tag SNP in the first table, any tag SNPs that are equivalent in bin coverage are listed. These are
enclosed in parentheses. They may be chosen if for some reason the first tag SNP cannot be used.
These alternates may or may not have the same function and repeat status as the first SNP listed.
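One way to picture the alternates is the short Python sketch below. It assumes that "equivalent in bin coverage" means representing exactly the same set of (population, bin) pairs; that reading, the function names, and the data are assumptions made for illustration.

```python
def alternates(chosen, coverage):
    """Return the other tag SNPs whose bin coverage equals that of `chosen`."""
    return sorted(snp for snp in coverage
                  if snp != chosen and coverage[snp] == coverage[chosen])


def format_entry(chosen, coverage):
    """List the chosen SNP first, then each alternate in parentheses."""
    lines = [chosen] + [f"({snp})" for snp in alternates(chosen, coverage)]
    return "\n".join(lines)


# Hypothetical coverage: snpF represents the same bins as snpA, snpB does not.
coverage = {
    "snpA": {("African", 1), ("European", 1)},
    "snpF": {("African", 1), ("European", 1)},
    "snpB": {("African", 1)},
}
entry = format_entry("snpA", coverage)
print(entry)
```

Here the entry for snpA is followed by "(snpF)" on the next line, while snpB is not listed because it represents fewer bins.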
Because the algorithm sometimes makes an arbitrary choice between equivalent SNPs, the final result can depend
on the order in which the SNPs are processed. In the current version of GVS, the Output-SNPs-By choice of SNP_position or RS_ID
can thus affect the results, though the two outcomes should be equivalent in terms of association-study power.
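The order dependence can be reproduced with a small Python sketch of a one-at-a-time removal pass. The function and the data are hypothetical and are not taken from GVS; they only show that two processing orders can yield different, equally small sets that represent the same bins.

```python
def prune(candidates, coverage):
    """Discard each SNP, in the given order, if no bin loses representation."""
    required = set().union(*(coverage[s] for s in candidates))
    kept = list(candidates)
    for snp in candidates:
        trial = [s for s in kept if s != snp]
        if required <= set().union(*(coverage[s] for s in trial)):
            kept = trial
    return kept


# Hypothetical coverage in which any two of the three SNPs represent all bins.
coverage = {
    "snpA": {("African", 1), ("European", 1)},
    "snpB": {("European", 1), ("Asian", 1)},
    "snpC": {("African", 1), ("Asian", 1)},
}

first_order = prune(["snpA", "snpB", "snpC"], coverage)
second_order = prune(["snpC", "snpB", "snpA"], coverage)
print(first_order)   # ['snpB', 'snpC']
print(second_order)  # ['snpB', 'snpA']
```

Both results contain two SNPs and represent every bin, but they are not the same two SNPs.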