GVS: Genome Variation Server
An NHLBI Program for Genomic Applications  

Input Data Files
The following files were used to populate the database supporting GVS.
1a. dbSNP genotypes
ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/genotype (June 2008)
NOTE: There are about ten times more conflicting genotypes in dbSNP build 128 than in build 127, although the total number of genotypes is not much different. The majority of the conflicts arise from data submitted in January 2007 with a submitter ID beginning with PGP. The genotypes in this data set represent about 10% of the total number of genotypes, and about 4% of these PGP genotypes conflict with other genotypes. Because the PGP data contain internal inconsistencies, we have removed some of these genotypes from our build 128 and 129 databases. About 10,000 of the PGP SNPs have genotypes that are heterozygous for every HapMap individual. These all-heterozygous genotypes were removed, as were all genotypes for any SNP with more than 14 inconsistent genotypes.
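The two removal rules in the note above can be sketched roughly as follows. This is only an illustration, not the code used to build GVS: the function names, the two-letter genotype strings (e.g. "AG"), and the dictionary layout are assumptions made for the example, and how the count of inconsistent genotypes is obtained is left out.

```python
# Illustrative sketch only; names and data layout are hypothetical.
MAX_INCONSISTENT = 14  # threshold stated in the note above


def is_all_heterozygous(genotypes):
    """Return True if every individual's genotype is heterozygous.

    `genotypes` maps an individual ID to a two-letter genotype string
    such as "AG" (an assumed representation).
    """
    return bool(genotypes) and all(
        len(g) == 2 and g[0] != g[1] for g in genotypes.values()
    )


def keep_pgp_snp(genotypes, inconsistent_count):
    """Apply both removal rules to the PGP genotypes of one SNP."""
    if is_all_heterozygous(genotypes):
        return False  # heterozygous for every individual: removed
    if inconsistent_count > MAX_INCONSISTENT:
        return False  # more than 14 inconsistent genotypes: removed
    return True


# Made-up genotypes for demonstration.
print(keep_pgp_snp({"NA10837": "AG", "NA10838": "GA"}, 0))
print(keep_pgp_snp({"NA10837": "AA", "NA10838": "AG"}, 3))
```

The first call reports a SNP that would be dropped under the all-heterozygous rule; the second reports one that passes both checks.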
1b. HapMap 3 genotypes
http://ftp.hapmap.org/genotypes/2008-07_phaseIII/hapmap_format/forward/ (draft release 1, August 2008)
The HapMap 3 genotypes for the CEU, CHB, JPT, and YRI populations were entered into the GVS database with the previous dbSNP population IDs. Individuals who also appear in phases I and II kept their existing individual IDs. The 7 new populations were assigned these population IDs: 1001401 for ASW, 1001402 for CHD, 1001403 for GIH, 1001404 for LWK, 1001405 for MEX, 1001406 for MKK, and 1001407 for TSI. For all new individuals, the ID was set to the numerical part of the ID in the HapMap file + 1,000,000 (e.g. NA10837 was given an individual ID of 1010837). Once these genotypes are available from dbSNP, the population and individual IDs will be updated.
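The provisional ID scheme described above can be written out as a short sketch. The population IDs and the + 1,000,000 offset come from the text; the function name, the dictionary name, and the assumption that a HapMap ID is a letter prefix followed by digits are illustrative only.

```python
import re

# Provisional population IDs for the 7 new HapMap 3 populations (from the text).
NEW_POPULATION_IDS = {
    "ASW": 1001401,
    "CHD": 1001402,
    "GIH": 1001403,
    "LWK": 1001404,
    "MEX": 1001405,
    "MKK": 1001406,
    "TSI": 1001407,
}

INDIVIDUAL_ID_OFFSET = 1_000_000


def provisional_individual_id(hapmap_id):
    """Numeric part of the HapMap ID plus 1,000,000.

    Assumes IDs look like a letter prefix followed by digits (e.g. "NA10837").
    """
    match = re.fullmatch(r"[A-Za-z]+(\d+)", hapmap_id)
    if match is None:
        raise ValueError(f"unexpected HapMap ID format: {hapmap_id!r}")
    return int(match.group(1)) + INDIVIDUAL_ID_OFFSET


print(provisional_individual_id("NA10837"))  # 1010837, matching the example in the text
print(NEW_POPULATION_IDS["MKK"])             # 1001406
```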
2. dbSNP annotations
ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/XML (June 2008, mapping, gene function information, population definitions, submitter information)
3. NCBI gene files and synonyms
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq (July 2008, genes and coding regions)
ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/mapview/seq_gene.md (downloaded May 2008, exons)
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info (July 2008, unofficial names for the gene)
4. UCSC conservation scores
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/phastCons17way/ (April 6, 2006)
5. repeats
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/bigZips/ (Feb. 3, 2006)
   chromOut.zip: RepeatMasker
   chromTrf.zip: Tandem Repeats Finder
6. SNPs on chips
files from Affymetrix Inc. and Illumina Inc. (July, 2008) and from Applied Biosystems (April 2006)
7. UCSC sequence files
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/chromosomes/ (March 2006)
8. UCSC chimp alleles
http://hgdownload.cse.ucsc.edu/goldenPath/hg18/vsPanTro2/axtNet/ (July 15, 2006)
9. copy number variation
http://projects.tcag.ca/variation/tableview.asp?table=DGV_Content_Summary.txt, files variation.hg18.v5.txt and indel.hg18.v5.txt (July, 2008)
10. phased genotypes
http://www.hapmap.org/downloads/phasing/2006-07_phaseII/phased/ (July 2006)
The GVS phased genotypes (autosomes only) are those generated by the HapMap group, who ran the PHASE software for HapMap release #21. This "phased" data set (rather than the "all" data set) includes only sites that segregate in the selected population, that is, sites whose genotypes are not monomorphic across all individuals in that population. If a SNP isn't found in a particular population, it isn't included in the phased genotypes for that population. When multiple populations are combined, such a SNP shows up as "NN" as long as at least one of the populations is not monomorphic there.
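One way to picture the "NN" behavior when populations are combined is the sketch below. It is not the GVS implementation: the nested-dictionary layout, the function name, the SNP IDs, and the individual IDs are all hypothetical, and it assumes that "NN" is reported for the individuals of any population whose phased set lacks the SNP.

```python
def combine_phased(phased_by_pop, individuals_by_pop):
    """Merge per-population phased genotypes into one table.

    phased_by_pop:      {population: {snp_id: {individual: genotype}}}
    individuals_by_pop: {population: [individual, ...]}

    A SNP appears in the result if at least one population includes it
    (i.e. it segregates there). Individuals from populations that lack
    the SNP are filled in with "NN".
    """
    all_snps = set()
    for table in phased_by_pop.values():
        all_snps.update(table)

    combined = {}
    for snp in sorted(all_snps):
        row = {}
        for pop, individuals in individuals_by_pop.items():
            calls = phased_by_pop.get(pop, {}).get(snp)
            for individual in individuals:
                row[individual] = "NN" if calls is None else calls.get(individual, "NN")
        combined[snp] = row
    return combined


# Made-up example: rsA segregates in CEU only, rsB in both populations.
phased = {
    "CEU": {"rsA": {"c1": "AG", "c2": "AA"}, "rsB": {"c1": "CT", "c2": "CC"}},
    "YRI": {"rsB": {"y1": "TT", "y2": "CT"}},
}
people = {"CEU": ["c1", "c2"], "YRI": ["y1", "y2"]}

for snp, row in combine_phased(phased, people).items():
    print(snp, row)
```

In this made-up input, rsA is absent from the YRI phased set, so the YRI individuals receive "NN" for it in the combined table, while a site absent from every population's phased set never appears at all.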
 