Batch Genome Variation Server 144

How to Use Batch GVS
Batch GVS allows you to submit a text file with a list of genes, hg38 chromosome regions, or rs IDs, and later download a file of genotypes, variant summary information, r2 values, tag SNPs, or haplotypes (for individuals in all populations available, or for individuals and/or populations you specify). Once the search is complete, an e-mail will be sent to you with a link for downloading a file. If a single file is generated, it will be gzip-compressed (use gunzip on a Linux machine). If multiple files are generated, a tarball will be available for download (use tar xvzf on a Linux machine).
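
Where gunzip and tar are not at hand, the same unpacking can be done with Python's standard library. The following is a minimal sketch, assuming the download has already been saved locally; the function name unpack_download and the file names are placeholders, and the archive formats are inferred from the gunzip and tar xvzf hints above.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def unpack_download(path, dest_dir="."):
    """Unpack a saved download and return the list of extracted paths.

    Assumption: a tarball is a gzip-compressed tar archive and a single
    file is plain gzip, as the gunzip / tar xvzf hints suggest.
    """
    path = Path(path)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    if tarfile.is_tarfile(path):
        with tarfile.open(path, "r:gz") as tar:
            for member in tar.getmembers():
                if not member.isfile():
                    continue
                # Keep only the base name so nothing is written outside dest.
                target = dest / Path(member.name).name
                with tar.extractfile(member) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                extracted.append(target)
        return extracted
    # Single gzip file: drop a trailing ".gz" to form the output name.
    name = path.name[:-3] if path.name.endswith(".gz") else path.name + ".out"
    target = dest / name
    with gzip.open(path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return [target]
```

Calling unpack_download on the saved file returns the paths of the unpacked data files, whichever of the two forms was sent.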

See the links at the bottom of this page for more information.

The input file must be plain-text. If, for example, you create the file in Microsoft Word, it must be saved as text-only. If you encounter a message that the file is of type application/octet-stream, try adding ".txt" to the end of the file name.

The genotype output file is in a format similar to that of a "prettybase" file. Following a possible header line beginning with #, there is one line for each genotype. The columns are chromosome location (NCBI 38 or hg38), population:individual, first allele, second allele, rs ID, chromosome number, and region.
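
As an illustration of the column layout just described, here is a small reader for the genotype file. It is a sketch under stated assumptions: the seven columns are tab-separated in the order listed, every line starting with # is a header or echoed-parameter line, and the names Genotype and parse_genotypes are made up for this example.

```python
from typing import Iterable, Iterator, NamedTuple


class Genotype(NamedTuple):
    location: str      # chromosome location, kept as text
    population: str    # part before the colon in population:individual
    individual: str    # part after the colon
    allele1: str
    allele2: str
    rs_id: str
    chromosome: str
    region: str


def parse_genotypes(lines: Iterable[str]) -> Iterator[Genotype]:
    """Yield one Genotype per data line, skipping blank and # lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != 7:
            raise ValueError(f"expected 7 tab-separated columns: {line!r}")
        location, pop_ind, a1, a2, rs_id, chrom, region = fields
        population, _, individual = pop_ind.partition(":")
        yield Genotype(location, population, individual,
                       a1, a2, rs_id, chrom, region)
```

Passing an open file object (for example, open("genotypes.txt")) to parse_genotypes gives one record per genotype line.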

The variant summary file has several header lines, then one line for each variant, with many annotation columns (see the summary page).

The r2 file has a column header line, then one line for each variation pair, with several annotation columns. The columns are first-variation chromosome location, second-variation chromosome location, r2, first-variation rs ID, second-variation rs ID, chromosome number, and region. The calculation of r2 is described here.
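
A common use of the r2 file is to list the variations in strong linkage disequilibrium with one variation of interest. The sketch below assumes tab-separated columns in the order given above and treats any line whose third column is not a number as the column header; the function name ld_partners is hypothetical, and rs IDs are compared with or without the "rs" prefix because the exact form used in the file is not specified here.

```python
def _norm(rs_id):
    """Return the rs ID with a lower-case 'rs' prefix."""
    rs_id = rs_id.strip()
    return rs_id if rs_id.startswith("rs") else "rs" + rs_id


def ld_partners(lines, rs_id, min_r2=0.8):
    """Return (other rs ID, r2) pairs involving rs_id with r2 >= min_r2,
    sorted from highest to lowest r2."""
    wanted = _norm(rs_id)
    partners = []
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 5:
            continue
        try:
            r2 = float(fields[2])
        except ValueError:
            continue  # assumed to be the column header line
        first, second = _norm(fields[3]), _norm(fields[4])
        if r2 >= min_r2 and wanted in (first, second):
            partners.append((second if first == wanted else first, r2))
    return sorted(partners, key=lambda pair: -pair[1])
```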

The tag-SNP file has several header lines, then one or more lines for each bin. For the one-line-per-bin format, the columns are bin number, number of variations in the bin, percent average minor allele frequency, list of tagSNPs, and list of other SNPs. Variations are binned by similar values of the r2 linkage disequilibrium value (see references below). This data is useful for the development of a minimal set of variations that can be used for large-scale genotyping of similar sample populations (by selecting one variation from each bin). The "tag SNPs" are those for which the pairwise-r2 values between the variation and any other variation in the bin are greater than the "r2Threshold" parameter chosen (see below). The "other SNPs" are those for which the pairwise-r2 value between the variation and at least one other variation in the bin is less than the r2 threshold. (It is preferable to choose a variation from "tag SNPs" rather than "other SNPs" to represent the bin, if no other constraints exist.) If there are individuals from more than one population class, the MultiPop-TagSelect Algorithm may be requested, and there are then additional sections in the output files.

In the case of fastPHASE haplotype calculations (see reference), there are two options. The first is to download only the phased-genotype calls. The second is to download a tarball with all the files created by fastPHASE, as well as the genotype files created by GVS Batch for input to fastPHASE.

Columns in the output files are white-space separated (a tab in the case of genotypes and r2, and a space in the case of tagSNPs and SNP summary).

File Input: Quick Start
Make a file with a list of genes (one per line) and submit it. The downloaded file will list genotypes. Example file content:
	actb
	alad
	
File Input: the Details
The input for a database search is a file with one or more regions (one per line), and optionally some lines to customize the search or the calculations. The lines can be submitted in any order.

Blank lines are ignored, as are lines beginning with "##" (so that comments can be written after two # characters).

Lines beginning with a "#" indicate optional parameters that can be set. There should be only one line for each parameter, with the exception of individual and population, for which there can be many lines. The line should have a # as the first character, followed by optional whitespace, followed by one of the parameter names, then whitespace (required), then the parameter value. (Any content beyond the parameter value is ignored.) The case-sensitive parameter set is

searchType (default geneName)
displayType (default genotypes)
numberFiles (default single)
fileName (default: no addition to filename)
headerLine (default: no header line)
writeParametersToOutputFile (default false)
r2DisplayCutoff (default 0.0, used only when displayType is r2 or r2LD)
annotation (default: no additional annotation, used only when displayType is snpSummary, tagSNPs, or r2LD)
individual (default: all individuals)
population (default: all populations)
merge (default C)
expandUpstream (default 0, used only when searchType is geneName or geneID or rs)
expandDownstream (default 0, used only when searchType is geneName or geneID or rs)
keepRelatedHapMap (default false)
freqCutoff (default 0)
noMonomorphic (default true)
includeHapMap3 (default true)
r2Threshold (default 0.8, used only when displayType is tagSNPs)
coverageTagSNPs (default 14, used only when displayType is tagSNPs)
coverageClustering (default 12, used only when displayType is tagSNPs)
multipop (default true, used only when displayType is tagSNPs)
tagSNPsFormat (default oneLinePerBin, used only when displayType is tagSNPs)
bracketLowCoverageSNPs (default false, used only when displayType is tagSNPs)
fastPHASERandomStarts (default 20, used only when displayType is fastPHASE)
returnTarballWithAllFastPHASEFiles (default false, used only when displayType is fastPHASE)
fastPHASEUseClockSeed (default false, used only when displayType is fastPHASE)
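
The line rules above (blank lines, ## comments, # parameter lines, region lines) can be checked locally before a file is submitted. The following is a rough sketch, not the server's parser: parse_input is a hypothetical name, leading whitespace is tolerated for convenience, and headerLine is assumed to keep the whole rest of its line because the examples on this page use multi-word header text.

```python
MULTI_VALUED = {"individual", "population"}  # may appear on many lines
FREE_TEXT = {"headerLine"}                   # assumption: keeps the full line


def parse_input(text):
    """Split input-file text into (params dict, list of region strings)."""
    params, regions = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("##"):
            continue                      # blank line or comment
        if line.startswith("#"):
            parts = line[1:].split(None, 1)
            if len(parts) < 2:
                continue                  # name without a value: skipped here
            name, rest = parts
            value = rest if name in FREE_TEXT else rest.split()[0]
            if name in MULTI_VALUED:
                params.setdefault(name, []).append(value)
            else:
                params[name] = value
            continue
        regions.append(line.split()[0])   # content after whitespace is ignored
    return params, regions
```

Parameters missing from the returned dictionary take the defaults listed above.
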

The searchType parameter can be one of 4 values: geneName, geneID, chromosome, or rs.

The displayType parameter can be one of 6 values: genotypes, snpSummary, r2, r2LD, tagSNPs, or fastPHASE. The r2 type is used to get r2 values for all variation pairs in the region. The r2LD type can only be used if the searchType is rs. This option is designed to display variations in linkage disequilibrium with a given variation. All pairs in the output file have the input rs ID as one element of the pair. If tagSNPs is chosen, listing individuals is not allowed. For displayType fastPHASE, the fastPHASE program is run (on our server), and haplotypes are constructed for the genotypes in the query.

The numberFiles parameter can be single or multiple. If single, all data for the input file is put into a single file; if multiple, a separate data file is generated for each line in the input file.

The fileName parameter is a string of characters appropriate to a file name, with no intervening whitespace. In the absence of such a line, the filename will be GVSBatch144 followed by a unique identifier. When present, the value will be inserted in the file name between GVSBatch144 and the identifier. This parameter is solely for your use in keeping track of files. (The parameter does not need to be unique within a series of queries, as uniqueness is maintained by the identifier.)

The headerLine parameter is for putting an information line at the beginning of each of the result files.

The writeParametersToOutputFile parameter is for putting input parameters at the beginning of each of the result files. If this parameter is set to true (rather than omitted or set to false), a line in the output file is added for each parameter line in the input file: # queryInput parameterName value.

The r2DisplayCutoff (displayTypes r2 or r2LD only) sets a threshold, such that only variation pairs with r2 equal to or greater than this value are written to the output file. The range is 0.0 through 1.0.

The annotation parameter (snpSummary, tagSNPs, or r2LD displayType only) requests variation annotation columns. For the tagSNPs case, the annotation is added only to the single-population sections, and only if the tagSNPsFormat parameter is set to oneLinePerSNP. For the r2LD case, the annotation is that of the second variation (the one that is not the input rs ID). The value of the parameter is a comma-separated list of column names (with no whitespace in the list). The order of the columns in the output file will be the same as the order in the list. The available column names are the same as those used on the interactive GVS site (see this page).

  Alleles
  MinorAllele
  PercentAlleleFrequency
  Heterozygosity
  Chi-Square
  Genes
  Function
  FunctionGVS
  ConservationScoreGERP
  SubmitterIDs
  ChimpAllele
  GenotypingChipIDs
  RepeatMasker
  TandemRepeatsFinder
  UpstreamFlank
  DownstreamFlank
  NumberAlleles
  NumberMajorAlleles
  NumberMinorAlleles

The individual parameter sets a dbSNP numerical individual ID. Such a line is optional. If there is at least one individual line, genotypes are returned only for the individuals listed in all the "# individual" lines. If the ID is not a number, the line is ignored. These lines are not allowed (thus far) if the displayType is tagSNPs. To find individual IDs see this list of individuals and their IDs in our database (or see this link for HapMap only).

The population parameter sets a dbSNP numerical population ID. Such a line is optional. If there is at least one population line, genotypes are returned only for the populations listed in all the "# population" lines. If the ID is not a number, the line is ignored. To find population IDs see this list of populations and their IDs in our database (or see this link for HapMap only).

If there are both population and individual lines (allowed if displayType not tagSNPs), the search is restricted to genotypes for which the listed individuals belong to one of the listed populations.

The merge parameter affects how genotypes are combined when there is more than one population requested. The choices are A, B, and C. See this link for details. As the default is C, and B and C are the same as A for a single population, it's only necessary to specify this parameter if there are multiple populations, and merge B is desired.
  A - common individuals with combined variations: analyze genotypes for the individuals common to all populations and for combined variations from all populations.
  B - combined individuals with common variations: analyze genotypes for the variations common to all populations and for combined individuals from all populations.
  C - combined individuals with combined variations: analyze genotypes for combined variations and for combined individuals from all populations.
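
The three merge choices amount to taking unions or intersections of the individuals and variations of each population. The sketch below illustrates only that set logic, with made-up identifiers; it is not a description of how the server implements merging, and merge_populations is a hypothetical name.

```python
def merge_populations(populations, mode="C"):
    """populations maps a population ID to (individuals, variations) sets.

    Returns the (individuals, variations) analyzed under merge A, B, or C.
    """
    if not populations:
        raise ValueError("at least one population is required")
    individuals = [ind for ind, _ in populations.values()]
    variations = [var for _, var in populations.values()]
    if mode == "A":   # common individuals, combined variations
        return set.intersection(*individuals), set.union(*variations)
    if mode == "B":   # combined individuals, common variations
        return set.union(*individuals), set.intersection(*variations)
    if mode == "C":   # combined individuals, combined variations
        return set.union(*individuals), set.union(*variations)
    raise ValueError("mode must be A, B, or C")
```

With a single population all three modes return the same sets, which is why the parameter matters only when several populations are requested.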

If expandUpstream and/or expandDownstream are included, the region of the genome is extended. Up and down refer to the direction of the genome, not to the direction of the gene for geneName or geneID searches. If the searchType is chromosome, these lines are ignored.

The keepRelatedHapMap parameter is most commonly used for HapMap populations where there are mother-father-child trios. If this parameter is set to true, all genotypes will be analyzed. For tag SNPs it is recommended that the default of false be used. If false, a set of unrelated individuals (mother and father but not child) is used. In spite of its name, this parameter applies to non-HapMap populations as well. This relationship filtering works slightly differently from that of the GVS interactive site. If HapMap and non-HapMap populations are requested, and if there is a child of HapMap parents in the non-HapMap populations, that child will not be included for keepRelatedHapMap false.

The freqCutoff parameter eliminates from consideration variations having minor allele frequencies below the cutoff: range is 0 through 50, an integer in units of percent.

The noMonomorphic parameter (true or false) excludes or includes variations having zero minor allele frequency (i.e. having only one allele).

The includeHapMap3 parameter (true or false or only) includes (true), excludes (false), or selects only (only) genotypes from the HapMap phase 3 project. This parameter applies to all populations, not just the populations 1409-1412, as is the case for the GVS interactive site. If, for example, genotypes are requested for the HapMap3 population 12156, and includeHapMap3 is set to false, no genotypes will be returned.

Six parameters are used only when displayType is tagSNPs. The value r2 is the square of the Pearson correlation coefficient in linkage disequilibrium calculations. The parameter r2Threshold is the minimum value of r2 for variations to belong to the same tag-SNP bin. Its range is from 0.0 through 1.0. For a discussion of linkage disequilibrium and tagSNPs see this link and Carlson et al., Am. J. Hum. Genet., 74:106-120, 2004.

The next two tag-SNP parameters are integers, in units of percent. These are designed to put variations with many unknown genotypes into separate bins. The coverageTagSNPs parameter is the minimum data coverage for a variation to be considered as a potential tagSNP: range is 0 through 100. The coverageClustering parameter is the minimum data coverage for a variation to be clustered potentially with other variations: range is 0 through 100. Its value must be less than the coverageTagSNPs value.

If the multipop parameter is omitted or is set to true (rather than false), the MultiPop-TagSelect Algorithm will be used for tag-SNP selection if there are individuals in different population groups. When using this algorithm, it is best to specify two or more population parameters.

The tagSNPsFormat parameter can be set to either of two values: oneLinePerBin (the default for the original format) or oneLinePerSNP. It affects only single-population sections. If oneLinePerSNP is selected, only one variation is printed on a line, the bins are separated by a blank line, and it's possible to add annotation.

If the parameter bracketLowCoverageSNPs is present and set to true, variations with data coverage below the coverageTagSNPs value will have square brackets placed around the rs ID value in the output. If there is a multipop section, brackets will be placed around the rs ID if the coverage is low for any of the populations.
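
The ranges and the ordering constraint on the tag-SNP parameters can be checked before submission. This is a minimal sketch that encodes only the rules stated above; the function name check_tag_snp_params is hypothetical, and how the server reports an out-of-range value is not covered here.

```python
def check_tag_snp_params(r2_threshold=0.8, coverage_tag_snps=14,
                         coverage_clustering=12,
                         tag_snps_format="oneLinePerBin"):
    """Return a list of problems; an empty list means the values look valid."""
    problems = []
    if not 0.0 <= r2_threshold <= 1.0:
        problems.append("r2Threshold must be from 0.0 through 1.0")
    coverages = (("coverageTagSNPs", coverage_tag_snps),
                 ("coverageClustering", coverage_clustering))
    all_ints = True
    for name, value in coverages:
        if not isinstance(value, int) or not 0 <= value <= 100:
            problems.append(f"{name} must be an integer from 0 through 100")
            all_ints = all_ints and isinstance(value, int)
    if all_ints and coverage_clustering >= coverage_tag_snps:
        problems.append("coverageClustering must be less than coverageTagSNPs")
    if tag_snps_format not in ("oneLinePerBin", "oneLinePerSNP"):
        problems.append("tagSNPsFormat must be oneLinePerBin or oneLinePerSNP")
    return problems
```

The default arguments mirror the defaults listed on this page, so calling the function with no arguments returns an empty list.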

Three parameters are used only when displayType is fastPHASE. The fastPHASERandomStarts parameter selects the number of random starts of the expectation-maximization algorithm in fastPHASE. The default is 20. Higher values may achieve greater accuracy, but the calculation time is increased proportionally. If the returnTarballWithAllFastPHASEFiles parameter is set to true (default is false), a tarball will be returned with all the files created by fastPHASE, as well as the genotype files created by GVS Batch for input to fastPHASE. In this tarball there will also be a monitor file that captures any information fastPHASE would normally write to a console window, and a file echoing your query parameters. If returnTarballWithAllFastPHASEFiles is false (or not specified), only the "hapguess_switch.out" file content of fastPHASE is returned, but preceded by a line specifying the order of the variations (as this information is not in the fastPHASE output files, only in the GVS Batch input files). The fastPHASEUseClockSeed parameter is used to seed the random number generator. If set to false (the default), the fastPHASE calculations are performed with a seed of 13579, so successive submissions will produce the same result. If set to true, the server clock will be used to seed the random number generator, and successive submissions will produce slightly different results.

If there are no lines in the input file for a given parameter, the defaults are used: geneName for searchType, genotypes for displayType, single for numberFiles, no file name modification for fileName, no header line for headerLine, writeParametersToOutputFile = false, r2DisplayCutoff = 0.0, no additional annotation, all individuals and populations returned, C for merge, no expansion of the region, false for keepRelatedHapMap and true for includeHapMap3. For tag SNPs, the defaults are 0.8 for r2Threshold, 0 for freqCutoff, true for noMonomorphic, 14 for coverageTagSNPs, 12 for coverageClustering, true for multipop, oneLinePerBin for tagSNPsFormat, and false for bracketLowCoverageSNPs. For fastPHASE, the defaults are 20 for fastPHASERandomStarts, false for returnTarballWithAllFastPHASEFiles, and false for fastPHASEUseClockSeed.

For the case of searchType=geneName, the region lines (one line per region) should each contain the name of a gene (upper or lower case is fine). For the case of searchType=geneID, each line should contain a numerical gene ID. For the case of searchType=chromosome, each line should contain an hg38 (NCBI 38) region in the form chr*:begin-end (see example below), where * is 1 through 22 or X or Y, and "end" must be equal to or larger than "begin". If only one base is to be queried, the "-" and end base are optional: e.g. chr7:5534892 will be interpreted as chr7:5534892-5534892. (This is useful if rs IDs are required for a list of chromosome positions; this one-base option with a displayType of snpSummary will list rs IDs if there are known variations at the locations.) For the case of searchType=rs, each line is the dbSNP rs ID, with or without lower-case "rs" in front of the number. Regions cannot contain whitespace characters. Any content after whitespace is ignored.
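
The region-line rules above translate directly into a small validator. The sketch below is one possible local check, not the server's own code; normalize_region is a hypothetical name, and the normalized forms it returns (explicit end base, "rs" prefix added) are a convenience of this example.

```python
import re

_CHROM = re.compile(r"chr([1-9]|1[0-9]|2[0-2]|X|Y):(\d+)(?:-(\d+))?")
_RS = re.compile(r"(?:rs)?(\d+)")


def normalize_region(line, search_type):
    """Validate one region line and return it in a normalized form."""
    tokens = line.split()
    if not tokens:
        raise ValueError("empty region line")
    token = tokens[0]            # content after whitespace is ignored
    if search_type == "chromosome":
        match = _CHROM.fullmatch(token)
        if not match:
            raise ValueError(f"not of the form chr*:begin-end: {token!r}")
        chrom, begin = match.group(1), int(match.group(2))
        end = begin if match.group(3) is None else int(match.group(3))
        if end < begin:
            raise ValueError("end must be equal to or larger than begin")
        return f"chr{chrom}:{begin}-{end}"
    if search_type == "rs":
        match = _RS.fullmatch(token)
        if not match:
            raise ValueError(f"not an rs ID: {token!r}")
        return "rs" + match.group(1)
    if search_type == "geneID":
        if not (token.isascii() and token.isdigit()):
            raise ValueError(f"gene ID must be numerical: {token!r}")
        return token
    if search_type == "geneName":
        return token
    raise ValueError(f"unknown searchType: {search_type!r}")
```

For instance, normalize_region("chr7:5534892", "chromosome") returns "chr7:5534892-5534892", matching the single-base interpretation described above.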

File Input Examples
Here are several examples of input files. (If you copy and paste any of these, remove the white-space at the beginnings of the lines.)

Example 1    download example 1
	## list of gene names, each gene in a separate file
	# searchType geneName
	# headerLine example 1
	# fileName genes.example1
	# writeParametersToOutputFile true
	# includeHapMap3 false
	actb
	alad
	
Example 2    download example 2
	## list of gene IDs, all in one file; include additional bases upstream and downstream of the genes; echo the input
	# headerLine example 2
	# fileName testIDs.example2
	# searchType geneID
	# numberFiles single
	# writeParametersToOutputFile true
	# includeHapMap3 false
	# expandUpstream 1500
	# expandDownstream 3000
	60
	6624
	
Example 3    download example 3
	## chromosome region, only two individuals, one related
	# headerLine example 3
	# fileName chromosome.example3
	# writeParametersToOutputFile true
	# includeHapMap3 false
	# searchType chromosome
	# numberFiles single
	# individual 362
	# individual 349
	# noMonomorphic false
	# keepRelatedHapMap true
	chr7:5300000-5400000
	
Example 4    download example 4
	## rs IDs, one file, the rs is optional, only for population 693
	# headerLine example 4
	# fileName rs.example4
	# writeParametersToOutputFile true
	# includeHapMap3 false
	# searchType rs
	# numberFiles single
	# population 693
	rs7612
	rs7161563	
	
Example 5    download example 5
    ## snp annotation for snps having genotypes for the gene pcsk9, population 1623
    # headerLine example 5
    # writeParametersToOutputFile true
    # includeHapMap3 false
    # numberFiles single
    # displayType snpSummary
    # population 1623
    # fileName pcsk9.1623.snpSummary.example5
    pcsk9
	
Example 6    download example 6
	## tag SNPs for pcsk9 and vkorc1 with non-standard thresholds, population 595
	# headerLine example 6
	# writeParametersToOutputFile true
	# includeHapMap3 false
	# numberFiles single
	# searchType geneName
	# fileName pcsk9.vkorc1.595.tagSNPs.example6
	# displayType tagSNPs
	# population 595
	# freqCutoff 10
	# coverageTagSNPs 90
	# coverageClustering 75 
	# r2Threshold 0.75
	pcsk9
	vkorc1
	
Example 7    download example 7
	## tag SNPs for pcsk9 and vkorc1, with MultiPop-TagSelect, populations 596 and 595, annotation
	# headerLine example 7
	# writeParametersToOutputFile true
	# includeHapMap3 false
	# numberFiles single
	# searchType geneName
	# fileName pcsk9.vkorc1.596.595.tagSNPs.multipop.example7
	# displayType tagSNPs
	# coverageTagSNPs 85
	# coverageClustering 70 
	# population 596
	# population 595
	# multipop true
	# tagSNPsFormat oneLinePerSNP
	# annotation Function,FunctionGVS,ConservationScoreGERP,PercentAlleleFrequency,Genes
	pcsk9
	vkorc1
	
Example 8    download example 8
	## genotypes for the gene pcsk9, HapMap populations, keep related individuals
	# headerLine example 8
	# writeParametersToOutputFile true
	# includeHapMap3 false
	# numberFiles single
	# displayType genotypes
	# fileName pcsk9.HapMap.genotypes.all.example8
	# keepRelatedHapMap true
	# population 1412
	# population 1411
	# population 1410
	# population 1409
	pcsk9
	
Example 9    download example 9
	## r2 for the rs ID 2070589, plus SNPs 100K bases on each side, show only r2>=0.5
	# headerLine example 9
	# writeParametersToOutputFile true
	# includeHapMap3 false
	# searchType rs
	# numberFiles single
	# displayType r2
	# r2DisplayCutoff 0.5
	# fileName 2070589.HapMap.r2.example9
	# population 1412
	# population 1411
	# population 1410
	# population 1409
	# expandUpstream 100000
	# expandDownstream 100000
	2070589
	
Example 10    download example 10
	## r2 for the rs ID 6899515, plus SNPs 100K bases on each side, show only r2>=0.5, only for pairs where one is 6899515
	## additional annotation requested for the other member of the pair
	# headerLine example 10
	# writeParametersToOutputFile true
	# includeHapMap3 false
	# searchType rs
	# numberFiles single
	# displayType r2LD
	# r2DisplayCutoff 0.5
	# fileName 6899515.HapMap.r2LD.example10
	# population 1412
	# population 1411
	# population 1410
	# population 1409
	# expandUpstream 100000
	# expandDownstream 100000
	# annotation Function,FunctionGVS,ConservationScoreGERP,PercentAlleleFrequency,Genes,RepeatMasker,TandemRepeatsFinder
	6899515
	
Example 11    download example 11
	## run fastPHASE for two genes, return only phased genotypes, 20 random starts (the default)
	# numberFiles single
	# headerLine fastPHASE
	# searchType geneName
	# displayType fastPHASE
	# returnTarballWithAllFastPHASEFiles false
	# fastPHASERandomStarts 20
	# freqCutoff 0
	# noMonomorphic true
	# writeParametersToOutputFile true
	# includeHapMap3 false
	# population 596
	abo
	vkorc1
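
Input files like the examples above can also be generated from a script, which helps when a large job has to be split into several smaller submissions. The following is a sketch only: build_input is a hypothetical helper, and it simply writes "# name value" lines followed by the region lines, with list values expanded into repeated lines (as for population and individual).

```python
def build_input(regions, **params):
    """Return input-file text: parameter lines first, then region lines."""
    lines = []
    for name, value in params.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            if isinstance(item, bool):
                item = "true" if item else "false"   # lower-case, as in the examples
            lines.append(f"# {name} {item}")
    lines.extend(regions)
    return "\n".join(lines) + "\n"
```

For example, build_input(["pcsk9"], displayType="snpSummary", population=[1623]) produces a three-line file requesting the variant summary for pcsk9 in population 1623.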
	
Additional, Specialized Parameters

Setting the omitSingleSNPBins parameter to true suppresses all tag SNP bins with only one SNP in them. (The default is false.) If there is only one population, this simply has the effect that such bins do not appear in the output file. If the MultiPop-TagSelect Algorithm is being used, only those bins with at least two SNPs (either tagSNPs or other SNPs) in them will be fed into the algorithm. This parameter should be used with caution, as it suppresses much known variation, and affects the mix of ancient versus recent mutations.


Setting the addR2ToTagSNPs (used only for the tagSNPs displayType) to true adds an additional section in the output file: the r2 values for each pair in each bin. This option is so far available only for single-population tag SNPs.


The chipFilter parameter is used for the snpSummary and r2LD displayTypes. In the snpSummary case, only those variations on the designated chip will be written to the output file. In the r2LD case, the output lines are filtered so that only comparison variations on a particular chip are written to file. The value of the chipFilter parameter is one (and only one) chip ID (e.g. A5):
  A9 Affymetrix Genome-Wide Human SNP Array 6.0
  I6Q Illumina Human610-Quad BeadChip
  I7 Illumina OmniExpress
  I10 Illumina Human1M BeadChip


There is an additional searchType value: chipID. It is used to request annotation for all variations on a given chip. This type of search, in the interest of speed, does not access any genotypes. Most parameters are ignored. The only required parameters are searchType and chipFilter (A9, I6Q, etc. as in the list above). There are no region lines. The optional parameters headerLine and fileName have the same function as in other modes.

An example input file would be
	# headerLine I7ChipAnnotation
	# fileName I7Chip
	# searchType chipID
	# chipFilter I7
	
The columns in the single returned file are:

  base(NCBI.38)
  rsID
  Genes
  Function
  ConservationScoreGERP
  SubmitterIDs
  RepeatMasker
  TandemRepeatsFinder
  UpstreamFlank
  DownstreamFlank
  InputFileRegion (the chip ID)

If the parameter returnTarballWithEmailMessage is set to true, the file returned will be a tarball including the contents of the email (including any warning or error messages), as well as the data files.

If the parameter compress is set to false, the file will be returned uncompressed. This parameter is ignored unless numberFiles is single, and the parameter returnTarballWithEmailMessage is false (both of these being the default values).

The searchType value snpListForLD initiates a different mode of operation, for calculating r2 between all pairs of variations in a list. The variations must be listed by rs ID, and the displayType must be r2. A single file is always returned. The variation location columns will contain the chromosome and the position within the chromosome. The search can be constrained by populations, but not by individuals. The maximum number of variations is 20,000 for a frequency cutoff of 0; the limit is higher for nonzero cutoffs. There is no tracking or canceling in this mode, and simultaneous submission of more than one job with on the order of 10,000 lines may cause our server to fail. A job with 10,000 variations, and one HapMap population specified, takes about 1 1/2 hours. An example input file is
	# fileName snpList
	# headerLine snpList
	# writeParametersToOutputFile true
	# searchType snpListForLD
	# displayType r2
	# population 596
	rs5901010
	rs8176752
	rs7469795

There is a way to automate the batch procedure. The submitted file must have a line like
	# autoFile testAuto.txt
	
where autoFile is the parameter, and the value is only used for file identification within our server (it does not need to be unique, as a time stamp is added). It's then necessary to write a screen-scraper program to submit your file, and designate a local file for writing. A Java example is given here.

Limits for Large Queries
Once a file is received, the size of the job is evaluated for time, memory, and disk space requirements. If the job is too big, it is cancelled, and an e-mail is sent asking for the job to be broken into smaller pieces and submitted sequentially. For a "genotypes" query, the limit is 10,000 SNPs per line in the file and 250,000 SNPs for the entire file (the number being calculated before any frequency cutoff is applied). For a "tagSNPs" query, the limit is 5,000 per line; per file it is 60,000 without multipop and 30,000 with multipop. For "r2" or "r2LD", 5,000 per line and 150,000 per file are allowed. For "snpSummary", 10,000 per line and 600,000 per file are the limits. For "fastPHASE", both the line and file limits are 5,000 SNPs. If the freqCutoff parameter is greater than 0, the limits are all multiplied by a factor of (freqCutoff + 4) / 4.
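
The limits and the freqCutoff scaling factor can be tabulated to estimate whether a job is likely to be accepted. The sketch below uses only the numbers quoted above, which may change as noted on this page; snp_limits is a hypothetical name, and "with multipop" is assumed here to mean that the multipop parameter is true.

```python
# (per-line limit, per-file limit) in SNPs, as quoted on this page
BASE_LIMITS = {
    "genotypes": (10_000, 250_000),
    "r2": (5_000, 150_000),
    "r2LD": (5_000, 150_000),
    "snpSummary": (10_000, 600_000),
    "fastPHASE": (5_000, 5_000),
}


def snp_limits(display_type, freq_cutoff=0, multipop=True):
    """Return (per_line, per_file) SNP limits for a query."""
    if display_type == "tagSNPs":
        per_line = 5_000
        per_file = 30_000 if multipop else 60_000
    else:
        per_line, per_file = BASE_LIMITS[display_type]
    # Limits grow by (freqCutoff + 4) / 4 when a cutoff above 0 is set.
    factor = (freq_cutoff + 4) / 4 if freq_cutoff > 0 else 1
    return per_line * factor, per_file * factor
```

As a worked case, a freqCutoff of 4 gives a factor of (4 + 4) / 4 = 2, so a genotypes query may then cover 20,000 SNPs per line and 500,000 per file.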

These limits may be changed at any time as we monitor the server load. A job will reach a timeout limit in 24 hours. If no e-mail has been received by then, something has gone wrong. For a large job, it's a good idea to start with a small subset and see how long the job takes, so you know roughly when a result is expected for the entire job. A crude monitoring system is available; it looks at the number of processed region lines in the file.

Please do not submit simultaneous large jobs.

List of All Documentation Pages
About GVS Batch

Sources of Data for GVS

How To Use GVS Batch (this page)

FAQ

Build Notes

SNP Summary Columns

Linkage Disequilibrium

Merging Populations

MultiPop-TagSelect Algorithm

List of Populations and their IDs in our Database

List of Populations and their IDs in our Database, HapMap Only

List of Individuals and their IDs in our Database

List of Individuals and their IDs in our Database, HapMap Only

 