GVS: Genome Variation Server 144

Linkage Disequilibrium

Analysis of linkage disequilibrium (LD) between polymorphic sites in a locus identifies "clusters" of highly correlated sites based on the r^2 LD statistic. Sites are binned into sets of highly informative markers to minimize redundant data. These data are useful for developing a minimal set of SNPs that can be used for large-scale genotyping of similar sample populations.

We estimate LD using the method of Hill. Consider the genotypes for two diallelic SNPs: the first with alleles A1 and A2, and the second with alleles B1 and B2. The first-SNP genotypes could be A1A1, A1A2, or A2A2; the second, B1B1, B1B2, or B2B2. There are then nine possibilities for the combined data:

        B1B1    B1B2    B2B2
A1A1    N11     N12     N13
A1A2    N21     N22     N23
A2A2    N31     N32     N33

where, in the sample of N individuals, N11 is the number of observed genotypes A1A1B1B1, etc. N is the sum of the nine N's.
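As an illustration only (not the GVS implementation), the table of counts could be built from per-individual genotype calls as sketched below. The input format (a list of genotype-label pairs), the function name `tabulate_genotypes`, and the sample data are assumptions made for this example.

```python
from collections import Counter

# Row and column order follow the table above.
A_GENOTYPES = ("A1A1", "A1A2", "A2A2")
B_GENOTYPES = ("B1B1", "B1B2", "B2B2")

def tabulate_genotypes(pairs):
    """Return (table, N), where table[i][j] is the count for row i, column j.

    `pairs` holds one (first-SNP genotype, second-SNP genotype) tuple per
    individual; this input format is an assumption of the sketch.
    """
    counts = Counter(pairs)
    table = [[counts[(a, b)] for b in B_GENOTYPES] for a in A_GENOTYPES]
    total = sum(sum(row) for row in table)  # N is the sum of the nine cells
    return table, total

# Hypothetical sample of five individuals.
sample = [
    ("A1A1", "B1B1"),
    ("A1A1", "B1B1"),
    ("A1A2", "B1B2"),
    ("A1A2", "B1B2"),
    ("A2A2", "B2B2"),
]
table, n = tabulate_genotypes(sample)
```

Here `table[0][0]` plays the role of N11 and `table[1][1]` the role of N22.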

The expected genotype frequencies are

        B1B1       B1B2                 B2B2
A1A1    P11^2      2P11P12              P12^2
A1A2    2P11P21    2P11P22 + 2P12P21    2P12P22
A2A2    P21^2      2P21P22              P22^2

where P11, P12, P21, and P22 are the gametic (haplotype) frequencies. P11 is the frequency for gametes with A1B1; P12, for A1B2; P21, for A2B1; P22, for A2B2; these four values are not observed, but must be determined from the observed data. Only three of the four are independent, as the sum must be equal to 1.

The formulas in the second table may be understood as follows. To get the genotype combination counted by N11, each of the two chromosomes must have the haplotype A1B1. The probability is P11 x P11. To get N12, one chromosome must have A1B1 and the other A1B2. There are two ways to do this (the first chromosome gets either A1B1 or A1B2), so the probability is 2P11P12. To get N22, there are four possibilities: the first chromosome could be A1B1 and the second A2B2; the first A1B2 and the second A2B1; the first A2B1 and the second A1B2; or the first A2B2 and the second A1B1. The sum of these four gives a probability of P11P22 + P12P21 + P21P12 + P22P11 = 2P11P22 + 2P12P21. The other combinations follow one of these patterns.
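The expected-frequency table can be written out directly as a small function. This is a minimal sketch; the function name and the example haplotype frequencies are hypothetical. Because the nine cells expand (P11 + P12 + P21 + P22)^2, they sum to 1 whenever the four haplotype frequencies do.

```python
def expected_genotype_frequencies(p11, p12, p21, p22):
    """3x3 table of expected two-SNP genotype frequencies.

    Rows are A1A1, A1A2, A2A2; columns are B1B1, B1B2, B2B2.
    """
    return [
        [p11 * p11, 2 * p11 * p12, p12 * p12],
        [2 * p11 * p21, 2 * p11 * p22 + 2 * p12 * p21, 2 * p12 * p22],
        [p21 * p21, 2 * p21 * p22, p22 * p22],
    ]

# Hypothetical haplotype frequencies that sum to 1.
freqs = expected_genotype_frequencies(0.4, 0.1, 0.2, 0.3)
total = sum(sum(row) for row in freqs)
```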

Multiplied by N, the value of each cell of the second table should be approximately equal to that of the corresponding cell in the first table. However, it is unlikely that any choice of the three independent P's will exactly reproduce all nine observed genotype counts. A statistical process is being considered: N individuals are selected from a much larger population that is described by the gametic frequencies. We then ask what values of the four P's (constrained to sum to 1) make the observed N's most probable.
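One way to make "most probable" concrete is to score a candidate set of P's by the multinomial log-likelihood of the observed counts. The sketch below assumes that sampling model and drops the term that does not depend on the P's; the function name and the example counts are hypothetical.

```python
import math

def log_likelihood(table, p11, p12, p21, p22):
    """Log-likelihood of the 3x3 counts, up to a constant independent of the P's."""
    expected = [
        [p11 * p11, 2 * p11 * p12, p12 * p12],
        [2 * p11 * p21, 2 * p11 * p22 + 2 * p12 * p21, 2 * p12 * p22],
        [p21 * p21, 2 * p21 * p22, p22 * p22],
    ]
    total = 0.0
    for counts_row, freq_row in zip(table, expected):
        for count, freq in zip(counts_row, freq_row):
            if count == 0:
                continue  # empty cells contribute nothing
            if freq <= 0.0:
                return float("-inf")  # observed genotype is impossible under these P's
            total += count * math.log(freq)
    return total

# Hypothetical counts: 2 x A1A1/B1B1, 2 x A1A2/B1B2, 1 x A2A2/B2B2.
observed = [[2, 0, 0], [0, 2, 0], [0, 0, 1]]
ll_uniform = log_likelihood(observed, 0.25, 0.25, 0.25, 0.25)
ll_better = log_likelihood(observed, 0.6, 0.0, 0.0, 0.4)
```

For these counts the second choice of P's scores higher than the uniform choice, which is the sense in which it describes the data better.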

This can be done by a method known as gene counting. Consider what we know about the gametes for each of the 9 combined-data possibilities.

        B1B1          B1B2          B2B2
A1A1    A1B1, A1B1    A1B1, A1B2    A1B2, A1B2
A1A2    A1B1, A2B1    A1B1, A2B2    A1B2, A2B2
                        or
                      A1B2, A2B1
A2A2    A2B1, A2B1    A2B1, A2B2    A2B2, A2B2


Only the center cell of the table (the double heterozygote A1A2 B1B2) is ambiguous as to which gametes are present. The two choices can be expected to occur in the proportions 2P11P22 / (2P11P22 + 2P12P21) for the top set of gametes (A1B1 and A2B2) and 2P12P21 / (2P11P22 + 2P12P21) for the bottom set (A1B2 and A2B1).

By counting the gametes, we get the estimated frequencies

P11 = [2N11 + N12 + N21 + N22P11P22/(P11P22+P12P21)] / 2N

P12 = [N12 + 2N13 + N23 + N22P12P21/(P11P22+P12P21)] / 2N

P21 = [N21 + 2N31 + N32 + N22P12P21/(P11P22+P12P21)] / 2N

P22 = [N23 + N32 + 2N33 + N22P11P22/(P11P22+P12P21)] / 2N
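Because the P's appear on both sides, these four equations are naturally applied repeatedly: plug in the current P's, obtain new ones, and continue until they stop changing. The sketch below is one way to do that and is not taken from the GVS code; the starting point of 0.25 for each haplotype, the tolerance, and the even split used when both double-heterozygote terms are zero are assumptions.

```python
def gene_counting_step(table, p11, p12, p21, p22):
    """Apply the four gene-counting equations once and return the new P's."""
    (n11, n12, n13), (n21, n22, n23), (n31, n32, n33) = table
    two_n = 2 * sum(sum(row) for row in table)
    denom = p11 * p22 + p12 * p21
    # Share of double heterozygotes carrying A1B1 and A2B2 (the top set).
    # Assumption: split evenly if both products are zero.
    share_top = p11 * p22 / denom if denom > 0 else 0.5
    share_bottom = 1.0 - share_top
    return (
        (2 * n11 + n12 + n21 + n22 * share_top) / two_n,
        (n12 + 2 * n13 + n23 + n22 * share_bottom) / two_n,
        (n21 + 2 * n31 + n32 + n22 * share_bottom) / two_n,
        (n23 + n32 + 2 * n33 + n22 * share_top) / two_n,
    )

def estimate_haplotype_frequencies(table, tol=1e-10, max_iter=1000):
    """Iterate the gene-counting step until the P's change by less than tol."""
    p = (0.25, 0.25, 0.25, 0.25)  # assumed starting point
    for _ in range(max_iter):
        new_p = gene_counting_step(table, *p)
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p

# Hypothetical counts: 2 x A1A1/B1B1, 2 x A1A2/B1B2, 1 x A2A2/B2B2.
observed = [[2, 0, 0], [0, 2, 0], [0, 0, 1]]
first_step = gene_counting_step(observed, 0.25, 0.25, 0.25, 0.25)
estimates = estimate_haplotype_frequencies(observed)
```

With these hypothetical counts the estimates settle near P11 = 0.6 and P22 = 0.4, with P12 and P21 near 0.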

The allele frequencies for the first allele, pA1 for the first locus, and pB1 for the second, are

pA1 = P11 + P12 (always 1 for the first subscript, but summed over the second)
pB1 = P11 + P21 (always 1 for the second subscript, but summed over the first)

The linkage disequilibrium parameter D is defined as

D = P11P22 - P12P21

D measures the extent to which the alleles for the A locus are correlated with those from B locus. If the alleles for the two loci are independent (as would be expected if they are on different chromosomes or are located far apart), D = 0.
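A short numerical check of this statement, with hypothetical frequencies: when the haplotype frequencies are products of the allele frequencies (independent loci), D comes out as 0 up to rounding; when A1 always travels with B1, D is positive.

```python
def linkage_disequilibrium(p11, p12, p21, p22):
    """D = P11*P22 - P12*P21."""
    return p11 * p22 - p12 * p21

# Independent loci: each haplotype frequency is a product of allele frequencies.
p_a1, p_b1 = 0.6, 0.3  # hypothetical allele frequencies
independent = (
    p_a1 * p_b1,
    p_a1 * (1 - p_b1),
    (1 - p_a1) * p_b1,
    (1 - p_a1) * (1 - p_b1),
)
d_independent = linkage_disequilibrium(*independent)

# Complete association: only A1B1 and A2B2 haplotypes occur.
d_associated = linkage_disequilibrium(0.6, 0.0, 0.0, 0.4)
```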

Substituting in

P12 = pA1 - P11
P21 = pB1 - P11
P22 = 1 - P11 - P12 - P21 = 1 - pA1 - pB1 + P11

gives an alternative expression for D:

D = P11 - pA1pB1
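The algebra behind this expression can be written out as a worked check, using only the three substitutions listed above:

```latex
\begin{aligned}
D &= P_{11}P_{22} - P_{12}P_{21} \\
  &= P_{11}\,(1 - p_{A1} - p_{B1} + P_{11}) - (p_{A1} - P_{11})(p_{B1} - P_{11}) \\
  &= P_{11} - p_{A1}P_{11} - p_{B1}P_{11} + P_{11}^2
     - \left(p_{A1}p_{B1} - p_{A1}P_{11} - p_{B1}P_{11} + P_{11}^2\right) \\
  &= P_{11} - p_{A1}p_{B1}
\end{aligned}
```

The terms in P11 squared and the cross terms cancel, leaving only P11 and the product of the two allele frequencies.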

As estimates for pA1 and pB1 can be obtained from the allele frequencies at the individual sites, it remains to get an estimate for P11. With the same three substitutions for P12, P21, P22, the above equation for P11 becomes

$$P_{11} = \frac{1}{2N}\left[\,2N_{11} + N_{12} + N_{21} + \frac{N_{22}\,P_{11}\,(1 - p_{A1} - p_{B1} + P_{11})}{P_{11}\,(1 - p_{A1} - p_{B1} + P_{11}) + (p_{A1} - P_{11})(p_{B1} - P_{11})}\right]$$

This is a cubic equation for P11. We solve it by making an initial estimate for P11, substituting it into the right-hand side of the equation, getting a new P11, and iteratively substituting into the right-hand side, until the result converges. The initial estimate is taken as pA1pB1.
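The iteration described above can be sketched as follows. This is an illustration under stated assumptions, not the GVS source: the function name, tolerance, iteration cap, and the even split used when the denominator is zero are all choices made for the example. The allele frequencies are taken from allele counts at each site, and the starting value is pA1pB1 as in the text.

```python
def estimate_p11(table, tol=1e-10, max_iter=1000):
    """Fixed-point iteration for P11; returns (P11, pA1, pB1)."""
    (n11, n12, n13), (n21, n22, n23), (n31, n32, n33) = table
    two_n = 2 * sum(sum(row) for row in table)
    # Allele frequencies from allele counts at each site:
    # homozygotes carry two copies of the allele, heterozygotes one.
    p_a1 = (2 * (n11 + n12 + n13) + (n21 + n22 + n23)) / two_n
    p_b1 = (2 * (n11 + n21 + n31) + (n12 + n22 + n32)) / two_n
    p11 = p_a1 * p_b1  # initial estimate, as in the text
    for _ in range(max_iter):
        top = p11 * (1 - p_a1 - p_b1 + p11)     # P11 * P22
        bottom = (p_a1 - p11) * (p_b1 - p11)    # P12 * P21
        # Assumption: split double heterozygotes evenly if both terms vanish.
        share = top / (top + bottom) if top + bottom > 0 else 0.5
        new_p11 = (2 * n11 + n12 + n21 + n22 * share) / two_n
        if abs(new_p11 - p11) < tol:
            return new_p11, p_a1, p_b1
        p11 = new_p11
    return p11, p_a1, p_b1

# Hypothetical counts: 2 x A1A1/B1B1, 2 x A1A2/B1B2, 1 x A2A2/B2B2.
observed = [[2, 0, 0], [0, 2, 0], [0, 0, 1]]
p11, p_a1, p_b1 = estimate_p11(observed)
```

For these counts pA1 and pB1 are both 0.6, the starting value is 0.36, and the iterates move through 0.5 and 0.5875 toward 0.6.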

The r^2 values we use are the squares of the Pearson correlation coefficients, and are related to D by

$$r^2 = \frac{D^2}{p_{A1}\,(1 - p_{A1})\,p_{B1}\,(1 - p_{B1})}$$
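Given estimates of P11, pA1, and pB1, the r^2 formula is a one-line computation. The sketch below uses D = P11 - pA1pB1 from earlier in this section; returning `None` when either site is monomorphic (so the denominator is zero) is an assumption of the example, as is the function name.

```python
def r_squared(p11, p_a1, p_b1):
    """r^2 = D^2 / (pA1 (1 - pA1) pB1 (1 - pB1)), with D = P11 - pA1 pB1."""
    d = p11 - p_a1 * p_b1
    denom = p_a1 * (1 - p_a1) * p_b1 * (1 - p_b1)
    if denom == 0:
        return None  # assumption: r^2 is left undefined for a monomorphic site
    return d * d / denom

# Hypothetical inputs.
r2_complete = r_squared(0.6, 0.6, 0.6)      # A1 always paired with B1
r2_independent = r_squared(0.36, 0.6, 0.6)  # P11 equals pA1 * pB1
r2_undefined = r_squared(0.5, 1.0, 0.5)     # first site is monomorphic
```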

References

W. G. Hill (1974) Estimation of linkage disequilibrium in randomly mating populations. Heredity 33:229. Note that there are a couple of errors in the equations. These have been corrected in D. L. Hartl and A. G. Clark (1989) Principles of Population Genetics, second edition, Sinauer Associates, Inc., Sunderland, Massachusetts, p. 56.
 