Linkage Disequilibrium
Analysis of linkage disequilibrium (LD) between polymorphic sites in a locus identifies "clusters" of highly correlated sites based
on the r^{2} LD statistic. Sites are binned into sets of highly informative markers to minimize redundant data.
These data are useful for developing a minimal set of SNPs that can be used for large-scale genotyping of similar sample populations.
We estimate LD using the method of Hill. Consider the genotypes for two diallelic SNPs: the first with alleles A_{1} and A_{2},
and the second with alleles B_{1} and B_{2}. The genotypes of the first SNP could be A_{1}A_{1}, A_{1}A_{2},
or A_{2}A_{2}; those of the second, B_{1}B_{1}, B_{1}B_{2}, or B_{2}B_{2}.
There are then 9 possibilities for the combined data:

              B_{1}B_{1}    B_{1}B_{2}    B_{2}B_{2}
A_{1}A_{1}    N_{11}        N_{12}        N_{13}
A_{1}A_{2}    N_{21}        N_{22}        N_{23}
A_{2}A_{2}    N_{31}        N_{32}        N_{33}
where, in the sample of N individuals, N_{11} is the number of observed genotypes A_{1}A_{1}B_{1}B_{1}, etc.
N is the sum of the nine N's.
The expected genotype frequencies are

              B_{1}B_{1}       B_{1}B_{2}                     B_{2}B_{2}
A_{1}A_{1}    P_{11}^{2}       2P_{11}P_{12}                  P_{12}^{2}
A_{1}A_{2}    2P_{11}P_{21}    2P_{11}P_{22}+2P_{12}P_{21}    2P_{12}P_{22}
A_{2}A_{2}    P_{21}^{2}       2P_{21}P_{22}                  P_{22}^{2}
where P_{11}, P_{12}, P_{21}, and P_{22} are the gametic (haplotype) frequencies. P_{11} is the frequency
for gametes with A_{1}B_{1}; P_{12}, for A_{1}B_{2}; P_{21}, for A_{2}B_{1};
P_{22}, for A_{2}B_{2}; these four values are not observed, but must be determined from the observed data. Only three of
the four are independent, as the sum must be equal to 1.
The formulas in the second table may be understood as follows. For the genotype combination counted by N_{11}, each of the two chromosomes must carry the haplotype
A_{1}B_{1}, so the probability is P_{11} x P_{11}. For N_{12}, one chromosome must carry
A_{1}B_{1} and the other A_{1}B_{2}. There are two ways to do this (the first chromosome gets either A_{1}B_{1}
or A_{1}B_{2}), so the probability is 2P_{11}P_{12}. For N_{22}, there are four possibilities: the first
chromosome could be A_{1}B_{1} and the second A_{2}B_{2}; the first A_{1}B_{2} and the second
A_{2}B_{1}; the first A_{2}B_{1} and the second A_{1}B_{2}; the first A_{2}B_{2} and
the second A_{1}B_{1}. The sum of these four gives a probability of P_{11}P_{22}+P_{12}P_{21}+
P_{21}P_{12}+P_{22}P_{11} = 2P_{11}P_{22}+2P_{12}P_{21}. Other combinations follow one of these patterns.
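The pairing argument can be checked numerically. The following is a minimal sketch, not part of the original method description: the function name `expected_genotype_freqs` and the example haplotype frequencies are made up for illustration. It builds the 3x3 table of expected genotype frequencies from the four haplotype frequencies and confirms that the nine cells sum to 1.

```python
def expected_genotype_freqs(p11, p12, p21, p22):
    """Return the 3x3 table of expected two-locus genotype frequencies.

    Rows are A1A1, A1A2, A2A2; columns are B1B1, B1B2, B2B2.
    Each genotype is treated as a pair of independently drawn gametes.
    """
    return [
        [p11 * p11, 2 * p11 * p12, p12 * p12],
        [2 * p11 * p21, 2 * p11 * p22 + 2 * p12 * p21, 2 * p12 * p22],
        [p21 * p21, 2 * p21 * p22, p22 * p22],
    ]


# Illustrative (made-up) haplotype frequencies that sum to 1.
table = expected_genotype_freqs(0.4, 0.1, 0.2, 0.3)

# The nine cells expand (p11 + p12 + p21 + p22)^2, so they must sum to 1.
total = sum(sum(row) for row in table)
```

Multiplying each cell by the sample size gives the expected count for the corresponding cell of the first table.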
Up to a constant of proportionality of N, the value of each cell of the first table should
be approximately equal to that of the corresponding cell in the second table. However, it is unlikely that any choice of the three independent P's
will exactly describe the larger number of observed genotype counts. A statistical process is being considered: N individuals are selected from a much larger
population that is described by the gametic frequencies. We then ask what values of the four P's (constrained to sum to 1) make the
observed N's most probable.
This can be done by a method known as gene counting. Consider what we know about the gametes for each of the 9 combined-data possibilities.

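              B_{1}B_{1}                B_{1}B_{2}                   B_{2}B_{2}
A_{1}A_{1}    A_{1}B_{1}, A_{1}B_{1}    A_{1}B_{1}, A_{1}B_{2}       A_{1}B_{2}, A_{1}B_{2}
A_{1}A_{2}    A_{1}B_{1}, A_{2}B_{1}    A_{1}B_{1}, A_{2}B_{2}       A_{1}B_{2}, A_{2}B_{2}
                                        or
                                        A_{1}B_{2}, A_{2}B_{1}
A_{2}A_{2}    A_{2}B_{1}, A_{2}B_{1}    A_{2}B_{1}, A_{2}B_{2}       A_{2}B_{2}, A_{2}B_{2}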
Only in the center cell of the table is there an ambiguity in what the gametes are. The two choices can be expected to occur
in the proportions 2P_{11}P_{22} / (2P_{11}P_{22}+2P_{12}P_{21}) for the top set of
gametes and 2P_{12}P_{21} / (2P_{11}P_{22}+2P_{12}P_{21}) for the bottom set.
By counting the gametes, we get the estimated frequencies
P_{11} = [2N_{11} + N_{12} + N_{21} + N_{22}P_{11}P_{22}/(P_{11}P_{22}+P_{12}P_{21})] / 2N
P_{12} = [N_{12} + 2N_{13} + N_{23} + N_{22}P_{12}P_{21}/(P_{11}P_{22}+P_{12}P_{21})] / 2N
P_{21} = [N_{21} + 2N_{31} + N_{32} + N_{22}P_{12}P_{21}/(P_{11}P_{22}+P_{12}P_{21})] / 2N
P_{22} = [N_{23} + N_{32} + 2N_{33} + N_{22}P_{11}P_{22}/(P_{11}P_{22}+P_{12}P_{21})] / 2N
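One pass of these four updates can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used to produce the results: the function name `gene_counting_step`, the even split used when both haplotype products are zero, and the example counts are all hypothetical.

```python
def gene_counting_step(n, p):
    """Apply one gene-counting update to the haplotype frequencies.

    n: 3x3 genotype counts, rows A1A1/A1A2/A2A2, columns B1B1/B1B2/B2B2.
    p: current estimates (p11, p12, p21, p22).
    """
    p11, p12, p21, p22 = p
    total = sum(sum(row) for row in n)  # N, the number of individuals
    top = p11 * p22      # double heterozygote carrying A1B1 and A2B2
    bottom = p12 * p21   # double heterozygote carrying A1B2 and A2B1
    # Share of the center cell assigned to the top gamete pair.
    # Assumption: split evenly if both products are zero.
    w = top / (top + bottom) if top + bottom > 0 else 0.5
    new_p11 = (2 * n[0][0] + n[0][1] + n[1][0] + n[1][1] * w) / (2 * total)
    new_p12 = (n[0][1] + 2 * n[0][2] + n[1][2] + n[1][1] * (1 - w)) / (2 * total)
    new_p21 = (n[1][0] + 2 * n[2][0] + n[2][1] + n[1][1] * (1 - w)) / (2 * total)
    new_p22 = (n[1][2] + n[2][1] + 2 * n[2][2] + n[1][1] * w) / (2 * total)
    return new_p11, new_p12, new_p21, new_p22


# Made-up genotype counts for 50 individuals.
counts = [[10, 5, 1], [6, 8, 3], [2, 4, 11]]
updated = gene_counting_step(counts, (0.25, 0.25, 0.25, 0.25))
```

Because every individual contributes exactly two gametes, the four updated frequencies always sum to 1.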
The frequencies of the first allele at each locus, p_{A1} for the first locus and p_{B1} for the second, are
p_{A1} = P_{11} + P_{12} (always 1 for the first subscript, but summed over the second)
p_{B1} = P_{11} + P_{21} (always 1 for the second subscript, but summed over the first)
The linkage disequilibrium parameter D is defined as
D = P_{11}P_{22} - P_{12}P_{21}
D measures the extent to which the alleles at the A locus are correlated with those at the B locus. If the alleles at the two loci
are independent (as would be expected if they are on different chromosomes or are located far apart), D = 0.
Substituting in
P_{12} = p_{A1} - P_{11}
P_{21} = p_{B1} - P_{11}
P_{22} = 1 - P_{11} - P_{12} - P_{21} = 1 - p_{A1} - p_{B1} + P_{11}
gives an alternative expression for D:
D = P_{11} - p_{A1}p_{B1}
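For readers who want the intermediate algebra, the substitution can be expanded term by term. This worked expansion uses only the definitions given in this section:

```latex
\begin{align*}
D &= P_{11}P_{22} - P_{12}P_{21} \\
  &= P_{11}(1 - p_{A1} - p_{B1} + P_{11}) - (p_{A1} - P_{11})(p_{B1} - P_{11}) \\
  &= P_{11} - p_{A1}P_{11} - p_{B1}P_{11} + P_{11}^{2}
     - p_{A1}p_{B1} + p_{A1}P_{11} + p_{B1}P_{11} - P_{11}^{2} \\
  &= P_{11} - p_{A1}p_{B1}
\end{align*}
```

The cross terms and the quadratic terms cancel in pairs, leaving the observed haplotype frequency minus the frequency expected under independence.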
As estimates for p_{A1} and p_{B1} can be obtained from the allele frequencies at the individual sites, it remains to get an estimate for P_{11}.
With the same three substitutions for P_{12}, P_{21}, P_{22}, the above equation for P_{11} becomes
P_{11} = [2N_{11} + N_{12} + N_{21} + N_{22}P_{11}(1 - p_{A1} - p_{B1} + P_{11}) / (P_{11}(1 - p_{A1} - p_{B1} + P_{11}) + (p_{A1} - P_{11})(p_{B1} - P_{11}))] / 2N

This is a cubic equation for P_{11}. We solve it by making an initial estimate for P_{11}, substituting it into the right-hand
side of the equation to get a new P_{11}, and repeating the substitution until the result converges. The initial estimate is taken as
p_{A1}p_{B1}.
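The iteration can be sketched as below. This is a minimal illustration, not the production code: the function name `estimate_p11`, the tolerance, the iteration cap, the even-split fallback for a zero denominator, and the example counts are all assumptions made for the sketch.

```python
def estimate_p11(n, tol=1e-10, max_iter=1000):
    """Estimate P11 by repeated substitution into the gene-counting equation.

    n: 3x3 genotype counts, rows A1A1/A1A2/A2A2, columns B1B1/B1B2/B2B2.
    Returns (p11, p_a1, p_b1).
    """
    total = sum(sum(row) for row in n)
    # Allele frequencies from the single-site genotype counts.
    p_a1 = (2 * sum(n[0]) + sum(n[1])) / (2 * total)
    p_b1 = (2 * sum(row[0] for row in n) + sum(row[1] for row in n)) / (2 * total)
    p11 = p_a1 * p_b1  # initial estimate: the value under independence
    for _ in range(max_iter):
        top = p11 * (1 - p_a1 - p_b1 + p11)       # P11 * P22
        bottom = (p_a1 - p11) * (p_b1 - p11)      # P12 * P21
        denom = top + bottom
        w = top / denom if denom > 0 else 0.5     # assumed fallback
        new_p11 = (2 * n[0][0] + n[0][1] + n[1][0] + n[1][1] * w) / (2 * total)
        if abs(new_p11 - p11) < tol:
            return new_p11, p_a1, p_b1
        p11 = new_p11
    return p11, p_a1, p_b1


# Made-up counts in which A1 always travels with B1 (complete LD).
counts = [[10, 0, 0], [0, 20, 0], [0, 0, 10]]
p11, p_a1, p_b1 = estimate_p11(counts)
```

In this example both allele frequencies are 0.5, so the starting value is 0.25, and the iteration moves P_{11} toward 0.5, the value at which every double heterozygote is assigned to the A_{1}B_{1} / A_{2}B_{2} pair.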
The r^{2} values we use are the squares of the Pearson correlation coefficients, and are related to D by
r^{2} = D^{2} / [p_{A1}(1 - p_{A1})p_{B1}(1 - p_{B1})]
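As a small illustration, r^{2} can be computed directly from an estimate of P_{11} and the two allele frequencies. The function name `r_squared` and the choice to return `None` for a monomorphic site (where the denominator is zero) are assumptions of this sketch.

```python
def r_squared(p11, p_a1, p_b1):
    """Return r^2 from the A1B1 haplotype frequency and the allele frequencies."""
    d = p11 - p_a1 * p_b1  # linkage disequilibrium parameter D
    denom = p_a1 * (1 - p_a1) * p_b1 * (1 - p_b1)
    if denom == 0:
        return None  # assumed convention: undefined when a site is monomorphic
    return d * d / denom


# Complete LD with equal allele frequencies, and full independence.
r2_complete = r_squared(0.5, 0.5, 0.5)
r2_independent = r_squared(0.25, 0.5, 0.5)
```

With both allele frequencies at 0.5, P_{11} = 0.5 gives D = 0.25 and r^{2} = 1, while P_{11} = 0.25 gives D = 0 and r^{2} = 0.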

References
W. G. Hill (1974) Estimation of linkage disequilibrium in randomly mating populations. Heredity 33:229. Note that there are a couple of errors in the equations.
These have been corrected in D. L. Hartl and A. G. Clark (1989) Principles of Population Genetics, second edition, Sinauer Associates, Inc.,
Sunderland, Massachusetts, p. 56.





