Linkage Disequilibrium
Analysis of linkage disequilibrium (LD) between polymorphic sites in a locus identifies "clusters" of highly correlated sites based
on the r² LD statistic. Sites are binned into sets of highly informative markers to minimize redundant data.
These data are useful for developing a minimal set of SNPs that can be used for large-scale genotyping of similar sample populations.
We estimate LD using the method of Hill. Consider the genotypes for two diallelic SNPs: the first with alleles A1 and A2,
and the second with alleles B1 and B2. The first-SNP genotypes could be A1A1, A1A2,
or A2A2; the second, B1B1, B1B2, or B2B2.
There are then 9 possibilities for the combined data:
|      | B1B1 | B1B2 | B2B2 |
| A1A1 | N11  | N12  | N13  |
| A1A2 | N21  | N22  | N23  |
| A2A2 | N31  | N32  | N33  |
where, in the sample of N individuals, N11 is the number of observed genotypes A1A1B1B1, etc.
N is the sum of the nine N's.
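The tabulation above can be sketched in a few lines of Python. This is only an illustration: the function name `genotype_table` and the integer encoding of genotypes (0 for A1A1 or B1B1, 1 for the heterozygote, 2 for A2A2 or B2B2) are assumptions made for the example, not part of the method as described.

```python
def genotype_table(a_genotypes, b_genotypes):
    """Count combined genotypes for two diallelic SNPs.

    Each genotype is encoded (by assumption) as the number of copies of the
    second allele: 0 = A1A1 / B1B1, 1 = A1A2 / B1B2, 2 = A2A2 / B2B2.
    Returns a 3x3 list of lists; rows index the A genotype and columns the
    B genotype, so table[0][0] is N11 and table[1][1] is N22.
    """
    if len(a_genotypes) != len(b_genotypes):
        raise ValueError("both SNPs must be genotyped in the same individuals")
    table = [[0] * 3 for _ in range(3)]
    for a, b in zip(a_genotypes, b_genotypes):
        table[a][b] += 1
    return table


# Four individuals: one A1A1/B1B1, two double heterozygotes, one A2A2/B1B1.
counts = genotype_table([0, 1, 1, 2], [0, 1, 1, 0])
n_individuals = sum(sum(row) for row in counts)  # N, the sum of the nine counts
```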
The expected genotype frequencies are
|      | B1B1    | B1B2              | B2B2    |
| A1A1 | P11²    | 2P11P12           | P12²    |
| A1A2 | 2P11P21 | 2P11P22 + 2P12P21 | 2P12P22 |
| A2A2 | P21²    | 2P21P22           | P22²    |
where P11, P12, P21, and P22 are the gametic (haplotype) frequencies. P11 is the frequency
for gametes with A1B1; P12, for A1B2; P21, for A2B1;
P22, for A2B2; these four values are not observed, but must be determined from the observed data. Only three of
the four are independent, as the sum must be equal to 1.
The formulas in the second table may be understood as follows. To get the genotype combination counted by N11, each of the two chromosomes must have the haplotype
A1B1. The probability is P11 x P11. To get N12, one chromosome must have
A1B1 and the other A1B2. There are two ways to do this (the first chromosome gets either A1B1
or A1B2), so the probability is 2P11P12. To get N22, there are four possibilities: the first
chromosome could be A1B1 and the second A2B2; the first A1B2 and the second
A2B1; the first A2B1 and the second A1B2; the first A2B2 and
the second A1B1. The sum of these four gives a probability of P11P22 + P12P21 +
P21P12 + P22P11 = 2P11P22 + 2P12P21. Other combinations follow one of these patterns.
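The counting argument can be checked mechanically by enumerating every ordered pair of haplotypes and adding the product of their frequencies to the genotype cell the pair produces. The sketch below does this; the function name `expected_genotype_freqs` and the encoding of a haplotype as (copies of A2, copies of B2) are illustrative assumptions.

```python
from itertools import product


def expected_genotype_freqs(p11, p12, p21, p22):
    """Expected frequencies of the nine combined genotypes.

    Haplotypes are encoded (by assumption) as (copies of A2, copies of B2):
    A1B1 = (0, 0), A1B2 = (0, 1), A2B1 = (1, 0), A2B2 = (1, 1).
    Rows of the result index the A genotype, columns the B genotype.
    """
    haplotypes = [((0, 0), p11), ((0, 1), p12), ((1, 0), p21), ((1, 1), p22)]
    table = [[0.0] * 3 for _ in range(3)]
    # Each ordered pair (first chromosome, second chromosome) is one "way"
    # of forming a genotype; its probability is the product of frequencies.
    for ((a1, b1), f1), ((a2, b2), f2) in product(haplotypes, repeat=2):
        table[a1 + a2][b1 + b2] += f1 * f2
    return table


freqs = expected_genotype_freqs(0.4, 0.1, 0.2, 0.3)
```

With these example frequencies the center cell is 2(0.4)(0.3) + 2(0.1)(0.2) = 0.28, and the nine cells sum to 1 because the four haplotype frequencies do.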
Apart from a factor of N (the expected count in a cell is N times its expected frequency), the value in each cell of the first table should
be approximately equal to that of the corresponding cell in the second table. However, it is unlikely that any choice of the three independent P's
will exactly describe the larger number of observed genotypes. A statistical process is being considered: N individuals are selected from a much larger
population that is described by the gametic frequencies. We then ask what values of the four P's (constrained to sum to 1) give the most probable
result for the observed N's.
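One way to make "most probable result" concrete is the multinomial log-likelihood of the observed counts, which (up to a constant that does not depend on the P's) is the sum over cells of the count times the log of the expected frequency. The sketch below is an illustration under that assumption; the function name `log_likelihood` is hypothetical, and the text above does not state which form of the likelihood is used in practice.

```python
import math


def log_likelihood(counts, p11, p12, p21, p22):
    """Multinomial log-likelihood of a 3x3 genotype count table.

    The constant multinomial coefficient is omitted, so only differences
    between values for different haplotype frequencies are meaningful.
    """
    expected = [
        [p11 * p11, 2 * p11 * p12, p12 * p12],
        [2 * p11 * p21, 2 * p11 * p22 + 2 * p12 * p21, 2 * p12 * p22],
        [p21 * p21, 2 * p21 * p22, p22 * p22],
    ]
    total = 0.0
    for count_row, expected_row in zip(counts, expected):
        for n, e in zip(count_row, expected_row):
            if n == 0:
                continue  # empty cells contribute nothing
            if e <= 0.0:
                return float("-inf")  # observed genotype declared impossible
            total += n * math.log(e)
    return total


# Ten A1A1/B1B1 and ten A2A2/B2B2 individuals: equal frequencies of the
# A1B1 and A2B2 haplotypes fit better than a 0.9 / 0.1 split.
observed = [[10, 0, 0], [0, 0, 0], [0, 0, 10]]
balanced = log_likelihood(observed, 0.5, 0.0, 0.0, 0.5)
skewed = log_likelihood(observed, 0.9, 0.0, 0.0, 0.1)
```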
This can be done by a method known as gene counting. Consider what we know about the gametes for each of the 9 combined-data possibilities.
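|      | B1B1        | B1B2                       | B2B2        |
| A1A1 | A1B1 + A1B1 | A1B1 + A1B2                | A1B2 + A1B2 |
| A1A2 | A1B1 + A2B1 | A1B1 + A2B2 or A1B2 + A2B1 | A1B2 + A2B2 |
| A2A2 | A2B1 + A2B1 | A2B1 + A2B2                | A2B2 + A2B2 |
Only in the center cell of the table is there an ambiguity in what the gametes are. The two choices can be expected to occur
in the proportions 2P11P22 / (2P11P22 + 2P12P21) for the first pair of
gametes (A1B1 and A2B2) and 2P12P21 / (2P11P22 + 2P12P21) for the second pair (A1B2 and A2B1).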
By counting the gametes, we get the estimated frequencies
P11 = [2N11 + N12 + N21 + N22P11P22/(P11P22+P12P21)] / 2N
P12 = [N12 + 2N13 + N23 + N22P12P21/(P11P22+P12P21)] / 2N
P21 = [N21 + 2N31 + N32 + N22P12P21/(P11P22+P12P21)] / 2N
P22 = [N23 + N32 + 2N33 + N22P11P22/(P11P22+P12P21)] / 2N
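The four equations above define one round of gene counting: current values of the P's go in on the right-hand side and updated values come out on the left. A minimal sketch of a single round follows. The function name `gene_counting_step` is an assumption, as is the choice to split the double heterozygotes evenly when the current frequencies make both phases impossible.

```python
def gene_counting_step(counts, p11, p12, p21, p22):
    """Apply the four gene-counting equations once.

    counts is the 3x3 genotype table [[N11, N12, N13], [N21, N22, N23],
    [N31, N32, N33]]; the return value is the updated (P11, P12, P21, P22).
    """
    (n11, n12, n13), (n21, n22, n23), (n31, n32, n33) = counts
    n = sum(sum(row) for row in counts)
    if n == 0:
        raise ValueError("the genotype table is empty")
    denom = p11 * p22 + p12 * p21
    # Fraction of double heterozygotes assigned to the A1B1 + A2B2 phase.
    # Assumption for this sketch: fall back to an even split if denom is 0.
    first_pair = p11 * p22 / denom if denom > 0 else 0.5
    second_pair = 1.0 - first_pair
    new_p11 = (2 * n11 + n12 + n21 + n22 * first_pair) / (2 * n)
    new_p12 = (n12 + 2 * n13 + n23 + n22 * second_pair) / (2 * n)
    new_p21 = (n21 + 2 * n31 + n32 + n22 * second_pair) / (2 * n)
    new_p22 = (n23 + n32 + 2 * n33 + n22 * first_pair) / (2 * n)
    return new_p11, new_p12, new_p21, new_p22


# With no double heterozygotes the gametes are counted directly, so a single
# round gives the final answer regardless of the starting values.
updated = gene_counting_step([[4, 0, 0], [0, 0, 0], [0, 0, 6]], 0.25, 0.25, 0.25, 0.25)
```

Because every individual contributes exactly two gametes, the four updated frequencies always sum to 1.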
The frequencies of the first allele at each locus, pA1 for the first locus and pB1 for the second, are
pA1 = P11 + P12 (always 1 for the first subscript, but summed over the second)
pB1 = P11 + P21 (always 1 for the second subscript, but summed over the first)
The linkage disequilibrium parameter D is defined as
D = P11P22 - P12P21
D measures the extent to which the alleles at the A locus are correlated with those at the B locus. If the alleles at the two loci
are independent (as would be expected if they are on different chromosomes or are located far apart), D = 0.
Substituting in
P12 = pA1 - P11
P21 = pB1 - P11
P22 = 1 - P11 - P12 - P21 = 1 - pA1 - pB1 + P11
gives an alternative expression for D:
D = P11 - pA1pB1
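As a quick numerical check that the two expressions for D agree, the sketch below evaluates both for one set of haplotype frequencies. The function name `d_two_ways` and the example frequencies are illustrative assumptions.

```python
def d_two_ways(p11, p12, p21, p22):
    """Return D computed from both expressions, as a pair."""
    pa1 = p11 + p12  # frequency of allele A1
    pb1 = p11 + p21  # frequency of allele B1
    d_haplotypes = p11 * p22 - p12 * p21
    d_alleles = p11 - pa1 * pb1
    return d_haplotypes, d_alleles


# Example frequencies summing to 1: both expressions give D = 0.1
# (up to floating-point rounding).
d_hap, d_all = d_two_ways(0.4, 0.1, 0.2, 0.3)
```

The agreement relies on the four frequencies summing to 1, which is the constraint used in the substitution for P22.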
As estimates for pA1 and pB1 can be obtained from the allele frequencies at the individual sites, it remains to get an estimate for P11.
With the same three substitutions for P12, P21, P22, the above equation for P11 becomes
P11 = [2N11 + N12 + N21 + N22 P11(1 - pA1 - pB1 + P11) / (P11(1 - pA1 - pB1 + P11) + (pA1 - P11)(pB1 - P11))] / 2N
This is a cubic equation for P11. We solve it by making an initial estimate for P11, substituting it into the right-hand
side of the equation to get a new P11, and repeating the substitution until the result converges. The initial estimate is taken as
pA1pB1, the value P11 would have if D = 0.
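The iteration just described can be sketched as follows. The function name `estimate_p11`, the convergence tolerance, and the iteration cap are assumptions made for the example; the text does not specify the stopping rule actually used. The allele frequencies are taken from the single-site genotype counts, as described above.

```python
def estimate_p11(counts, tol=1e-10, max_iter=1000):
    """Estimate P11 by repeated substitution; returns (P11, pA1, pB1).

    counts is the 3x3 genotype table [[N11, N12, N13], [N21, N22, N23],
    [N31, N32, N33]]. tol and max_iter are illustrative choices.
    """
    (n11, n12, n13), (n21, n22, n23), (n31, n32, n33) = counts
    n = sum(sum(row) for row in counts)
    if n == 0:
        raise ValueError("the genotype table is empty")
    # Allele frequencies from the individual sites (row and column totals).
    pa1 = (2 * (n11 + n12 + n13) + (n21 + n22 + n23)) / (2 * n)
    pb1 = (2 * (n11 + n21 + n31) + (n12 + n22 + n32)) / (2 * n)
    p11 = pa1 * pb1  # initial estimate, corresponding to D = 0
    for _ in range(max_iter):
        first_pair = p11 * (1 - pa1 - pb1 + p11)   # P11 * P22
        second_pair = (pa1 - p11) * (pb1 - p11)    # P12 * P21
        denom = first_pair + second_pair
        # Assumption for this sketch: drop the ambiguous term if denom is 0.
        ambiguous = n22 * first_pair / denom if denom > 0 else 0.0
        new_p11 = (2 * n11 + n12 + n21 + ambiguous) / (2 * n)
        if abs(new_p11 - p11) < tol:
            return new_p11, pa1, pb1
        p11 = new_p11
    return p11, pa1, pb1  # not converged within max_iter


p11, pa1, pb1 = estimate_p11([[4, 0, 0], [0, 0, 0], [0, 0, 6]])
d = p11 - pa1 * pb1
```

In this example there are no double heterozygotes, so the estimate settles after one substitution at P11 = 0.4, with pA1 = pB1 = 0.4 and D = 0.24.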
The r² values we use are the squares of the Pearson correlation coefficients, and are related to D by
r² = D² / [pA1(1 - pA1) pB1(1 - pB1)]
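A minimal sketch of this last step is shown below. The function name `r_squared` is hypothetical, and returning NaN when either site is monomorphic (so that the denominator is zero) is an assumption made for the example rather than something the text specifies.

```python
def r_squared(d, pa1, pb1):
    """r^2 from D and the two allele frequencies.

    Returns NaN (an assumption of this sketch) when either site has only
    one allele, because the denominator is then zero.
    """
    denom = pa1 * (1 - pa1) * pb1 * (1 - pb1)
    if denom == 0:
        return float("nan")
    return d * d / denom


# Two sites in complete association: D = 0.24 with pA1 = pB1 = 0.4 gives
# r^2 of 1 (up to floating-point rounding).
complete = r_squared(0.24, 0.4, 0.4)
# Independent sites: D = 0 gives r^2 of 0.
independent = r_squared(0.0, 0.5, 0.5)
```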
References
W. G. Hill (1974) Estimation of linkage disequilibrium in randomly mating populations. Heredity 33:229. Note that there are a couple of errors in the equations.
These have been corrected in D. L. Hartl and A. G. Clark (1989) Principles of Population Genetics, second edition, Sinauer Associates, Inc.,
Sunderland, Massachusetts, p. 56.