The Coancestry Coefficient in Forensic Science
B.S. Weir
Program in Statistical Genetics, Department of Statistics, North Carolina State
University, Raleigh NC 27695-8203.
INTRODUCTION
When the DNA profile in a stain at the scene of a crime matches the defendant's DNA profile, the prosecution and defense will have alternative explanations for this genetic evidence E. The trial is to determine which explanation, Hp or Hd, is to be accepted. For example, the explanations may be
Hp: crime scene DNA is from defendant
Hd: crime scene DNA is not from defendant
The probabilities of the evidence under the two explanations are compared by means of the likelihood ratio
$$\mathrm{LR} = \frac{\Pr(E \mid H_p)}{\Pr(E \mid H_d)}$$
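If Gs, Gc are the (matching) profile types of the suspect and the crime sample, the likelihood ratio can be expressed as:
$$\mathrm{LR} = \frac{\Pr(G_s, G_c \mid H_p)}{\Pr(G_s, G_c \mid H_d)} = \frac{1}{\Pr(G_c \mid G_s, H_d)}$$
In this last expression, the conditional probability $\Pr(G_c \mid G_s, H_d)$ does not reduce to the profile probability $\Pr(G_c)$ when the two profiles are dependent.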
There is a dependency between the profiles of two people in the same subpopulation simply because the subpopulation is finite. Two people have four parents between them, eight grandparents, and so on. At x generations in the past there are $2 \times 2^x = 2^{x+1}$ ancestors of two presently living people, but not all these ancestors can be distinct because $2^{x+1}$ will quickly exceed the number of people living at that time. Any two people today have common ancestors at some point in the past, and the most recent such ancestor may have been only a few tens of generations, or hundreds of years, ago.
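If two people are within the same subpopulation, but frequencies pi of alleles Ai are available only for a collection of subpopulations, then the probabilities that one person has a particular genotype given that the other has been found to have that type are (Balding and Nichols, 1994):
$$\Pr(A_iA_i \mid A_iA_i) = \frac{[2\theta + (1-\theta)p_i][3\theta + (1-\theta)p_i]}{(1+\theta)(1+2\theta)}$$
$$\Pr(A_iA_j \mid A_iA_j) = \frac{2[\theta + (1-\theta)p_i][\theta + (1-\theta)p_j]}{(1+\theta)(1+2\theta)} \tag{1}$$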
These are the equations referred to in Recommendation 4.2 of the 1996 NRC report (National Research Council, 1996). It should be stressed that the results hold for two people in the same subpopulation, but are an average over subpopulations. The allele frequencies pi are an average over subpopulations, and are not those in a particular subpopulation. Equations 1 therefore allow population-wide allele frequencies to be used for subpopulations for which theta applies. Given this central role of theta in the interpretation of forensic DNA evidence, it is of interest to discuss the population genetic meaning of the parameter.
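As a minimal sketch, the Balding and Nichols (1994) conditional probabilities for a matching homozygote AiAi and a matching heterozygote AiAj can be evaluated numerically. The function names are illustrative and do not come from the paper; p_i and p_j are the population-wide allele frequencies, and the likelihood ratio is taken as the reciprocal of the conditional probability.

```python
def match_prob_homozygote(p_i, theta):
    """Pr(AiAi | AiAi) for two people in the same subpopulation."""
    numerator = (2 * theta + (1 - theta) * p_i) * (3 * theta + (1 - theta) * p_i)
    return numerator / ((1 + theta) * (1 + 2 * theta))


def match_prob_heterozygote(p_i, p_j, theta):
    """Pr(AiAj | AiAj) for two people in the same subpopulation."""
    numerator = 2 * (theta + (1 - theta) * p_i) * (theta + (1 - theta) * p_j)
    return numerator / ((1 + theta) * (1 + 2 * theta))


def likelihood_ratio(match_prob):
    """Single-locus LR: reciprocal of the conditional match probability."""
    return 1.0 / match_prob


if __name__ == "__main__":
    # Heterozygote with equal allele frequencies of 0.1, several theta values.
    for theta in (0.0, 0.001, 0.01):
        lr = likelihood_ratio(match_prob_heterozygote(0.1, 0.1, theta))
        print(f"theta = {theta}: LR = {lr:.1f}")
```

With theta = 0 both functions fall back to the product-rule values, p_i squared and 2 p_i p_j, which is a quick check that the coding is right.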
MEANING OF THETA
There are several ways of describing the meaning of theta, but they all refer to the relationship of pairs of alleles within subpopulations.
Identity by descent
Alleles that descend from a single ancestral allele are said to be identical by descent, ibd, and theta can be defined as the probability that two alleles, one taken at random from each of two individuals, are ibd. The probability that an individual carries allele Ai, given that another individual in the same subpopulation has been found to carry that allele, is
$$\Pr(A_i \mid A_i) = \theta + (1-\theta)p_i \tag{2}$$
This result follows from the definition of theta as an ibd probability, or it could be used to define theta. It is, however, a result that holds as an average over all replicates of a subpopulation with this value of theta.
The concept of identity by descent is a probabilistic one: even if two alleles are known to have come from the same individual, there is only a 50% chance that they are copies of the same allele carried by that individual. This random choice of alleles for transmission from parent to offspring is one of the reasons why evolution is a stochastic process. The evolutionary outcome of a particular history for a population cannot be specified exactly, but the expected outcome(s) can be described. The conditional probability statement in Equation 2 is a statement about random pairs of individuals, averaged over subpopulations.
Variance among populations
A quite different approach uses the statistical concept of variance. If $p_{is}$ is the frequency of Ai in subpopulation s, then its expected value, or average, over subpopulations is
$$E[p_{is}] = p_i$$
and its variance over subpopulations is (Cockerham, 1969)
$$\mathrm{Var}(p_{is}) = p_i(1-p_i)\theta \tag{3}$$
Clearly, therefore, theta refers to variation among subpopulations, but this is consistent with the notion of relatedness within subpopulations. If a population is divided into a number of isolated subpopulations, over time the process of genetic drift causes differences in allele frequencies to arise among the subpopulations. At the same time there is a degree of relatedness building up within the subpopulations because of the finite size of each one. The outcomes of people becoming more related within subpopulations and more distinct genetically among subpopulations are two manifestations of the same process.
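The variance definition, Var(p_is) = p_i (1 - p_i) theta, can be illustrated with a naive moment calculation. This is only a sketch under strong assumptions: the subpopulation frequencies are treated as known exactly, the subpopulations are weighted equally, and the function name is hypothetical. Estimation from real samples needs the corrections described in Weir (1996).

```python
from statistics import mean, pvariance


def theta_from_frequencies(freqs):
    """Ratio of the among-subpopulation variance of an allele frequency
    to p(1 - p), where p is the unweighted mean frequency."""
    p_bar = mean(freqs)
    if p_bar <= 0.0 or p_bar >= 1.0:
        raise ValueError("allele must be polymorphic in the whole population")
    return pvariance(freqs, mu=p_bar) / (p_bar * (1 - p_bar))


if __name__ == "__main__":
    # Hypothetical frequencies of one allele in four subpopulations.
    print(theta_from_frequencies([0.18, 0.22, 0.20, 0.20]))
```

Identical frequencies in every subpopulation give a value of zero, matching the idea that theta measures differentiation among subpopulations.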
The three F-statistics
An advantage of the formulation of Equation 3 is that it makes clear that estimation of theta requires data from more than one subpopulation. It is not possible to estimate variation in allele frequencies from a single observed frequency. Data from a single subpopulation, however, do allow estimation of the within-subpopulation inbreeding coefficient f. For subpopulation s, this quantity allows the frequencies of AiAi homozygotes and AiAj heterozygotes to be expressed as
$$P_{iis} = p_{is}^2 + f\,p_{is}(1-p_{is}), \qquad P_{ijs} = 2p_{is}p_{js}(1-f) \tag{4}$$
and observed genotypic frequencies within a subpopulation allow estimation of allele frequencies and inbreeding coefficient for that subpopulation. If the same value of f is assumed to hold in all subpopulations, then the average of Equation 4 over subpopulations provides the population-wide genotype frequencies Pii, Pij:
$$P_{ii} = p_i^2 + p_i(1-p_i)[\theta + f(1-\theta)], \qquad P_{ij} = 2p_ip_j(1-f)(1-\theta) \tag{5}$$
where use has been made of
$$E[p_{is}^2] = p_i^2 + p_i(1-p_i)\theta$$
from Equation 3. Now it is known (Cockerham, 1969) that
$$1 - F = (1-f)(1-\theta)$$
where F is the total inbreeding coefficient. This leads to
$$P_{ii} = p_i^2 + p_i(1-p_i)F, \qquad P_{ij} = 2p_ip_j(1-F)$$
The quantities F, f, theta are known as F-statistics. They are essentially equivalent to the quantities FIT, FIS, FST of Wright (1951).
Equation 5 gives the probabilities of any individual in the whole population having genotype AiAi or AiAj when pi, pj are population-wide allele frequencies. Even if each subpopulation had Hardy-Weinberg genotype frequencies, so that f = 0, any variation of allele frequencies among subpopulations will cause a departure from Hardy-Weinberg (F ≠ 0) in the whole population. This is the basis for NRC recommendation 4.1, although unfortunately that report used the symbol theta instead of F and thus blurred the essential difference between Equations 1 and 5. The first provides the conditional genotype probabilities necessary for forensic calculations, whereas the second gives single genotype probabilities for which there is unlikely to be a forensic need. The confusion in notation is unlikely to have a practical significance, because f is very low in most human populations and, therefore, F and theta are close in numerical value.
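The link among the three F-statistics can be sketched in a few lines, assuming the standard relation 1 - F = (1 - f)(1 - theta) and population-wide genotype frequencies of the form p_i^2 + p_i(1 - p_i)F for homozygotes and 2 p_i p_j (1 - F) for heterozygotes. Function names are illustrative only.

```python
def total_inbreeding(f, theta):
    """Total inbreeding coefficient F from 1 - F = (1 - f)(1 - theta)."""
    return 1 - (1 - f) * (1 - theta)


def population_genotype_frequencies(p_i, p_j, f, theta):
    """Population-wide frequencies of AiAi homozygotes and AiAj
    heterozygotes, using population-wide allele frequencies p_i, p_j."""
    F = total_inbreeding(f, theta)
    P_ii = p_i ** 2 + p_i * (1 - p_i) * F
    P_ij = 2 * p_i * p_j * (1 - F)
    return P_ii, P_ij


if __name__ == "__main__":
    # Hardy-Weinberg within subpopulations (f = 0) still gives F = theta overall.
    print(total_inbreeding(0.0, 0.03))
    print(population_genotype_frequencies(0.1, 0.2, 0.0, 0.03))
```

Setting f = 0 shows directly that F equals theta, which is why the two symbols are numerically close when within-subpopulation inbreeding is low.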
Genetic distances
With the ibd probability approach it is possible to predict the behavior of theta over time. In the absence of evolutionary forces such as selection or mutation, genetic drift in a population of size N individuals causes the value in generation t, written as theta(t), to change according to
$$\theta(t+1) = \frac{1}{2N} + \left(1 - \frac{1}{2N}\right)\theta(t)$$
For large values of N and for relatively small values of t (those appropriate for the divergence times of human populations), it is usual (Weir, 1996) to take
$$\theta(t) \approx \frac{t}{2N}$$
Here then is another meaning of theta. Because it is proportional to time, it can serve as a distance between populations. If populations of size N = 100,000 diverged from each other t = 10,000 generations ago, they would have a distance of theta = 0.05 between them. This means that, within each of those populations, if account is taken of all the previous 10,000 generations, there is one chance in 20 that two alleles have a single ancestral allele. Alternatively, there is a 95% chance that they descend from different alleles among the 200,000 alleles there were 200,000 years ago. To be more accurate, N is the inbreeding effective population size, and it means that theta is increasing as though there is a probability of 1/2N of any two alleles in one generation descending from the same allele in the previous generation. For human populations that have been expanding over time, N is a harmonic mean of the effective sizes, but 100,000 is generally regarded as being a realistic size for human populations.
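The arithmetic in this paragraph can be checked with a short sketch. It assumes pure drift with constant effective size N, a starting value theta(0) = 0, and the one-generation ibd probability of 1/2N mentioned above; the function names are hypothetical.

```python
def theta_exact(t, N, theta0=0.0):
    """theta after t generations of drift, from repeated application of
    1 - theta(t+1) = (1 - 1/(2N)) * (1 - theta(t))."""
    return 1 - (1 - theta0) * (1 - 1 / (2 * N)) ** t


def theta_linear(t, N):
    """Linear approximation t / (2N), adequate when t is small relative to N."""
    return t / (2 * N)


def generations_for_theta(theta, N):
    """Invert the linear approximation: generations needed to reach theta."""
    return 2 * N * theta


if __name__ == "__main__":
    N, t = 100_000, 10_000
    print(theta_linear(t, N))             # 0.05, the distance quoted in the text
    print(theta_exact(t, N))              # slightly smaller than the linear value
    print(generations_for_theta(1 / 64, N))  # 3125.0
```

The exact and linear values differ only in the third decimal place for these inputs, which is the sense in which the linear form is usual for human divergence times.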
Hierarchical theta values
Cavalli-Sforza et al. (1994) used theta to measure distances among human populations. These authors described current understanding of the history of modern humans, with a split between Africans and non-Africans about 100,000 years ago, followed by splitting of non-Africans into Europeans and North Asians versus Southeast Asians and Australians and so on. This human history can be represented as a tree or as a hierarchical structure. There is a corresponding structure for F-statistics. Consider a four-level hierarchy (Weir, 1996) consisting of populations P, subpopulations S within populations, sub-subpopulations SS within subpopulations, and alleles A within sub-subpopulations. The relationship between pairs of alleles can be described as
theta: alleles within the same sub-subpopulation
thetas: alleles from different sub-subpopulations within the same subpopulation
thetap: alleles from different sub-subpopulations from different subpopulations within the same population
0: alleles from different sub-subpopulations from different subpopulations from different populations
The methods described in Weir (1996) can be used to estimate the three coancestry coefficients, under the assumption that alleles from different populations are unrelated (or that relationships are relative to that between populations - Cockerham, 1969). Of more interest here is the situation when data are collected from two local sub-subpopulations and used to estimate the distance between them. The analyses essentially compare the variation between the two sub-subpopulations to that within each of them. Very approximately, for one allele the quantity calculated is the square of the difference in allele frequencies divided by twice the product of the average and one minus the average. In this framework, there are three kinds of analyses:
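Sub-subpopulations from the same subpopulation: theta relative to thetas, (theta - thetas)/(1 - thetas), is being estimated. This serves as a measure of distance between these most closely related sub-subpopulations, and is proportional to the time since sub-subpopulations diverged from each other within subpopulations.
Sub-subpopulations from different subpopulations within the same population: theta relative to thetap, (theta - thetap)/(1 - thetap), is being estimated. This serves as a measure of distance between these next most closely related sub-subpopulations, and is proportional to the time since subpopulations diverged from each other within populations.
Sub-subpopulations from different populations: theta itself is being estimated, since alleles from different populations are taken to be unrelated. This serves as a measure of distance between the most distantly related sub-subpopulations, and is proportional to the time since populations diverged from each other.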
Although the computational details are the same for each of these three cases, it is a little misleading to use the same symbol theta regardless of how distantly related are the groups of individuals being compared.
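The "very approximate" single-allele calculation described above, the squared difference in allele frequencies divided by twice the product of the average and one minus the average, can be written out as a short sketch. The function name is hypothetical, and sample-size corrections used in a full analysis (Weir, 1996) are deliberately left out.

```python
def pairwise_theta(p1, p2):
    """Rough two-group distance for one allele:
    (p1 - p2)^2 / (2 * p_bar * (1 - p_bar)), with p_bar the average frequency."""
    p_bar = (p1 + p2) / 2
    if p_bar <= 0.0 or p_bar >= 1.0:
        raise ValueError("allele must be polymorphic across the two groups")
    return (p1 - p2) ** 2 / (2 * p_bar * (1 - p_bar))


if __name__ == "__main__":
    # Hypothetical frequencies of one allele in two sub-subpopulations.
    print(pairwise_theta(0.20, 0.25))
```

The same function applies to all three kinds of comparison; only the interpretation of the result changes with how distantly related the two groups are.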
Family relationships
The final meaning attached to theta has to do with ibd relationships between immediate family members. There is a simple counting rule to determine the probability that an allele taken from one individual, X, is ibd to an allele taken at random from a relative, Y. The family pedigree linking X and Y is examined for ancestors A they have in common. The number of people in the chain linking X,Y through A, including X,Y themselves, is counted. Call this $n_A$. Then the coancestry coefficient of X and Y is
$$\theta_{XY} = \sum_A \left(\frac{1}{2}\right)^{n_A}$$
where the summation is over all ancestors in common to X,Y and these ancestors are assumed to be not inbred. For example, if X,Y are cousins, they have two grandparents in common. There are five people in each of the two chains linking X,Y through one of their parents to each grandparent. Therefore the coancestry coefficient is 1/16. The children of cousins are related as second cousins, and will be linked to two great-grandparents with chains of length seven. Their coancestry is 1/64 ≈ 0.01. A person and the child of his cousin are termed first cousins once removed. There are two chains of length 6 linking those people, so they have a coancestry of 1/32 ≈ 0.03.
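The counting rule is easy to mechanize. The sketch below assumes the chain lengths have already been read off the pedigree by hand and that the common ancestors are not inbred; the function name is illustrative.

```python
from fractions import Fraction


def coancestry(chain_lengths):
    """Sum of (1/2)^n over the chains linking X and Y through each
    common ancestor, where n counts the people in the chain."""
    return sum((Fraction(1, 2) ** n for n in chain_lengths), Fraction(0))


if __name__ == "__main__":
    print(coancestry([5, 5]))  # first cousins: 1/16
    print(coancestry([6, 6]))  # first cousins once removed: 1/32
    print(coancestry([7, 7]))  # second cousins: 1/64
```

Exact fractions are used so that the results can be compared directly with the values quoted in the text. As a further check, full siblings have two parents in common with chains of length three, giving 1/4.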
What is the relationship between these family-based theta values and those that apply to random alleles from large populations? The family value for second cousins refers to the ibd relationship due to descent from the same great-grandparents. It ignores the ibd status due to common ancestors on an evolutionary time scale. Different groups of second cousins within the same population have different sets of allele frequencies because of the restriction they have to the alleles carried by 14 great-grandparents instead of the 16 great-grandparents that unrelated people have. These groups have a genetic distance of theta = 1/64 between them. The same degree of difference applies to two populations of size N = 100,000 that have been separated for t = 200,000/64 ≈ 3,000 generations or 60,000 years. This is getting close to the degree of genetic separation between Africans and non-Africans, the largest possible difference for modern humans.
VALUES OF THETA
The most extensive compilation of human genetic data has been given by Cavalli-Sforza et al. (1994). This book
Some summary genetic distances, or "theta" values, reported by Cavalli-Sforza et al. are:
These values were based on protein variants. A survey of current forensic marker data has been prepared by B.S. Weir et al. (in preparation) and shows that:
Bearing in mind that Equation 1 is intended for individuals within the same subpopulation of a single racial group, the NRC recommendation that theta should be set to 0.01 or 0.03 is seen to be quite conservative. These recommendations are equivalent to assuming that a single racial group consists of isolated groups of second cousins or first cousins once removed.
EFFECTS OF THETA
What effect does allowing for population substructure have on forensic calculations? It requires that the conditional profile probabilities in Equation 1 be used instead of simple product-rule profile probabilities.
For heterozygotes between alleles with equal frequencies p, the LR (i.e. the reciprocal of the conditional probabilities) for various values of theta are:
|          | theta = 0 | theta = 0.001 | theta = 0.01 | theta = 0.05 |
| p = 0.01 | 5,000     | 4,152         | 1,295        | 153          |
| p = 0.05 | 200       | 193           | 145          | 58           |
| p = 0.10 | 50        | 49            | 43           | 27           |
The effects of theta decrease as allele frequencies increase and are not substantial when p = 0.1 even for theta as high as 0.01.
For homozygotes, the ratios of the conditional probabilities to the simple product-rule probabilities are
$$\frac{[2\theta + (1-\theta)p][3\theta + (1-\theta)p]}{(1+\theta)(1+2\theta)\,p^2}$$
and some numerical values are
|          | theta = 0.01 | theta = 0.03 |
| p = 0.01 | 11.6         | 67.4         |
| p = 0.10 | 1.49         | 2.22         |
| p = 0.50 | 1.04         | 1.14         |
The NRC recommended value of theta = 0.03 causes a halving per locus of the numerical weight of matching homozygotes when p = 0.1.
ACKNOWLEDGEMENTS
This work was supported in part by NIH grant GM45344. I am grateful for the assistance of Dr. James Curran.
REFERENCES