
 

The Coancestry Coefficient in Forensic Science

 

B.S. Weir
Program in Statistical Genetics, Department of Statistics, North Carolina State University, Raleigh NC 27695-8203.


INTRODUCTION

When the DNA profile in a stain at the scene of a crime matches the defendant's DNA profile, the prosecution and defense will have alternative explanations for this genetic evidence E. The trial is to determine which explanation, Hp or Hd, is to be accepted. For example, the explanations may be

Hp: crime scene DNA is from defendant

Hd: crime scene DNA is not from defendant

The probabilities of the evidence under the two explanations are compared by means of the likelihood ratio

\[ \mathrm{LR} = \frac{\Pr(E \mid H_p)}{\Pr(E \mid H_d)} \]

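If Gs, Gc are the (matching) profile types of the suspect and the crime sample, the likelihood ratio can be expressed as:

\[ \mathrm{LR} = \frac{\Pr(G_s, G_c \mid H_p)}{\Pr(G_s, G_c \mid H_d)} = \frac{\Pr(G_c \mid G_s, H_p)\,\Pr(G_s \mid H_p)}{\Pr(G_c \mid G_s, H_d)\,\Pr(G_s \mid H_d)} = \frac{1}{\Pr(G_c \mid G_s, H_d)} \]

In this last expression, the conditional probability \(\Pr(G_c \mid G_s, H_d)\) does not reduce to the profile probability \(\Pr(G_c)\) when the profiles Gs and Gc are dependent.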

There is a dependency between the profiles of two people in the same subpopulation simply because the subpopulation is finite. Two people have four parents between them, eight grandparents, and so on. At x generations in the past there are 2 × 2^x ancestors of two presently living people, but not all these ancestors can be distinct because 2^(x+1) will quickly exceed the number of people living at that time. Any two people today have common ancestors at some point in the past, and the most recent such ancestor may have been only a few tens of generations, or hundreds of years, ago.

If two people are within the same subpopulation, but frequencies pi of alleles Ai are available only for a collection of subpopulations, then the probabilities that one person has a particular genotype given that the other has been found to have that type are (Balding and Nichols, 1994):

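\[ \Pr(A_iA_i \mid A_iA_i) = \frac{[2\theta + (1-\theta)p_i][3\theta + (1-\theta)p_i]}{(1+\theta)(1+2\theta)} \]

\[ \Pr(A_iA_j \mid A_iA_j) = \frac{2[\theta + (1-\theta)p_i][\theta + (1-\theta)p_j]}{(1+\theta)(1+2\theta)} \qquad (1) \]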

These are the equations referred to in Recommendation 4.2 of the 1996 NRC report (National Research Council, 1996). It should be stressed that the results hold for two people in the same subpopulation, but are an average over subpopulations. The allele frequencies pi are an average over subpopulations, and are not those in a particular subpopulation. Equations 1 therefore allow population-wide allele frequencies to be used for subpopulations for which theta applies. Given this central role of theta in the interpretation of forensic DNA evidence, it is of interest to discuss the population genetic meaning of the parameter.
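As an illustration of how these conditional probabilities behave, the following is a minimal Python sketch. It assumes the usual published form of the Balding and Nichols (1994) match probabilities; the function names and the example values (p = 0.1, theta = 0.03) are hypothetical and chosen only for illustration.

```python
def match_prob_homozygote(p_i, theta):
    """Pr(AiAi | AiAi) for two people in the same subpopulation.

    p_i is the population-wide frequency of allele Ai (assumed form of
    the Balding-Nichols homozygote expression).
    """
    num = (2 * theta + (1 - theta) * p_i) * (3 * theta + (1 - theta) * p_i)
    return num / ((1 + theta) * (1 + 2 * theta))


def match_prob_heterozygote(p_i, p_j, theta):
    """Pr(AiAj | AiAj) for two people in the same subpopulation."""
    num = 2 * (theta + (1 - theta) * p_i) * (theta + (1 - theta) * p_j)
    return num / ((1 + theta) * (1 + 2 * theta))


def likelihood_ratio(match_prob):
    """LR as the reciprocal of the conditional match probability."""
    return 1.0 / match_prob


if __name__ == "__main__":
    # Illustrative values only, not taken from any database.
    for theta in (0.0, 0.01, 0.03):
        m = match_prob_homozygote(0.1, theta)
        print(theta, m, likelihood_ratio(m))
```

With theta = 0 the expressions collapse to the Hardy-Weinberg values p_i^2 and 2 p_i p_j; any positive theta raises the match probability and so lowers the likelihood ratio.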

MEANING OF THETA

There are several ways of describing the meaning of theta, but they all refer to the relationship of pairs of alleles within subpopulations.

Identity by descent

Alleles that descend from a single ancestral allele are said to be identical by descent, ibd, and theta can be defined as the probability that two alleles, one taken at random from each of two individuals, are ibd. The probability that an individual carries allele Ai, given that another individual in the same subpopulation has been found to carry that allele, is

\[ \Pr(A_i \mid A_i) = \theta + (1 - \theta)p_i \qquad (2) \]

This result follows from the definition of theta as an ibd probability, or it could be used to define theta. It is, however, a result that holds as an average over all replicates of a subpopulation with this value of theta.

The concept of identity by descent is a probabilistic one: even if two alleles are known to have come from the same individual, there is only a 50% chance that they are copies of the same allele carried by that individual. This random choice of alleles for transmission from parent to offspring is one of the reasons why evolution is a stochastic process. The evolutionary outcome of a particular history for a population cannot be specified exactly, but the expected outcome(s) can be described. The conditional probability statement in Equation 2 is a statement about random pairs of individuals, averaged over subpopulations.

Variance among populations

A quite different approach uses the statistical concept of variance. If pis is the frequency of Ai in subpopulation s, then its expected value, or average, over subpopulations is

E[ pis] = pi

and its variance over subpopulations is (Cockerham, 1969)

\[ \mathrm{Var}(p_{is}) = p_i(1 - p_i)\theta \qquad (3) \]

Clearly, therefore, theta refers to variation among subpopulations, but this is consistent with the notion of relatedness within subpopulations. If a population is divided into a number of isolated subpopulations, over time the process of genetic drift causes differences in allele frequencies to arise among the subpopulations. At the same time there is a degree of relatedness building up within the subpopulations because of the finite size of each one. The outcomes of people becoming more related within subpopulations and more distinct genetically among subpopulations are two manifestations of the same process.
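The variance definition can be turned into a rough numerical illustration. The sketch below assumes the relation Var(p_is) = p_i(1 - p_i) theta and simply solves it for theta using the sample variance of a handful of subpopulation frequencies. It ignores sampling error within subpopulations, so it is a naive moment calculation and not the estimator described in Weir (1996); the function name and the example frequencies are hypothetical.

```python
from statistics import mean, variance


def theta_from_variance(sub_freqs):
    """Naive moment value of theta for one allele.

    sub_freqs: frequencies of the same allele in two or more
    subpopulations, treated here as if known without error.
    """
    p_bar = mean(sub_freqs)
    if not 0 < p_bar < 1:
        raise ValueError("average frequency must lie strictly between 0 and 1")
    # Sample variance among subpopulations divided by p(1 - p).
    return variance(sub_freqs, p_bar) / (p_bar * (1 - p_bar))


if __name__ == "__main__":
    print(theta_from_variance([0.1, 0.3]))       # two divergent subpopulations
    print(theta_from_variance([0.2, 0.2, 0.2]))  # no variation among them
```

The function needs at least two frequencies, which mirrors the point that variation among subpopulations cannot be assessed from a single one.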

The three F-statistics

An advantage of the formulation of Equation 3 is that it makes clear that estimation of theta requires data from more than one subpopulation. It is not possible to estimate variation in allele frequencies from a single observed frequency. Data from a single subpopulation, however, do allow estimation of the within-subpopulation inbreeding coefficient f. For subpopulation s, this quantity allows the frequencies of AiAi homozygotes and AiAj heterozygotes to be expressed as

\[ P_{iis} = p_{is}^2 + f\,p_{is}(1 - p_{is}), \qquad P_{ijs} = 2 p_{is} p_{js}(1 - f) \qquad (4) \]

and observed genotypic frequencies within a subpopulation allow estimation of allele frequencies and inbreeding coefficient for that subpopulation. If the same value of f is assumed to hold in all subpopulations, then the average of Equation 4 over subpopulations provides the population-wide genotype frequencies Pii, Pij:

\[ P_{ii} = p_i^2 + p_i(1 - p_i)[\theta + f(1 - \theta)], \qquad P_{ij} = 2 p_i p_j (1 - f)(1 - \theta) \qquad (5) \]

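where use has been made of \(E[p_{is}^2] = p_i^2 + p_i(1 - p_i)\theta\) from Equation 3, together with the corresponding result \(E[p_{is}p_{js}] = p_i p_j(1 - \theta)\) for pairs of alleles. Now it is known (Cockerham, 1969) that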

\[ (1 - F) = (1 - f)(1 - \theta) \]

where F is the total inbreeding coefficient. This leads to

\[ P_{ii} = p_i^2 + F\,p_i(1 - p_i), \qquad P_{ij} = 2 p_i p_j (1 - F) \]

The quantities F, f, theta are known as F-statistics. They are essentially equivalent to the quantities FIT, FIS, FST of Wright (1951).

Equation 5 gives the probabilities of any individual in the whole population having genotype AiAi or AiAj when pi, pj are population-wide allele frequencies. Even if each subpopulation had Hardy-Weinberg genotype frequencies, so that f = 0, any variation of allele frequencies among subpopulations will cause a departure from Hardy-Weinberg (F ≠ 0) in the whole population. This is the basis for NRC recommendation 4.1, although unfortunately that report used the symbol theta instead of F and thus blurred the essential difference between Equations 1 and 5. The first provides the conditional genotype probabilities necessary for forensic calculations, whereas the second gives single genotype probabilities for which there is unlikely to be a forensic need. The confusion in notation is unlikely to have a practical significance, because f is very low in most human populations and, therefore, F and theta are close in numerical value.
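The relationship among the three F-statistics can be checked numerically. The sketch below assumes the Cockerham (1969) identity (1 - F) = (1 - f)(1 - theta) and population-wide genotype frequencies of the form p_i^2 + F p_i(1 - p_i) and 2 p_i p_j (1 - F). The function names and the example values are hypothetical.

```python
def total_inbreeding(f, theta):
    """Total inbreeding coefficient F from f and theta,
    assuming (1 - F) = (1 - f)(1 - theta)."""
    return 1 - (1 - f) * (1 - theta)


def population_genotype_freqs(p_i, p_j, f, theta):
    """Population-wide frequencies of AiAi and AiAj genotypes.

    p_i, p_j are population-wide allele frequencies.
    Returns the pair (P_ii, P_ij).
    """
    F = total_inbreeding(f, theta)
    P_ii = p_i ** 2 + F * p_i * (1 - p_i)
    P_ij = 2 * p_i * p_j * (1 - F)
    return P_ii, P_ij


if __name__ == "__main__":
    # Hardy-Weinberg within subpopulations (f = 0) but theta = 0.03:
    # the whole population still departs from Hardy-Weinberg.
    print(total_inbreeding(0.0, 0.03))
    print(population_genotype_freqs(0.1, 0.2, 0.0, 0.03))
```

With f = 0 the total coefficient F equals theta, which is why the two symbols give nearly the same numbers when within-subpopulation inbreeding is very low.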

Genetic distances

With the ibd probability approach it is possible to predict the behavior of theta over time. In the absence of evolutionary forces such as selection or mutation, genetic drift in a population of size N individuals causes the value in generation t, written as theta(t), to change according to

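\[ \theta(t+1) = \frac{1}{2N} + \left(1 - \frac{1}{2N}\right)\theta(t), \quad \text{so that} \quad \theta(t) = 1 - \left(1 - \frac{1}{2N}\right)^{t} \ \text{when} \ \theta(0) = 0 \]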

For large values of N and for relatively small values of t (those appropriate for the divergence times of human populations), it is usual (Weir, 1996) to take

\[ \theta(t) \approx \frac{t}{2N} \]

Here then is another meaning of theta. Because it is proportional to time, it can serve as a distance between populations. If populations of size N = 100,000 diverged from each other t = 10,000 generations ago, they would have a distance of theta = 0.05 between them. This means that, within each of those populations, if account is taken of all the previous 10,000 generations, there is one chance in 20 that two alleles have a single ancestral allele. Alternatively, there is a 95% chance that they descend from different alleles among the 200,000 alleles there were 200,000 years ago. To be more accurate, N is the inbreeding effective population size, and it means that theta is increasing as though there is a probability of 1/2N of any two alleles in one generation descending from the same allele in the previous generation. For human populations that have been expanding over time, N is a harmonic mean of the effective sizes, but 100,000 is generally regarded as being a realistic size for human populations.
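The arithmetic in this paragraph can be reproduced with a short sketch. It assumes pure drift with theta(0) = 0, so that theta(t) = 1 - (1 - 1/2N)^t, and compares this with the linear approximation t/2N; the function names are hypothetical.

```python
def theta_exact(t, N, theta0=0.0):
    """theta after t generations of drift in a population of
    effective size N, starting from theta0 (assumed drift-only model)."""
    return 1 - (1 - theta0) * (1 - 1 / (2 * N)) ** t


def theta_approx(t, N):
    """Linear approximation theta = t / 2N for large N and small t."""
    return t / (2 * N)


def generations_for_theta(theta, N):
    """Invert the linear approximation: t = 2 N theta."""
    return 2 * N * theta


if __name__ == "__main__":
    N, t = 100_000, 10_000
    print(theta_approx(t, N))   # the 0.05 quoted in the text
    print(theta_exact(t, N))    # slightly smaller than the approximation
    print(generations_for_theta(0.05, N))
```

The exact value falls a little below t/2N because the approximation drops the higher-order terms of the expansion, which matter more as t/2N grows.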

Hierarchical theta values

Cavalli-Sforza et al. (1994) used theta to measure distances among human populations. These authors described current understanding of the history of modern humans, with a split between Africans and non-Africans about 100,000 years ago, followed by splitting of non-Africans into Europeans and North Asians versus Southeast Asians and Australians and so on. This human history can be represented as a tree or as a hierarchical structure. There is a corresponding structure for F-statistics. Consider a four-level hierarchy (Weir, 1996) consisting of populations P, subpopulations S within populations, sub-subpopulations SS within subpopulations, and alleles A within sub-subpopulations. The relationship between pairs of alleles can be described as

thetass: alleles within the same subsubpopulation

thetas: alleles from different sub-subpopulations within the same subpopulation

thetap: alleles from different sub-subpopulations from different subpopulations within the same population

0: alleles from different sub-subpopulations from different subpopulations from different populations

The methods described in Weir (1996) can be used to estimate the three coancestry coefficients, under the assumption that alleles from different populations are unrelated (or that relationships are relative to that between populations - Cockerham, 1969). Of more interest here is the situation when data are collected from two local sub-subpopulations and used to estimate the distance between them. The analyses essentially compare the variation between the two sub-subpopulations to that within each of them. Very approximately, for one allele the quantity calculated is the square of the difference in allele frequencies divided by twice the product of the average and one minus the average. In this framework, there are three kinds of analyses:

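If the two sub-subpopulations are from the same subpopulation, (thetass - thetas)/(1 - thetas) is being estimated. This serves as a measure of distance between these most closely related sub-subpopulations, and is proportional to the time since sub-subpopulations diverged from each other within subpopulations.

If the two sub-subpopulations are from different subpopulations within the same population, (thetass - thetap)/(1 - thetap) is being estimated. This serves as a measure of distance between these next most closely related sub-subpopulations, and is proportional to the time since subpopulations diverged from each other within populations.

If the two sub-subpopulations are from different populations, thetass is being estimated. This serves as a measure of distance between these least closely related sub-subpopulations, and is proportional to the time since the populations diverged from each other.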

Although the computational details are the same for each of these three cases, it is a little misleading to use the same symbol theta regardless of how distantly related are the groups of individuals being compared.
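The "very approximate" single-allele calculation described above, the squared difference in allele frequencies divided by twice the product of the average and one minus the average, can be written out directly. This sketch is only that verbal rule and not the full estimator of Weir (1996), which also accounts for sample sizes and combines information over alleles; the function name and example frequencies are hypothetical.

```python
def approx_pairwise_theta(p1, p2):
    """Rough single-allele distance between two groups.

    p1, p2: frequencies of the same allele in the two groups.
    Implements (p1 - p2)^2 / (2 * p_bar * (1 - p_bar)).
    """
    p_bar = (p1 + p2) / 2
    if not 0 < p_bar < 1:
        raise ValueError("average frequency must lie strictly between 0 and 1")
    return (p1 - p2) ** 2 / (2 * p_bar * (1 - p_bar))


if __name__ == "__main__":
    print(approx_pairwise_theta(0.10, 0.30))
    print(approx_pairwise_theta(0.20, 0.22))
```

The same function applies whichever level of the hierarchy the two groups come from, which is the sense in which the computational details do not change while the quantity being estimated does.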

Family relationships

The final meaning attached to theta has to do with ibd relationships between immediate family members. There is a simple counting rule to determine the probability that an allele taken from one individual, X, is ibd to an allele taken at random from a relative, Y. The family pedigree linking X and Y is examined for ancestors A they have in common. The number of people in the chain linking X,Y through A, including X,Y themselves, is counted. Call this nA . Then the coancestry coefficient of X and Y is

\[ \theta_{XY} = \sum_{A} \left(\frac{1}{2}\right)^{n_A} \]

where the summation is over all ancestors in common to X,Y and these ancestors are assumed to be not inbred. For example, if X,Y are cousins, they have two grandparents in common. There are five people in each of the two chains linking X,Y through one of their parents to each grandparent. Therefore the coancestry coefficient is 1/16. The children of cousins are related as second cousins, and will be linked to two great-grandparents with chains of length seven. Their coancestry is 1/64 ≈ 0.016. A person and the child of his cousin are termed first cousins once removed. There are two chains of length 6 linking those people, so they have a coancestry of 1/32 ≈ 0.03.
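The counting rule can be checked against the worked examples with a few lines of Python. The sketch assumes non-inbred common ancestors and takes as input the number of people in each chain linking X and Y through a common ancestor; the function name is hypothetical.

```python
from fractions import Fraction


def coancestry(chain_lengths):
    """Coancestry of X and Y from the chain-counting rule.

    chain_lengths: one entry n_A per common ancestor A, counting the
    people in the chain from X through A to Y, including X and Y.
    Common ancestors are assumed not to be inbred.
    """
    return sum(Fraction(1, 2) ** n for n in chain_lengths)


if __name__ == "__main__":
    print(coancestry([5, 5]))   # first cousins
    print(coancestry([7, 7]))   # second cousins
    print(coancestry([6, 6]))   # first cousins once removed
```

Exact fractions are used so that the results can be compared directly with the values 1/16, 1/64 and 1/32 quoted in the text.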

What is the relationship between these family-based theta values and those that apply to random alleles from large populations? The family value for second cousins refers to the ibd relationship due to descent from the same great-grandparents. It ignores the ibd status due to common ancestors on an evolutionary time scale. Different groups of second cousins within the same population have different sets of allele frequencies because of the restriction they have to the alleles carried by 14 great-grandparents instead of the 16 great-grandparents that unrelated people have. These groups have a genetic distance of theta = 1/64 between them. The same degree of difference applies to two populations of size N = 100,000 that have been separated for t = 200,000/64 ≈ 3,000 generations or 60,000 years. This is getting close to the degree of genetic separation between Africans and non-Africans, the largest possible difference for modern humans.

VALUES OF THETA

The most extensive compilation of human genetic data has been given by Cavalli-Sforza et al. (1994). This book

Some summary genetic distances, or "theta" values, reported by Cavalli-Sforza et al. are:

