STRBase: A Short Tandem Repeat DNA Internet-Accessible Database
John M. Butler§, Christian M. Ruitberg, and Dennis J. Reeder
National Institute of Standards and Technology, Biotechnology Division, Gaithersburg,
MD 20899
§Current address: GeneTrace Systems, Inc., 333 Ravenswood Avenue, Menlo Park,
CA 94025
ABSTRACT
An internet-accessible database containing information about commonly used short tandem repeat (STR) DNA markers is now available to benefit research and application of STRs to human identity testing. A comprehensive listing of over 500 references focused on STRs is compiled in STRBase. Facts and sequence information on STR systems are included along with population data, common multiplex STR systems, and published PCR primers. In addition, STRBase contains a review of various technologies for resolving STR alleles as well as a summary of validation studies performed on various STR loci. Hyperlinks to paternity testing laboratory web sites and forensic organizations are also a part of this database. STRBase may be accessed via the World Wide Web at http://ibm4.carb.nist.gov:8800/dna/home.htm.
INTRODUCTION
The polymorphic nature of tandemly repeated DNA sequences that are widespread throughout the human genome has made them important genetic markers for gene mapping studies, linkage analysis, and human identity testing (1). While hundreds of STR systems have been mapped throughout the human genome (2), only a few dozen STR loci have been investigated for application to human identity testing (3-6). These STR loci are found on almost every chromosome in the genome and may be amplified using a variety of polymerase chain reaction (PCR) primers. Tetranucleotide repeats have been most popular among forensic scientists due to their fidelity in PCR amplification, although some tri- and pentanucleotide repeats are also in use (1,5). Desirable features for STR systems include high heterozygosity, a regular repeat unit, distinguishable alleles, and capability for robust amplification using PCR (7).
While the use of STRs for genetic mapping and identity testing has become widespread among DNA typing laboratories, there is no single place where information may be found regarding STR systems. The literature on DNA typing by STR analysis is spread out over hundreds of papers spanning the last six to eight years. A busy forensic scientist entering the new world of STR typing could easily feel overwhelmed. In addition, the nomenclature for allele designation often differs among laboratories, making comparison studies difficult at best. One example of variation between laboratories is in the repeat structure nomenclature for the STR locus HUMTH01. The Forensic Science Service designates the repeat structure as TCAT (4,5) while the Promega Corporation lists the repeat as AATG (8). This difference may be reconciled by an examination of the GenBank sequence for HUMTH01 (http://www2.ncbi.nlm.nih.gov/cgi-bin/genbank; D00269). TCAT is the first complete repeat on the top strand while AATG is the first full repeat on the bottom strand. Having access to sequence information for STR loci makes them easier to understand and can resolve apparent conflicts.
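The HUMTH01 case can be checked mechanically. The following Python sketch is an illustration only and is not part of STRBase; the function names are hypothetical. It takes the reverse complement of the top-strand repeat and tests whether the bottom-strand name is a cyclic rotation of it, which is what one would expect if the two names simply start reading the same tandem array at different bases.

```python
def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def rotations(motif):
    """All cyclic rotations of a repeat motif (the same repeat in every reading register)."""
    return {motif[i:] + motif[:i] for i in range(len(motif))}


def same_repeat(top_motif, bottom_motif):
    """True if bottom_motif names the same tandem repeat as top_motif on the opposite strand."""
    return bottom_motif.upper() in rotations(reverse_complement(top_motif))


print(reverse_complement("TCAT"))   # ATGA
print(same_repeat("TCAT", "AATG"))  # True
```

Under this check, TCAT (Forensic Science Service) and AATG (Promega) describe the same HUMTH01 repeat viewed from opposite strands.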
During the past year, with a goal of making future work with STRs and human identity testing easier, we began work on an STR review paper. We wanted to bring together the abundant literature on the subject in a cohesive fashion. The compilation of material regarding sequence information, observed alleles, and primer sequences quickly grew to a size that no respectable journal would consider publishing and that no one would have time to read in its entirety. We also realized that in a field that is developing as rapidly as DNA typing, a dynamic database would have greater value to the scientific and legal community than a static document. New information could constantly be entered into a database, whereas a published review paper merely marked the time when it was published and could only represent the past. The internet, and particularly the World Wide Web, with its ability to link information within a document and to be reached by virtually anyone in the world, became the focus of our attention. The material which had been gathered to prepare the review paper was translated into hypertext markup language (HTML) using Microsoft® FrontPage 97, and STRBase was born (Figure 1).
CONTENT OF STRBase
STR Fact Sheets
At the heart of STRBase is information describing each commonly used STR DNA marker. These "STR Fact Sheets" contain information on observed alleles (including reported microvariants) with their repeat structure and their PCR product sizes using published primers (Figure 2). Published reports of repeat sequence structures are followed or are converted to a common nomenclature. In almost all cases, we use the top strand from GenBank® for allele nomenclature and repeat sequence information. Each primer sequence or new reported allele is referenced to a comprehensive STR reference listing (see reference listing section below). Hyperlinks are included to GenBank® (9), making the entire sequence for the STR locus readily accessible. The number of repeats in the GenBank® sequence is also noted for each locus. Additionally, a list of population studies and references specific to the STR locus of interest may be reached via hyperlinks. Multiplex sets of STRs (see section below) and commercial sources for allelic ladders are hyperlinked as well. STR Fact Sheets may also be reached through a hyperlinked chromosomal index of identity testing markers (Figure 3). A listing of original papers for each STR locus is also part of STRBase. As of September 1997, STR Fact Sheets are available on STRBase for the following loci: CD4, CSF1PO, D5S818, D7S820, D12S391, D13S317, D16S539, DYS19, F13A1, F13B, FES/FPS, FGA, HPRTB, HUMTH01, LPL, TPOX, and VWA.
Sequence Information
Sequence information from GenBank® (http://www2.ncbi.nlm.nih.gov/cgi-bin/genbank) for commonly used STR loci is accessible through STRBase. Using the accession number for each STR locus, the GenBank® sequence was downloaded from the internet and then trimmed to include the repeat region and the flanking regions with previously reported primer sequences. In the JPEG sequence files found in STRBase, the repeat region and the published primer sequences are annotated (Figure 4). For example, the top strand repeat in HUMTH01 of TCAT and the bottom strand of AATG are both highlighted (see Fig. 4). The smaller primer set, developed by the Forensic Science Service (4), produces a 170 bp PCR product for allele 9 (the GenBank® sequence) and the larger primer set, developed by Caskey's group (1,3), generates a 195 bp amplicon.
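Because the primers bind outside the repeat region, the expected product size for other alleles can be estimated from one known allele. The Python sketch below is a rough illustration, not the method used to build the STR Fact Sheets. It assumes that alleles differ from the reference only by whole repeat units plus any extra bases given after the decimal point in the allele name (for example, "9.3" read as nine repeats plus three bases); the function names are hypothetical, and only the 170 bp and 195 bp reference sizes for HUMTH01 allele 9 come from the text above.

```python
def parse_allele(allele):
    """Split an allele name such as '9.3' into (full repeats, extra bases)."""
    whole, _, partial = allele.partition(".")
    return int(whole), (int(partial) if partial else 0)


def product_size(allele, ref_allele, ref_size_bp, repeat_len=4):
    """Estimate the PCR product size (bp) of an allele from a reference allele of known size.

    Assumes size differences come only from the repeat region.
    """
    full, extra = parse_allele(allele)
    ref_full, ref_extra = parse_allele(ref_allele)
    return ref_size_bp + (full - ref_full) * repeat_len + (extra - ref_extra)


# HUMTH01 reference sizes for allele 9, taken from the text above.
FSS_ALLELE_9_BP = 170      # Forensic Science Service primer set
CASKEY_ALLELE_9_BP = 195   # primer set from Caskey's group

print(product_size("9", "9", FSS_ALLELE_9_BP))      # 170
print(product_size("6", "9", FSS_ALLELE_9_BP))      # 158
print(product_size("9.3", "9", FSS_ALLELE_9_BP))    # 173
print(product_size("10", "9", CASKEY_ALLELE_9_BP))  # 199
```

Sizes estimated this way are nominal sequence lengths; measured sizes on a given separation platform may differ.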
It is important to point out that the repeat structures and even the number of repeats may differ depending on whether the top or bottom strand (as found in GenBank®) is used in the allele designation (Table 1). We recommend using the top strand from GenBank® for designating the repeat sequence of an STR marker as GenBank® is publicly available, and this allows for a common nomenclature.
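One way a repeat count can change between strands is a shift in reading register: if the motif chosen for the bottom strand starts one base into the array, the last copy is left incomplete. The Python sketch below illustrates this with a synthetic sequence. The flanking bases are invented for the example and are not taken from any GenBank® entry, and the function names are hypothetical.

```python
import re


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def longest_run(seq, motif):
    """Largest number of consecutive, uninterrupted copies of motif found anywhere in seq."""
    # The lookahead tries every start position, so overlapping registers are all tested.
    pattern = "(?=((?:" + re.escape(motif) + ")+))"
    runs = re.findall(pattern, seq)
    return max((len(run) // len(motif) for run in runs), default=0)


# Synthetic top strand: invented flanks around 13 copies of TCTA.
top = "ACGG" + "TCTA" * 13 + "CCAT"
bottom = reverse_complement(top)

print(longest_run(top, "TCTA"))     # 13
print(longest_run(bottom, "TAGA"))  # 13 (exact reverse complement of TCTA)
print(longest_run(bottom, "AGAT"))  # 12 (same array, read one base later)
```

Whether a particular locus loses or keeps a repeat in this way depends on its actual flanking sequence, which is why the annotated GenBank® sequence is the reference point.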
Population Data
Laboratories expend a great deal of time and effort to conduct studies on allele frequencies for various STR systems with populations that are of interest to their lab. Over 750 published population studies have now been brought together in one place for the first time through STRBase. Most of the studies have a sample size greater than 100, which is usually sufficient to make reliable projections about a genotype's frequency in a larger population (19).
Published population studies have been documented in STRBase by listing the STR system, the population examined, the number of unrelated individuals tested, and the reference. This index of STR population studies should be valuable for locating references that contain useful allele frequencies to aid in calculating matching probabilities for DNA typing cases. In addition to the complete list of 750 population studies, population data has been sorted by STR locus and is available for CD4, CSF1PO, D21S11, DYS19, F13A1, F13B, FES/FPS, FGA, HPRTB, HUMTH01, LPL, SE33, TPOX, and VWA. Unfortunately, copyright laws prohibit including complete allele frequency information from previously published work in STRBase.
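As a reminder of how published allele frequencies feed into a matching probability, the Python sketch below applies the usual textbook calculation: genotype frequencies of p² (homozygote) or 2pq (heterozygote) at each locus, multiplied across loci. It assumes Hardy-Weinberg proportions and independence between loci, and it is not a calculation taken from STRBase. The allele frequencies shown are invented placeholders, not values from any population study, and the function names are hypothetical.

```python
from math import prod


def genotype_frequency(allele_freqs, genotype):
    """Expected genotype frequency at one locus under Hardy-Weinberg proportions."""
    first, second = genotype
    p, q = allele_freqs[first], allele_freqs[second]
    return p * p if first == second else 2 * p * q


def profile_frequency(freqs_by_locus, profile):
    """Multiply genotype frequencies across loci, assuming the loci are independent."""
    return prod(
        genotype_frequency(freqs_by_locus[locus], genotype)
        for locus, genotype in profile.items()
    )


# Placeholder frequencies for illustration only (not published data).
example_freqs = {
    "HUMTH01": {"6": 0.25, "9": 0.15},
    "VWA": {"17": 0.30},
}
example_profile = {"HUMTH01": ("6", "9"), "VWA": ("17", "17")}

print(profile_frequency(example_freqs, example_profile))
```

In casework, the frequencies would come from a population study appropriate to the case, located through the index described above.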
Validation Studies on STR Loci
Before a new STR system or STR multiplex may be routinely employed in human identity testing, it should be extensively validated to ensure reliability of results. STRBase contains summaries of many of the validation studies which have appeared in the literature as well as a listing of the TWGDAM guidelines for validation of PCR-based DNA typing markers. These materials should help future researchers design their validation studies and allow those who use common STR systems to review what studies have been performed previously.
Technology for Separation of STR Alleles
The constant development of new technologies for DNA analysis makes it difficult for an individual to keep up with or sometimes even understand some of the new methodologies. We have included a brief review of techniques that have been successfully applied to resolving and detecting STR alleles. The established techniques of polyacrylamide gel electrophoresis with silver staining, fluorescent scanning, and automated fluorescent detection systems are discussed along with the emerging methods of capillary electrophoresis, capillary array electrophoresis, microchip CE analysis, and time-of-flight mass spectrometry. Pertinent references in the literature for each technology and hyperlinks to groups working in each area are also included.
Published PCR Primers
PCR relies upon the binding characteristics of the oligonucleotide primers used to define the targeted region of DNA. A great deal of effort is often expended to design primers which bind specifically to the region of interest and which do not form primer dimers. Since the positions of the primers define the size of the PCR products, primers are sometimes redesigned when STR multiplexes are prepared. Different groups have often designed unique primers for the same STR loci. The STR Fact Sheets take this into account by listing each primer set with the resulting PCR product sizes. In this section, we have broken up the published PCR primers into two groups: the Forensic Science Service and Caskey's group at Baylor (many of which have been adopted by Promega). It should be remembered that published primers are not necessarily those used by commercial sources for multiplex amplification.
STR Multiplexes
Many more STR multiplexes have been reported in the literature than are commercially available. Multiplexes are listed with their common name (e.g., PowerPlex™ or AmpFlSTR™ Blue), the STR loci used, comments about fluorescent labels, and the reference. Commercially available STR multiplexes are hyperlinked to the vendor.
Links to Other Web Sites
Hyperlinks are also made to organizations involved in DNA typing, commercial sources of instrumentation or DNA testing kits, paternity testing laboratories, electronic journals where STR publications have been found, and other useful DNA databases, such as GenBank® and the Genome Data Base. Table 2 includes some of the web site addresses which contain information on STR markers or may be of interest to forensic scientists.
Reference Listing
References from journals, conference proceedings, and book chapters were gathered and entered into Reference Manager. Over 500 references pertaining to STRs and their application to DNA typing are listed in STRBase. These references come from almost 50 sources including many conference proceedings and several book chapters (Table 3).
Other Information
We have a section of STRBase containing information on Y-chromosome STRs, which may prove useful for rape cases and paternity testing. The known alleles and PCR product sizes for 10 Y-chromosome STRs are listed as described in several papers (10-13). DYS19, DXYS156, DYS287, DYS385, DYS389I, DYS389II, DYS390, DYS391, DYS392, and DYS393 have 58 reported possible alleles.
Various PCR-based sex-typing assays have been reported in the literature. The most widely used one is amelogenin, in which a 6 bp deletion distinguishes the X-chromosome copy from the Y-chromosome copy. We have listed primer sequences, PCR product sizes, and references for several sex-typing markers including amelogenin (14,15), the centromeric alphoid repeat (16), and the ZFX/ZFY zinc finger gene (17,18).
Addresses for scientists working with STR markers are available in STRBase with the idea that email links can be made to every scientist doing DNA typing with STRs. Most of these addresses were obtained from the correspondence addresses listed in published papers. We invite others to add their name, phone number, and e-mail address to this portion of STRBase to aid cooperation among DNA typing laboratories around the world.
STRBase Access and Data Acquisition
The short tandem repeat DNA database is available to the general public through the World Wide Web: http://ibm4.carb.nist.gov:8800/dna/home.htm (Figure 1). When using information from STRBase, please cite this paper and the date on which the information was gathered from STRBase.
The information contained in STRBase is taken from published works on short tandem repeats used for DNA typing purposes. The literature is regularly searched for new publications, and updates to STRBase are made from time to time. Comments on the database, suggestions for further improvements, or submissions should be sent to Dr. Dennis J. Reeder, attn: STRBase, National Institute of Standards and Technology, Biotechnology Division, Building 222 Room A353, Gaithersburg, MD 20899 or dennis.reeder@nist.gov.
REFERENCES
1. Edwards, A., Civitello, A., Hammond, H.A., and Caskey, C.T. (1991) DNA typing and genetic mapping with trimeric and tetrameric tandem repeats. Am. J. Hum. Genet. 49, 746-756.
2. The Utah Marker Development Group (1995) A collection of ordered tetranucleotide-repeat markers from the human genome. Am. J. Hum. Genet. 57, 619-628.
3. Hammond, H.A., Jin, L., Zhong, Y., Caskey, C.T., and Chakraborty, R. (1994) Evaluation of 13 short tandem repeat loci for use in personal identification applications. Am. J. Hum. Genet. 55, 175-189.
4. Kimpton, C.P., Gill, P., Walton, A., Urquhart, A., Millican, E.S., and Adams, M. (1993) Automated DNA profiling employing multiplex amplification of short tandem repeat loci. PCR Meth. Appl. 3, 13-22.
5. Urquhart, A., Kimpton, C.P., Downes, T.J., and Gill, P. (1994) Variation in short tandem repeat sequences--a survey of twelve microsatellite loci for use as forensic identification markers. Int. J. Leg. Med. 107, 13-20.
6. Sprecher, C.J., Puers, C., Lins, A.M., and Schumm, J.W. (1996) General approach to analysis of polymorphic short tandem repeat loci. BioTechniques 20, 266-276.
7. Gill, P., Kimpton, C.P., Urquhart, A., Oldroyd, N.J., Millican, E.S., Watson, S.K., and Downes, T.J. (1995) Automated short tandem repeat (STR) analysis in forensic casework--a strategy for the future. Electrophoresis 16, 1543-1552.
8. Puers, C., Hammond, H.A., Jin, L., Caskey, C.T., and Schumm, J.W. (1993) Identification of repeat sequence heterogeneity at the polymorphic short tandem repeat locus HUMTH01[AATG]n and reassignment of alleles in population analysis by using a locus-specific allelic ladder. Am. J. Hum. Genet. 53, 953-958.
9. Benson, D.A., Boguski, M., Lipman, D.J., and Ostell, J. (1994) GenBank. Nucleic Acids Res. 22, 3441-3444.
10. Hammer, M.F. and Horai, S. (1995) Y chromosomal DNA variation and the peopling of Japan. Am. J. Hum. Genet. 56, 951-962.
11. Roewer, L., Kayser, M., Dieltjes, P., Nagy, M., Bakker, E., Krawczak, M., and de Knijff, P. (1996) Analysis of molecular variance (AMOVA) of Y-chromosome-specific microsatellites in two closely related human populations. Hum. Mol. Genet. 5, 1029-1033.
12. Roewer, L., Arnemann, J., Spurr, N.K., Grzeschik, K.-H., and Epplen, J.T. (1992) Simple repeat sequences on the human Y chromosome are equally polymorphic as their autosomal counterparts. Hum. Genet. 89, 389-394.
13. Roewer, L., Kayser, M., Nagy, M., and de Knijff, P. (1996) Male identification using Y-chromosomal STR polymorphisms. In Advances in Forensic Haemogenetics, Volume 6, Carracedo, A., Brinkmann, B., and Bar, W. (Eds.). Springer-Verlag: New York, pp. 124-126.
14. Sullivan, K.M., Mannucci, A., Kimpton, C.P., and Gill, P. (1993) A rapid and quantitative DNA sex test: fluorescence-based PCR analysis of X-Y homologous gene amelogenin. BioTechniques 15, 637-641.
15. Eng, B., Ainsworth, P., and Waye, J.S. (1994) Anomalous migration of PCR products using nondenaturing polyacrylamide gel electrophoresis: the amelogenin sex-typing system. J. Forensic Sci. 39, 1356-1359.
16. Lin, Z., Kondo, T., Minamino, T., Ohtsuji, M., Nishigami, J., Takayasu, T., Sun, R., and Ohshima, T. (1995) Sex determination by polymerase chain reaction on mummies discovered at Taklamakan desert in 1912. Forensic Sci. Int. 75, 197-205.
17. Reynolds, R. and Varlaro, J. (1996) Gender determination of forensic samples using PCR amplification of ZFX/ZFY gene sequences. J. Forensic Sci. 41, 279-286.
18. Stacks, B. and Witte, M.M. (1996) Sex determination of dried blood stains using the polymerase chain reaction (PCR) with homologous X-Y primers of the zinc finger protein gene. J. Forensic Sci. 41, 287-290.
19. Chakraborty, R. (1992) Sample size requirements for addressing the population genetic issues of forensic use of DNA typing. Hum. Biol. 64, 141-159.
Table 1. Repeat Nomenclature Differences between GenBank® Top and Bottom Strands for Commonly Used STR Loci. Repeats are listed as the first full repeat going in the 5′→3′ direction. Two loci, F13B and HPRTB, have a different number of repeats on the top and bottom strands with the repeat sequences shown here.
| STR Locus | Top Strand Repeat | # Repeats | Bottom Strand Repeat | # Repeats |
| CSF1PO | AGAT | 12 | CTAT | 12 |
| F13A1 | AAAG | 7 | CTTT | 7 |
| F13B | TTTA | 10 | AAAT | 9 |
| FES/FPS | ATTT | 11 | AAAT | 11 |
| FGA | TTTC (etc.) | 21 | GAAA (etc.) | 21 |
| HPRTB | TCTA | 13 | AGAT | 12 |
| LPL | TTTA | 8 | AAAT | 8 |
| TH01 | TCAT | 9 | AATG | 9 |
| TPOX | AATG | 11 | CATT | 11 |
| VWA | TCTA (etc.) | 20 | ATAG (etc.) | 20 |
| CD4 | TTTTC | 10 | GAAAA | 10 |
| DYS19 | TAGA | 12 | ATCT | 12 |
| D5S818 | AGAT | 11 | TATC | 11 |
| D7S820 | GATA | 12 | TCTA | 12 |
| D13S317 | TATC | 13 | GATA | 13 |
| D16S539 | GATA | 11 | TATC | 11 |
| D8S1179 | TATC | 12 | ATAG | 12 |
| D18S51 | GAAA | 13 | CTTT | 13 |
Table 2. Internet-Accessible Databases which Contain Useful Information on STR Markers
| BLAST Search (for PCR primer sequences) | http://www.ncbi.nlm.nih.gov/BLAST/ |
| Center for Medical Genetics | http://genetics.mfldclin.edu/ |
| Cooperative Human Linkage Center | http://www.chlc.org/ |
| The Genome Database | http://gdbwww.gdb.org/ |
| GenBank® | http://www2.ncbi.nlm.nih.gov/cgi-bin/genbank |
| Medline | http://www.ncbi.nlm.nih.gov/PubMed/ |
| MITOMAP (A human mitochondrial DNA database) | http://www.gen.emory.edu/mitomap.html |
| Patent Search (USPTO) | http://patents.cnidr.org/access/search-adv.html |
| STRBase | http://ibm4.carb.nist.gov:8800/dna/home.htm |
Table 3. Summary of STR References. As of August 13, 1997, there were 501 total references included in STRBase from over 50 different sources. Listed below is the number of papers describing information from each STR locus.
| STR Loci | # papers | w/ population data |
| HUMTH01 | 255 | 109 |
| VWA | 191 | 86 |
| FES/FPS | 122 | 58 |
| F13A1 | 89 | 39 |
| SE33 | 74 | 31 |
| TPOX | 63 | 27 |
| CSF1PO | 60 | 24 |
| D21S11 | 47 | 18 |
| F13B | 40 | 19 |
| HPRTB | 31 | 13 |
| FABP | 29 | 11 |
| CD4 | 32 | 17 |
| FGA | 29 | 5 |
| LPL | 28 | 13 |
| DYS19 | 21 | 12 |
| D18S51 | 20 | 3 |
| ARA | 17 | 12 |
| RENA4 | 14 | 9 |
| P450 (CYAR04) | | |
| MBP | 13 | 5 |
| D12S391 | 4 | 2 |
| D3S1358 | 3 | 2 |
Figure 1. STRBase Home Page Located at http://ibm4.carb.nist.gov:8800/dna/home.htm

Figure 2. HUMTH01 STR Fact Sheet

Figure 3. Chromosomal Index of STRs and Other DNA Typing Markers

Figure 4. Annotated HUMTH01 Sequence from GenBank® Accession Number D00269
The arrows indicate primer sequences while the brackets show the repeats. Primer sequences in blue are from the Forensic Science Service (4) and in red are from Caskey et al. (1,3).
