In the field of genetics, tandem repeats have been scientifically fascinating, experimentally challenging, and a motivation for technology development. As DNA sequencing technologies and analysis tools have evolved, scientists are now able to reveal the secrets hidden within these repetitive sequences, shedding light on their significance in the genetics of humans and other organisms.
Join us as we explore their nature, significance, and the ways in which PacBio sequencing can enable you to study them more effectively than ever before.
What are tandem repeats?
Tandem repeats are sequences of DNA comprising one or more nucleotides that are repeated in a contiguous, head-to-tail fashion on a chromosome. These repeating units of DNA, also known as repeat motifs, can occur anywhere from a couple of times to many hundreds of times within a single chromosomal region.
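To make "contiguous, head-to-tail" concrete, here is a minimal Python sketch that finds the longest uninterrupted run of a given motif in a sequence. The function name and the example sequence are made up for illustration, and this is a toy exact-match search, not how production repeat genotypers work.

```python
import re


def longest_tandem_run(sequence: str, motif: str) -> tuple[int, int]:
    """Return (start, copies) for the longest head-to-tail run of `motif`.

    Returns (-1, 0) when the motif does not occur in the sequence.
    Illustrative helper only: it requires exact, uninterrupted copies.
    """
    if not motif:
        raise ValueError("motif must be a non-empty string")

    best_start, best_copies = -1, 0
    # The lookahead lets us test every start position, so a longer run that
    # begins inside an earlier, shorter match is not skipped.
    pattern = f"(?=((?:{re.escape(motif)})+))"
    for match in re.finditer(pattern, sequence):
        copies = len(match.group(1)) // len(motif)
        if copies > best_copies:
            best_start, best_copies = match.start(), copies
    return best_start, best_copies


# Hypothetical sequence: five back-to-back copies of "CAG" between two flanks.
example = "TTGA" + "CAG" * 5 + "GTCA"
start, copies = longest_tandem_run(example, "CAG")
print(start, copies)
```

In this example the run starts at position 4 (zero-based) and contains five copies of the motif.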
What are the different kinds of tandem repeats?
Micro, mini, and macro. Short, variable, and long. Satellite. If you do a quick Google search on the types of tandem repeats, you will find a dizzying number of different (and often synonymous) terms for the varieties of tandem repeats that appear in a genome. Despite these apparent inconsistencies in terminology, tandem repeats can generally be categorized into two main types:
Short tandem repeats (STRs)
Also known as microsatellites or simple sequence repeats (SSRs), a short tandem repeat is a tandemly repeating unit of DNA with a motif that is anywhere from 1 to 6 base pairs (bp) in size [1].
Variable number tandem repeats (VNTRs)
Sometimes referred to as minisatellites and macrosatellites, a variable number tandem repeat is a tandemly repeating unit of DNA with a motif that is ≥7 bp in size [1]. Within the body of literature on tandem repeats, there are instances where motifs ≥100 bp in size are called macrosatellites. However, this usage is inconsistent, and under the definition above such repeats would still be deemed VNTRs.
Tandem repeat categorization does not depend on the number of copies of the repeated sequence
It is important to point out that these definitions are based on the size of the repeat unit (the motif), and the unit size alone. They have nothing to do with how many times a motif is repeated. For example, a three-base-pair STR such as “ACG” that is tandemly repeated 10,000 times (for 30,000 total base pairs) is still categorized as an STR. Similarly, a motif that is 50 bases long and repeated only three times (for 150 total base pairs) is still considered a VNTR, even though it is much shorter overall than the sequence in the first example.
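The rule above is simple enough to write down as code. The following Python sketch applies the 1 to 6 bp versus ≥7 bp cutoff from the definitions in this article; the function names and the made-up 50-base motif are illustrative assumptions, not part of any published tool.

```python
def classify_tandem_repeat(motif: str) -> str:
    """Classify a repeat by motif size only: 1-6 bp is an STR, >=7 bp is a VNTR."""
    if not motif:
        raise ValueError("motif must be a non-empty string")
    return "STR" if len(motif) <= 6 else "VNTR"


def total_repeat_length(motif: str, copies: int) -> int:
    """Total span of the repeat tract in base pairs (motif size x copy number)."""
    return len(motif) * copies


# The two examples from the text: copy number changes the total length,
# but never the category.
short_motif = "ACG"        # 3 bp motif
long_motif = "ACGTA" * 10  # hypothetical 50 bp motif, for illustration only

print(classify_tandem_repeat(short_motif), total_repeat_length(short_motif, 10_000))
print(classify_tandem_repeat(long_motif), total_repeat_length(long_motif, 3))
```

The first line reports an STR spanning 30,000 bp and the second a VNTR spanning 150 bp, matching the examples in the paragraph above.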
Where did the alternative “satellite DNA” terminology come from?
Before DNA sequencing was as accurate, affordable, and ubiquitous as it is today, researchers had to use alternative methods to deduce the composition of an organism’s genome. The term “satellite DNA” (and the associated terms micro-, mini-, and macrosatellite) originates from the early characterization of certain fractions of genomic DNA during density gradient centrifugation [2]. In the mid-20th century, researchers employed density gradient centrifugation techniques to separate DNA based on its buoyant density. They observed that genomic DNA, when subjected to centrifugation, formed distinct bands with varying densities. Some appeared as satellite bands, positioned away from the main band of genomic DNA. When researchers were finally able to sequence these satellite bands of DNA, they discovered that they contained the various sizes of what we now call tandem repeats.
Why are tandem repeats ≥7 bp deemed to be “variable number” tandem repeats when those that are smaller are just called STRs? Are VNTRs more variable, hence the name?
As far as one can tell from the body of research on the topic, VNTRs and STRs are no more and no less variable than one another. This includes the frequency of point mutations within a given STR or VNTR motif, as well as the variability of how many times an STR or VNTR motif is repeated in the genome.
Why are tandem repeats important?
Despite predominantly residing in non-coding regions of genes, the influence of tandem repeats is more significant than you might think. They constitute >3% of the entire human genome and have a substantial impact on biology [3,4]. In fact, tandem repeats are responsible for a large fraction of structural variation that is longer than 50 base pairs [5]. These tandem repeat regions exhibit remarkable variability and have been shown to play a crucial role in phenotype in many eukaryotic organisms. Moreover, tandem repeats have been shown to contribute to a multitude of genetic diseases in humans, making them an important subject of biomedical research [4].
Tandem repeats have been linked to changes in gene expression in various cancers and are associated with over 50 nervous system diseases, including amyotrophic lateral sclerosis (ALS), fragile X syndrome (FXS), ataxias, autism spectrum disorders, and schizophrenia [6-10]. Identifying tandem repeats and accurately capturing and cataloging their sequences is the first step toward comprehending how they drive diseases, potentially leading to the discovery of biomarkers, drug targets, and the development of therapeutics.
How are tandem repeats studied?
Unlike the density gradient centrifugation techniques of old, scientists today identify tandem repeats and study their intricacies using DNA sequencing. For VNTRs, long-read sequencing technologies like PacBio HiFi sequencing have been vital, as they enable researchers to call bases accurately while spanning large repeats with plenty of read overlap. But sequencing chemistry is not the only thing responsible for the effectiveness of these new methodologies.
The study of tandem repeats has been revolutionized by innovative software tools like TRGT (Tandem Repeat Genotyping Tool) and TRVZ (Tandem Repeat Visualization Tool), which are designed to work in synergy with PacBio HiFi sequencing. This approach eliminates the challenges posed by conventional methods and equips researchers with long (>10,000 base pairs), highly accurate (99.9%) reads and a suite of specialized analysis tools tailored to the complexities of studying tandem repeats.
TRGT + TRVZ capabilities encompass:
Size genotyping and mosaicism estimation.
Sequence composition analysis, including interruptions and regions with multiple repeats.
5mC CpG methylation calling.
Visual display of haplotype-resolved read pileup and methylation status.
By combining HiFi sequencing with TRGT and TRVZ, researchers can overcome the mathematical hurdles of the past, empowering them to answer crucial questions about the role that these important genomic regions play in various aspects of genetics, from trait evolution to the biology of inherited disease.
If you’re ready to explore the potential of TRGT, you can access it now on GitHub and delve into greater detail with an on-demand presentation by PacBio scientist and co-developer of TRGT, Egor Dolzhenko.
Can tandem repeats be analyzed using short reads?
Yes and no. Until viable long-read sequencing techniques (such as HiFi sequencing) came onto the scene, VNTR sequences posed an essentially insurmountable challenge for accurate analysis due to the sheer scale of their repetitive character. Conventional short-read sequencing methods (such as sequencing by synthesis, or SBS) struggle to handle these repetitive regions because the reads are often too inaccurate and too short to span the entire length of the repeat and call it correctly. As a result, it becomes too mathematically and computationally difficult to piece together and align these small snapshots of the repeat sequence with certainty.
Mark Chaisson, a bioinformatician at the University of Southern California and an expert on tandem repeat analysis, has confirmed the impracticalities of analyzing VNTRs with SBS. For an in-depth discussion on this topic, watch Dr. Chaisson’s talk on the challenges of repeat analysis using traditional short reads and the benefits of “going long” when choosing a sequencing technology for VNTR analysis.
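The spanning problem comes down to simple arithmetic, sketched below in Python. The assumption here is that a read needs to cover the whole repeat tract plus some unique flanking sequence on each side to be placed and sized with confidence. The 20 bp flank, the 150 bp short-read length, and the 15,000 bp long-read length are illustrative assumptions, not figures from any particular platform or tool.

```python
def read_can_span(read_length: int, motif_length: int, copies: int,
                  flank: int = 20) -> bool:
    """Return True if one read could cover the repeat tract plus both flanks.

    `flank` is a hypothetical number of unique bases wanted on each side
    of the repeat; real aligners and genotypers use their own criteria.
    """
    tract_length = motif_length * copies
    return read_length >= tract_length + 2 * flank


SHORT_READ = 150    # assumed short-read length, in bp
LONG_READ = 15_000  # assumed long-read length, in bp

# A low-copy STR: 3 bp motif x 20 copies = 60 bp tract.
print(read_can_span(SHORT_READ, motif_length=3, copies=20))

# A VNTR: 50 bp motif x 10 copies = 500 bp tract.
print(read_can_span(SHORT_READ, motif_length=50, copies=10))
print(read_can_span(LONG_READ, motif_length=50, copies=10))
```

Under these assumptions, the short read spans the low-copy STR but not the 500 bp VNTR, while the long read spans the VNTR with room to spare.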
However, for studying STRs (also known as microsatellites) with low copy number, highly accurate short-read sequencing technology can be very effective.
SBB sequencing is a cutting-edge short-read technology that excels at short tandem repeats
Because STR motifs are very short (≤6 bp), errors inherent to the sequencing technology that is being used to study them can have a dramatic impact on the conclusions that are drawn from an analysis. Therefore, for utmost confidence in the results of an STR project, it’s crucial to use a DNA sequencing technology with exceptionally high raw read accuracy. And sequencing by binding (SBB) is exactly that.
For a complete background on SBB, check out this article.
SBB sequencing is a Q40+ (one error in ten thousand bases!) short-read sequencing technology that works by measuring the light signals from fluorescently labeled nucleotides when they are bound, but not incorporated, by a polymerase on a DNA strand. Unlike other short-read technologies, SBB separates the binding and subsequent extension steps of the sequencing process, which eliminates the errors introduced by molecular artifacts and results in extreme accuracy.
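For readers unfamiliar with Q scores, the "one error in ten thousand bases" figure follows from the standard Phred definition, Q = -10 × log10(p), where p is the per-base error probability. The short Python sketch below shows the conversion in both directions; the function names are chosen for illustration.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score Q to a per-base error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert a per-base error probability to a Phred quality score Q."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


# Q40 corresponds to an error probability of 1 in 10,000.
p_q40 = phred_to_error_probability(40)
print(f"Q40 -> one error in about {round(1 / p_q40):,} bases")

# 99.9% accuracy means an error probability of 0.001, which is Q30.
print(f"99.9% accuracy -> Q{error_probability_to_phred(0.001):.0f}")
```

Because the scale is logarithmic, every 10-point increase in Q means ten times fewer expected errors per base.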
If you study tandem repeats, PacBio is for you
As we’ve seen, tandem repeats play a pivotal role in genetics, encompassing both Short Tandem Repeats (STRs) and Variable Number Tandem Repeats (VNTRs), and significantly influencing structural variation and genetic diseases. The precision of PacBio HiFi sequencing, enhanced by innovative software like TRGT and TRVZ, provides scientists with extraordinary accuracy and confidence in their study of these sequences. Whether it’s unraveling the intricacies of VNTRs or applying the meticulousness of SBB sequencing to STRs, PacBio offers the essential tools for demystifying these complex genomic elements. Step into the future of genomic research with PacBio sequencing, where comprehensive access to tandem repeat data paves the way for groundbreaking insights and discoveries.
References
1. Sulovari A, Li R, Audano PA, et al. Human-specific tandem repeat expansion and differential gene expression during primate evolution. Proc Natl Acad Sci U S A. 2019;116(46):23243-23253.
2. Tautz D. Notes on the definition and nomenclature of tandemly repetitive DNA sequences. EXS. 1993;67:21-28.
3. Lander ES, Linton LM, Birren B, et al. Initial sequencing and analysis of the human genome. Nature. 2001;409(6822):860-921.
4. Usdin K. The biological effects of simple tandem repeats: lessons from the repeat expansion diseases. Genome Res. 2008;18(7):1011-1019.
5. Chaisson MJP, Huddleston J, Dennis MY, et al. Resolving the complexity of the human genome using single-molecule sequencing. Nature. 2015;517(7536):608-611.
6. Cortese A, Simone R, Sullivan R, et al. Biallelic expansion of an intronic repeat in RFC1 is a common cause of late-onset ataxia. Nat Genet. 2019;51(4):649-658.
7. Hunter JE, Berry-Kravis E, Hipp H, Todd PK. FMR disorders. In: Adam MP, Feldman J, Mirzaa GM, et al., eds. GeneReviews®. University of Washington, Seattle; 1993.
8. Mojarad BA, Engchuan W, Trost B, et al. Genome-wide tandem repeat expansions contribute to schizophrenia risk. Mol Psychiatry. 2022;27(9):3692-3698.
9. Siddique N, Siddique T. Amyotrophic lateral sclerosis overview. In: Adam MP, Feldman J, Mirzaa GM, et al., eds. GeneReviews®. University of Washington, Seattle; 1993.
10. Trost B, Engchuan W, Nguyen CM, et al. Genome-wide detection of tandem DNA repeats that are expanded in autism. Nature. 2020;586(7827):80-86.