In a new paper reporting a protocol for using short-read sequence data to locate short tandem repeats (STRs), scientists find that long-read sequence information is necessary to resolve regions with repeat complexity, extreme GC content, and other challenging factors. Their solution is to use short-read data to find STRs, and then to use long-read sequencing to fully characterize those repeat expansions.
The Bioinformatics publication is entitled “Rapid detection of expanded short tandem repeats in personal genomics using hybrid sequencing” and came from scientists Koichiro Doi, Shinichi Morishita, et al. at the University of Tokyo. They focused on resolving STRs across whole genomes because of their links to genetic disease, noting that exome capture is insufficient to fully characterize these repeat units, which are often found in non-exonic regions of the genome.
Doi et al. report that short-read sequence data has traditionally proven inadequate for elucidating STRs that span more than 100 bp, the average length of a short read. In this project, they developed an efficient computational program along with ab initio procedures to sense and locate STRs by scanning massive short-read data sets and analyzing frequency distributions of approximate STRs based on length.
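The scanning idea can be illustrated with a minimal Python sketch. It is not the authors' program: it finds only exact tandem repeats (the paper works with approximate STRs), and the function names, the 2-6 nt unit range, and the three-copy minimum are all assumptions chosen for illustration. It tallies how often each repeat unit appears at each length across a set of reads, which is the kind of frequency distribution described above.

```python
from collections import Counter


def is_primitive(unit):
    # A string is a repetition of a shorter unit iff it occurs inside
    # its own doubling with the first and last characters removed.
    return unit not in (unit + unit)[1:-1]


def canonical(unit):
    # Treat rotations (CAG, AGC, GCA) as one unit by taking the smallest.
    # Assumption: reverse complements are NOT merged in this sketch.
    return min(unit[i:] + unit[:i] for i in range(len(unit)))


def find_exact_strs(read, min_unit=2, max_unit=6, min_copies=3):
    """Return (canonical_unit, start, length_bp) for exact tandem repeats."""
    hits = []
    for k in range(min_unit, max_unit + 1):
        i = 0
        while i + k <= len(read):
            unit = read[i:i + k]
            if "N" in unit or not is_primitive(unit):
                i += 1
                continue
            # Extend to the right while whole copies of the unit follow.
            j = i + k
            while read[j:j + k] == unit:
                j += k
            copies = (j - i) // k
            if copies >= min_copies:
                hits.append((canonical(unit), i, j - i))
                i = j  # continue scanning after this repeat
            else:
                i += 1
    return hits


def str_length_distribution(reads, **kwargs):
    """Count occurrences of each (unit, repeat length in bp) across reads."""
    dist = Counter()
    for read in reads:
        for unit, _start, length in find_exact_strs(read.upper(), **kwargs):
            dist[(unit, length)] += 1
    return dist


if __name__ == "__main__":
    # Toy reads, invented for illustration only.
    reads = ["TTCAGCAGCAGCAGTT", "GGATATATATCC", "ACGTACGA"]
    for (unit, length), count in sorted(str_length_distribution(reads).items()):
        print(unit, length, count)
```

In a real data set, a repeat that fills an entire read is the interesting case: its true length may far exceed the read length, which is why the authors turn to long reads to characterize it.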
However, they note, “As genomic regions of GC content > 70% are difficult to cover with an ample number of Illumina® reads, our method is unlikely to detect long expansions of STRs with high GC contents. STRs in reads originating in centromeres, telomeres, or retrotransposons are too numerous to map to unique genomic positions.”
To fully analyze longer STRs, the team used Single Molecule, Real-Time (SMRT®) Sequencing on 11 samples from patients with spinocerebellar ataxia type 31 (SCA31), a brain disease. Through this approach, they report, “we were able to determine a divergent set of 2.3-3.1 kb STR sequences in eleven SCA31 samples, showing the instability of STR expansions.” By combining the two methods (genome scanning with short-read data to find STR locations, then sequencing those structural variants with the PacBio® platform), the scientists were able to rapidly home in on long STRs implicated in human disease.
Looking ahead, the authors suggest that there is much to be learned about STRs longer than 1 kb and whether STR expansions occur more often in germline or somatic cells. “Analysis of the stability of STR expansions in germline and somatic cells of a specific disease might eventually lead to the recognition of a functional role of STRs,” they write.