A new paper out in Nature extends our view into the human genome and challenges current ideas about genetic variation. “Resolving the complexity of the human genome using single-molecule sequencing” comes from first author Mark Chaisson, senior author Evan Eichler, and their collaborators at the University of Washington, University of Bari Aldo Moro, and University of Pittsburgh. In the paper, the scientists describe an important effort to fill gaps and better characterize structural variation in the human genome by using Single Molecule, Real-Time (SMRT®) Sequencing data.
The team sequenced a haploid human genome, using a hydatidiform mole cell line (CHM1), to about 40x coverage. Eichler’s group was able to close or shrink 55 percent of the 160 euchromatic gaps remaining in the reference genome. The vast majority of these gaps contained GC-rich regions with several kilobases of short tandem repeats. The approach relied on repeated rounds of mapping and assembling the data, and it added more than 1 Mb of novel sequence, including novel exons and putative regulatory sequences, to the genome.
Perhaps the most remarkable advances from this project came from resolving structural variants across the human genome. The team built a computational pipeline to analyze these elements genome-wide. “We identified a total of 26,079 insertion/deletions ≥50 bp within the euchromatic portion of the genome,” the authors write. “Almost all insertion and deletion breakpoints were resolved at the single-basepair level generating one of the most comprehensive catalogs of structural variation (47,238 breakpoint positions).” They note that only a fraction of these variants can be detected with short-read sequencing technology.
For instance, 85 percent of all copy number differences the scientists found had never been previously reported in structural variation studies. For insertions, 92 percent were novel; 69 percent of deletions had never been detected before.
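To make the ≥50 bp size threshold concrete, here is a minimal sketch of how large insertions and deletions could be pulled out of a single long-read alignment. It is not the authors' pipeline, which the paper describes separately. It assumes the alignment is summarized by a SAM-style CIGAR string, and the function name `find_large_indels` and the example alignment are hypothetical, chosen only for illustration.

```python
import re

MIN_SV_LEN = 50  # size threshold quoted from the paper for insertions/deletions


def find_large_indels(ref_start, cigar, min_len=MIN_SV_LEN):
    """Return (type, ref_pos, length) for each insertion or deletion >= min_len.

    ref_start is the reference coordinate where the alignment begins;
    cigar is a SAM-style CIGAR string such as "200M75I300M".
    """
    events = []
    ref_pos = ref_start
    for length_text, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length_text)
        # Record only insertions (I) and deletions (D) that meet the threshold.
        if op in "ID" and length >= min_len:
            events.append(("INS" if op == "I" else "DEL", ref_pos, length))
        # Only these operations consume reference bases and move the coordinate.
        if op in "MDN=X":
            ref_pos += length
    return events


# Hypothetical alignment: a 75 bp insertion, a 10 bp deletion (below the
# threshold, so ignored), and a 120 bp deletion.
example = find_large_indels(1000, "200M75I300M10D100M120D50M")
print(example)
```

In this sketch an insertion is reported at a single reference position, while a deletion spans from its reported position to that position plus its length. A real pipeline would also need to merge calls across reads and handle events split over multiple alignments, which this example leaves out.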
The scientists also found a significant number of highly complex structural variants, such as insertions containing multiple annotated repeats. “Complex repetitive regions such as these represent a major challenge in SV detection due to spurious mapping of short-read sequence data,” they note. Only a very small fraction of these complex variants, less than 1 percent, is included in current human genome assemblies, including the new reference genome. “Since we find evidence of most of these complex events in additional human or chimpanzee genomes, we propose that these ~1700 sites (3.5 MB) represent deficiencies or ‘muted’ gaps that can now be accessed as a result of SMRT technology,” the authors write.
Eichler’s team has incorporated all of this novel sequence data and made it publicly available. Their work will allow other scientists to get a far clearer view of the human genome than has previously been possible. “Our results suggest a greater complexity of the human genome in the form of variation of longer and more complex repetitive DNA that can now be largely resolved with the application of this longer-read sequencing technology,” they write.