UPDATE (April 16, 2019): The paper has now been published in Nature Communications.
ORIGINAL POST (October 12, 2017):
The Human Genome Structural Variation Consortium, a successor to the 1000 Genomes Project Consortium, recently released a preprint describing an in-depth study of structural variant (SV) detection in human genomes. The scientists found that PacBio long-read sequencing and complementary technologies dramatically improve sensitivity for these important genomic elements when compared to standard short-read sequencing.
“Multi-platform discovery of haplotype-resolved structural variation in human genomes” comes from lead authors Mark Chaisson, Ashley Sanders, and Xuefang Zhao; corresponding authors Charles Lee, Evan Eichler, and Jan Korbel; and many other consortium members. The study involved extensive sequencing of three family trios (Han Chinese, Puerto Rican, and Yoruban) for comprehensive discovery of structural variants. “The Han Chinese and Yoruban Nigerian families were representative of low and high genetic diversity genomes, respectively, while the Puerto Rican family was chosen to represent an example of population admixture,” the scientists write.
To date, attempts to identify all structural variants in a human genome using short-read technology have been unsuccessful. These variants are biologically and clinically relevant, so it is imperative that the community find better ways of resolving them. As the authors state, “The incomplete identification of structural variants from whole-genome sequencing data limits studies of human genetic diversity and disease association.” For this project, they add, “we integrated a suite of cutting-edge genomic technologies that, when used collectively, allow structural variants to be assessed in a near-complete, haplotype-aware manner in diploid genomes.” Tools included short-read sequencing, SMRT Sequencing, optical mapping, synthetic long reads, and single-cell/single-strand sequencing. The authors also applied multiple analysis algorithms for each type of data, further improving sensitivity.
The team identified in each genome more than 800,000 indel variants smaller than 50 bp and nearly 32,000 structural variants 50 bp or larger. That is “a sevenfold increase in structural variation compared to previous reports, including from the 1000 Genomes Project.” The authors also report the identification of “156 inversions per genome—most of which previously escaped detection—as well as large unbalanced chromosomal rearrangements.”
An evaluation of each technology's contribution showed that PacBio sequencing provides a threefold increase in sensitivity for structural variants compared to Illumina sequencing, likely resulting “from better access to intermediate-sized SVs (50 bp to 500 bp) and improved sequence resolution of insertions across the SV size spectrum,” the team notes. “The long-read sequence data provided us with an unprecedented view of genetic variation in the human genome,” the scientists add. “Using ~15 kbp reads at an average of 40-fold sequence coverage per child, we have been able to span areas of the genome that were previously opaque and discover three to fourfold more structural variation when compared to short-read sequencing platforms.” They estimate that 77% of insertions are routinely missed by variant-calling algorithms based on short-read data.
“This study represents the most comprehensive assessment of structural variation in human genomes to date,” the authors write. “We predict that a move forward to full-spectrum SV detection using an integrated approach demonstrated in this study will increase the diagnostic yield in patients with genetic disease, SV-mediated mutation, and repeat expansions.”
This research will be showcased next week at the American Society of Human Genetics (ASHG) annual meeting in Orlando. Charles Lee will present a talk entitled “Multi-platform Discovery of Haplotype-resolved Structural Variation in Human Genomes” at the PacBio workshop on Wednesday, October 18th at 12:30 pm. In the afternoon, Xuefang Zhao will present poster #1501, “Comprehensive Discovery of Genomic Variation from the Integration of Multiple Sequencing and Discovering Technologies.” Check out the complete list of presentations at ASHG 2017 featuring SMRT Sequencing.