Long-read sequencing technologies such as PacBio HiFi sequencing are quickly becoming the new gold standard in genomics research. This article provides an introductory look at what long-read sequencing is, and explores topics including advantages, applications, and more.
Background
Deciphering the essence of what makes living things work has been a central goal for scientists around the world for nearly two centuries, from Mendel’s first conjectures on the laws of biological inheritance to Nirenberg’s cracking of the genetic code and beyond. Today, the sheer complexity of many of the most pressing research questions in biology requires considering not just single genes or inheritance patterns but the entirety of an organism’s genetic information (the genome) and its myriad functions.
Because genomes can be millions to even billions of bases long, extracting one in its entirety from a sample intact is a practical impossibility, at least for now. Instead, researchers utilize instruments to reconstruct genomic sequence information from small individual fragments. In the first part of the process, genomes are broken up into a staggering number of pieces before the actual sequencing, reassembly, and analysis can take place. Depending on the technology being used, the extracted DNA sample will undergo several preparatory steps to ensure that the sequencing machine is being provided with fragments that are sized appropriately for the system’s capabilities. Sequencing instruments are categorized as long-read or short-read based on their underlying chemistry and the length of DNA fragments they analyze.
What is long-read sequencing?
Long-read sequencing is a type of nucleic acid sequencing that produces genomic data by generating individual reads, each derived from a single molecule that is thousands of nucleotides or more in length.
Long-read sequencing uses DNA (or RNA) fragments ranging in size from 1,000 to 20,000 bases or more. These fragments are often derived from what are referred to as “native” molecules which are extracted directly from the biological sample for analysis. In contrast, most short-read sequencing technologies use fragments that are 50 to 300 bases long. Unlike most long-read approaches, short-read solutions are unable to sequence native molecules effectively and instead require that extracted DNA be synthetically copied prior to analysis.
What are the advantages of long-read sequencing?
It should come as no surprise that the fundamental differentiator between long- and short-read sequencing is the length of the molecules being analyzed. Each has its own strengths and weaknesses that depend on the intended research application. To understand the difference between short-read assembly and long-read assembly and to see why long-read sequencing excels in areas such as whole genome reconstruction, consider the following example.
The benefits of assembling a genome with long-read sequencing: an example using books
One way to think about the difference between assembling a genome with short-read versus long-read sequencing technology is to imagine two different approaches to reconstructing a 500-page novel from randomized snippets of text.
Short-read sequencing is equivalent to working only with fragmented statements like “then there were” or “sometimes she.” It would be daunting to reconstruct a novel from such short pieces of text because they are incomplete and lack the contextual information needed to help us place them in order correctly. Similarly, reconstructing a highly accurate and highly detailed copy of a whole genome (the book in our analogy) using only short-read sequencing data can be exceedingly difficult, requiring overly complex and computationally intensive mathematical models to accomplish. Even after overcoming the challenge of piecing all these fragments back together in order, the final short-read assembly still often contains numerous errors and gaps of missing information.
With long-read sequencing data, assembling a genome is much easier, more like piecing together our 500-page novel with snippets of text that encompass entire paragraphs instead of small fragments. These long runs of text provide contextual information about key events in the plot, making it much easier to place them in order correctly to reconstruct the story. Similarly, the obstacles that a researcher must overcome to create a genome assembly with long reads are much more straightforward, with fewer and less complex computational steps than short-read solutions.
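To make the analogy concrete, the short sketch below counts how often a snippet of a given length occurs more than once in a sequence, which is what makes a snippet hard to place. The toy sequence, the repeat it contains, and the function name `ambiguous_fraction` are illustrative assumptions, not data or software from any real assembler.

```python
from collections import Counter


def ambiguous_fraction(text: str, read_len: int) -> float:
    """Return the fraction of read_len-long snippets that occur more than once in text."""
    snippets = [text[i:i + read_len] for i in range(len(text) - read_len + 1)]
    counts = Counter(snippets)
    repeated = sum(1 for s in snippets if counts[s] > 1)
    return repeated / len(snippets)


# Hypothetical toy "genome": unique flanks around a 21-base tandem repeat.
genome = "ACGTTGCA" + "GATTACA" * 3 + "TTGACCGT"

short_fraction = ambiguous_fraction(genome, 4)   # snippets shorter than the repeat
long_fraction = ambiguous_fraction(genome, 25)   # snippets longer than the repeat

print(f"4-base snippets that are ambiguous:  {short_fraction:.0%}")
print(f"25-base snippets that are ambiguous: {long_fraction:.0%}")
```

In this toy sequence, more than half of the 4-base snippets appear in more than one place, while every 25-base snippet is unique because it is longer than the repeated region and always carries some flanking context.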
The advantages of long reads depend on accuracy
“…while read length is frequently suggested as a dominant factor…, our results demonstrate that the benefits of read length are overshadowed by the higher sequencing accuracy of the HiFi technology.”
– Mahmoud et al. 2023
Building on the analogy of the 500-page novel, it is important to note that not all long-read sequencing technologies are created equal. There is one critical element that sets competing long-read technologies apart: accuracy.
Correctly assembling a genome is no small undertaking, and although long reads provide more context than short reads for this task, the advantages are diminished without sufficient accuracy. Using our book analogy, having inaccurate long-read data would be like having long snippets of text that are coherent enough to provide plot context but are at the same time filled with spelling errors and garbled, nonsensical text that make it difficult to discern exactly how a key event happened and when. Much like with short reads, overcoming the analytical challenges presented by inaccurate long-read technologies can be time-consuming and require complex computational processing and data polishing. In our analogy, if the genomic equivalent of a summary of the novel is all that is required, then this level of accuracy might be acceptable. However, if the task demands that every letter and every point of punctuation be as correct as possible in the reconstruction, then the best possible snippets are required.
To meet this dual demand for both length and accuracy in genomic analysis, HiFi sequencing was developed by scientists at PacBio.
What is HiFi sequencing?
HiFi sequencing is a single-molecule, long-read sequencing technology that produces reads that are both long and accurate. HiFi sequencing was developed by PacBio and is the core chemistry run on all PacBio long-read sequencing instruments.
HiFi sequencing has its origins in the nanofluidic designs and single molecule real-time chemistry developed by PacBio CTO Dr. Stephen Turner and CSO Dr. Jonas Korlach at Cornell University in the early 2000s.
Unlike other long-read technologies that suffer from highly variable chemistry and data quality, HiFi sequencing is distinct in that it can provide researchers with very consistent sequencing performance with reads that are 15,000 to 20,000 bases or more in length. Furthermore, the consensus approach used to determine a sequence (see “how it works” section below) allows HiFi sequencing to achieve an accuracy of 99.9%. Combined, these length and accuracy metrics make HiFi sequencing one of the most powerful sequencing technologies in the world for studying the most complex and technically challenging aspects of genomics.
In recognition of its important contribution to advances in genomic research, HiFi sequencing was co-awarded the prestigious title of 2022 Method of the Year by the journal Nature Methods.
How does HiFi sequencing work?
HiFi sequencing begins when circularized fragments of sample DNA, suspended in solution, are flooded across the surface of a nanofluidic chip called a SMRT (Single Molecule, Real-Time) Cell. The surface of this chip is checkered with many millions of cylindrically shaped recesses — or wells — called zero-mode waveguides (ZMWs) that are each only nanometers wide. As a sample flows over the SMRT Cell, the circularized pieces of DNA are immobilized at the bottom of the ZMWs. Once the sample DNA is situated inside a ZMW, free-floating nucleotides are added and a DNA polymerase enzyme that was attached to the sample DNA during library preparation begins to copy the sample molecule. As the polymerase incorporates new nucleotide bases into the newly replicated strand, a tiny amount of light is released and is picked up by a detector. Depending on the light emitted, the sequencing system can determine which DNA base (adenine, thymine, cytosine, or guanine) was incorporated.
Much like a race car making repeated laps around a circular racetrack, the DNA polymerase in HiFi sequencing works its way around the circularized sample molecule many times over. Because the polymerase generates multiple copies of each piece of DNA held within the ZMWs, PacBio long-read sequencing systems can pinpoint the sample’s correct sequence by cross-referencing each copy of the molecule to maximize accuracy in what is called circular consensus sequencing (CCS).
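The idea behind cross-referencing repeated passes can be sketched as a per-position majority vote. This is a deliberately simplified illustration: it assumes the passes are already aligned and of equal length, and the function name `majority_consensus` and the example reads are hypothetical. Production consensus calling has to handle alignment, insertions, and deletions, and is not reproduced here.

```python
from collections import Counter


def majority_consensus(passes: list[str]) -> str:
    """Return the most common base at each position across pre-aligned passes."""
    if not passes:
        raise ValueError("at least one pass is required")
    if len({len(p) for p in passes}) != 1:
        raise ValueError("this sketch assumes all passes have the same length")
    # zip(*passes) yields one column of bases per position in the molecule.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*passes))


# Five hypothetical passes over the same molecule; two contain a single random error.
passes = [
    "ACGTAC",
    "ACCTAC",  # error at position 3
    "ACGTAC",
    "ACGTTC",  # error at position 5
    "ACGTAC",
]

consensus = majority_consensus(passes)
print(consensus)
```

Because the errors in individual passes fall at different positions, each one is outvoted by the other passes, which is why more laps around the molecule tend to give a more accurate consensus.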
Once the data from all the ZMWs in a SMRT Cell has been compiled, a primary data output is generated, ready for downstream analysis by a researcher.
Both the Sequel IIe system and newer PacBio long-read sequencing platforms measure the speed at which each base is incorporated by the polymerase. That information is then used by PacBio SMRT Link software to determine whether the base is methylated, which is critical for epigenetic studies.
What are the advantages of HiFi sequencing?
The benefits that HiFi sequencing can bring to a specific study or research discipline are numerous, but the following four features give genomic researchers important benefits over alternative sequencing approaches, regardless of the research application.
Long read lengths
HiFi sequencing delivers reads of 15,000 to 20,000 base pairs or more, enabling researchers to confidently assemble reference-grade genomes and sequence full-length RNA transcripts.
High accuracy
Through circular consensus, HiFi sequencing generates reads with 99.9% accuracy.
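Sequencing accuracy is often quoted on the Phred scale, where Q = -10 × log10(error rate). The sketch below converts the 99.9% figure to that scale and estimates the expected number of errors in a read. It treats the accuracy as a uniform per-base value, which is a simplifying assumption, and the function names are illustrative.

```python
import math


def phred_quality(accuracy: float) -> float:
    """Convert a per-base accuracy (between 0 and 1) to a Phred quality score."""
    error_rate = 1.0 - accuracy
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("accuracy must be at least 0 and strictly below 1")
    return -10.0 * math.log10(error_rate)


def expected_errors(read_length: int, accuracy: float) -> float:
    """Expected number of erroneous bases, assuming a uniform per-base error rate."""
    return read_length * (1.0 - accuracy)


q_hifi = phred_quality(0.999)
errors_hifi = expected_errors(15_000, 0.999)

print(f"99.9% accuracy is about Q{q_hifi:.0f}")
print(f"Expected errors in a 15,000-base read: about {errors_hifi:.0f}")
```

Under these assumptions, 99.9% accuracy corresponds to roughly Q30, or about 15 expected errors in a 15,000-base read.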
Uniform coverage
By eliminating bias associated with amplification, HiFi sequencing enables researchers to analyze genomic regions that are often inaccessible to other technologies (such as hard-to-sequence AT- and GC-rich content, highly repetitive areas, long homopolymers, and palindromic sequences).
Native methylation detection
By sequencing DNA extracted directly from a sample without amplification, base modifications can be detected during sequencing through the measurement of base incorporation kinetics. This allows for the capture of both sequence and methylation information in a single experiment with no additional preparatory steps required.
What are the applications of HiFi sequencing?
With its ability to generate long, accurate read data with uniform genomic coverage and native methylation detection, HiFi sequencing has many genomic analysis applications that can be leveraged across the full spectrum of biology disciplines.
HiFi sequencing applications, at a glance:
- Haplotype phasing
- Detection of large and complex variants
- Comprehensive and accurate genome assembly
- Epigenetics
Haplotype phasing
When searching for the genetic basis for a desirable crop characteristic or the origins of a complex heritable disease in humans, it is critical to be able to fully distinguish each chromosomal copy, or haplotype (e.g., maternally or paternally inherited), from the other in a process called phasing. The long-range capabilities of HiFi sequencing reduce the statistical complexity and increase the confidence in correctly reconstructing each chromosomal copy. In most cases, HiFi sequencing eliminates the need for trio or population-based phasing techniques, which can be a major strain on a research team’s limited time and resources. In a recent study on the genomics of spinal muscular atrophy (SMA), researchers used HiFi sequencing to identify two SMN1 haplotypes forming a common two-copy SMN1 allele in African populations. Testing positive for these two haplotypes in an individual with two copies of SMN1 gives a silent carrier risk of 88.5%, which is significantly higher than the 1.7%–3.0% risk indicated by the currently used SNP (single nucleotide polymorphism) marker. This result demonstrates the potential of HiFi sequencing to enable haplotype-phased screening of silent carriers for SMA.
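One way to see why read length matters for phasing: two heterozygous sites can only be linked directly if a single read covers both of them. The sketch below groups sites into blocks that could be connected by reads of a given length. It is an idealized illustration that assumes a read is available at every possible start position and that reads are error-free. The site positions and the function name `phase_blocks` are hypothetical.

```python
def phase_blocks(het_sites: list[int], read_len: int) -> list[list[int]]:
    """Group heterozygous positions into blocks linkable by reads of read_len bases.

    Two neighbouring sites are linked when one read can cover both,
    i.e. when the distance between them is smaller than read_len.
    """
    blocks: list[list[int]] = []
    for pos in sorted(het_sites):
        if blocks and pos - blocks[-1][-1] < read_len:
            blocks[-1].append(pos)  # close enough to extend the current block
        else:
            blocks.append([pos])    # too far away: start a new block
    return blocks


# Hypothetical heterozygous site positions along one chromosome region.
sites = [1_000, 1_800, 9_500, 21_000, 22_500, 40_000]

short_read_blocks = phase_blocks(sites, read_len=150)
long_read_blocks = phase_blocks(sites, read_len=15_000)

print(len(short_read_blocks), "blocks with 150-base reads")
print(len(long_read_blocks), "blocks with 15,000-base reads")
```

With these made-up positions, 150-base reads leave every site in its own block, while 15,000-base reads join the first five sites into a single block and leave only the last site unlinked.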
Variant detection
The ability of HiFi sequencing reads to span large regions of the genome makes them adept at detecting variants on a genome-wide scale. Exceptionally large insertion-deletion events are notoriously difficult to detect and are an area of specialty for HiFi sequencing. Similarly, HiFi reads can help researchers detect changes in tandem repeats and other highly repetitive regions, which cannot be analyzed without long, accurate reads that span them completely. Until recently, genome-wide association studies (GWAS) have had difficulty explaining the heritability of complex diseases. However, the variant detection capabilities of HiFi sequencing have enhanced the correct identification of structural variants (genomic variants 50 to 1,000 bp or more). This has improved researchers’ ability to link disease phenotypes to novel genes and causative variants, enabling them to begin closing the gap on the missing heritability problem in certain genetic diseases.
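The size threshold mentioned above can be illustrated with a small classifier that sorts a variant by the length difference between its reference and alternate alleles. The 50 bp cutoff is taken from the definition in the text. The function name, the category labels, and the example alleles are assumptions made for this sketch and do not correspond to the output of any particular variant caller.

```python
def classify_variant(ref: str, alt: str, sv_min_len: int = 50) -> str:
    """Classify a variant by comparing reference and alternate allele lengths."""
    size = abs(len(alt) - len(ref))
    if size == 0:
        # Same length: a single-base change or a multi-base substitution.
        return "SNV" if len(ref) == 1 else "substitution"
    kind = "insertion" if len(alt) > len(ref) else "deletion"
    scale = "structural" if size >= sv_min_len else "small"
    return f"{scale} {kind}"


# Hypothetical example alleles.
examples = [
    ("A", "G"),             # single-base change
    ("ACGT", "A"),          # 3-base deletion
    ("A", "A" + "T" * 60),  # 60-base insertion
]

labels = [classify_variant(ref, alt) for ref, alt in examples]
print(labels)
```

A variant is only called correctly when the reads that support it cover the whole event plus some flanking sequence, which is why the larger categories in this sketch benefit most from long reads.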
Genome assembly
HiFi sequencing is the premier technology for highly accurate genome assembly across the full range of life forms, from bacteria to humans and even giant California redwoods. The length and accuracy of HiFi data ensure sufficient overlap between individual reads, even in areas of high homology, which enables assembly software such as hifiasm to reconstruct genomes that contain fewer errors and areas of uncertainty. Taking advantage of these strengths, scientists at the T2T Consortium used HiFi sequencing to help close the remaining 8% of missing information in the human genome and present to the world the first complete human genome assembly in March of 2022.
Epigenetics
The ability of HiFi sequencing to directly analyze sample molecules without an amplification step gives researchers access to base modification information (such as methylation) in addition to the traditional base-calling data. This opens a range of new possibilities for studies focused on understanding heritable changes in gene expression in humans and other organisms. Moreover, because this methylation data is generated in conjunction with other HiFi applications, researchers can pinpoint and study epigenetic effects in a haplotype-phased and variant-called genomic context. In a creative use of this methylation detection capability, scientists researching gene therapeutics have even begun to use HiFi methylation detection to identify breaks and structural defects in their designs.
The future of genomic discovery is long
As scientists continue to search for answers to biological questions that span everything from ecosystem function to human health, the need for increasingly powerful and sophisticated genomic tools has grown. For discovery-oriented research applications, long-read sequencing and especially HiFi sequencing hold tremendous promise as they can outperform the current standard in almost every aspect of genomic analysis. As a result, the potential for these state-of-the-art long-read technologies to usher in a new era of genomic discovery is no longer just around the corner; it is here.
References
M. Mahmoud, Y. Huang, K. Garimella, P. A. Audano, W. Wan, N. Prasad, R. E. Handsaker, S. Hall, A. Pionzio, M. C. Schatz, M. E. Talkowski, E. E. Eichler, S. E. Levy, F. J. Sedlazeck. Utility of long-read sequencing for All of Us. bioRxiv [Preprint]. 2023.01.23. doi: https://doi.org/10.1101/2023.01.23.525236