Scientific publications

Publications featuring PacBio long-read + short-read sequencing data

Genome Research  |  2025

Analytical validation of germline small variant detection using long-read HiFi genome sequencing

Nathan Hammond et al.

Long-read sequencing has the capacity to interrogate difficult genomic regions and phase variants; however, short-read sequencing is more commonly implemented for clinical testing. Given the advances in long-read HiFi sequencing chemistry and variant calling, we analytically validated this technology for small variant detection (single nucleotide variants, insertions/deletions; SNVs/indels; <50bp). HiFi genome sequencing was performed on DNA from reference materials and clinical specimen types, and accuracy results were compared to short-read genome sequencing data. HiFi genome sequencing recall and precision across Genome in a Bottle (GIAB)-defined nondifficult and difficult genomic regions (high confidence) for SNVs were >99.9% and >99.7%, respectively, and for indels were >99.8% and >99.1%, respectively. Moreover, HiFi genome sequencing outperformed short-read genome sequencing on overall SNV/indel F1-score accuracy at all paired sequencing depths, which were further stratified across 100 total GIAB-defined genomic regions for a comprehensive evaluation of performance. Of note, HiFi genome sequencing F1-scores for SNVs and indels surpassed 99% at ~15× and ~25×, respectively. In addition, high confidence small variant concordance across all HiFi genome sequencing reproducibility assessments (two specimens, three independent sequencing datasets) were >99.8% for SNVs and >98.6% for indels, and average high confidence small variant concordance between paired blood, saliva, and swab specimens were all >99.8%. Taken together, these data underscore that long-read HiFi genome sequencing detection of SNVs and indels is very accurate and robust, which supports the implementation of this technology for clinical diagnostic testing.
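For readers less familiar with the metrics quoted in this abstract, the sketch below shows how precision, recall, and F1-score are conventionally derived from true-positive, false-positive, and false-negative variant counts. It is only an illustration of the standard definitions: the function name and the counts in the usage note are hypothetical and are not taken from the study, whose benchmarking pipeline is not described here.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from variant-call counts.

    tp: calls that match the truth set
    fp: calls that are absent from the truth set
    fn: truth-set variants that were not called
    All three inputs are hypothetical illustration values, not study data.
    """
    # Precision: fraction of calls that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of true variants that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With made-up counts such as `precision_recall_f1(90, 10, 30)`, precision is 0.9, recall is 0.75, and F1 is 9/11 (about 0.818). Because F1 is a harmonic mean, it sits closer to the lower of the two values.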
Nature  |  2025

Complete sequencing of ape genomes

Yoo, D., Rhie, A., Hebbar, P. et al.

The most dynamic and repetitive regions of great ape genomes have traditionally been excluded from comparative studies1,2,3. Consequently, our understanding of the evolution of our species is incomplete. Here we present haplotype-resolved reference genomes and comparative analyses of six ape species: chimpanzee, bonobo, gorilla, Bornean orangutan, Sumatran orangutan and siamang. We achieve chromosome-level contiguity with substantial sequence accuracy (<1 error in 2.7 megabases) and completely sequence 215 gapless chromosomes telomere-to-telomere. We resolve challenging regions, such as the major histocompatibility complex and immunoglobulin loci, to provide in-depth evolutionary insights. Comparative analyses enabled investigations of the evolution and diversity of regions previously uncharacterized or incompletely studied without bias from mapping to the human reference genome. Such regions include newly minted gene families in lineage-specific segmental duplications, centromeric DNA, acrocentric chromosomes and subterminal heterochromatin. This resource serves as a comprehensive baseline for future evolutionary studies of humans and our closest living ape relatives.
PLOS Computational Biology  |  2025

Analysis of targeted and whole genome sequencing of PacBio HiFi reads for a comprehensive genotyping of gene-proximal and phenotype-associated Variable Number Tandem Repeats

Sara Javadzadeh, Aaron Adamson, Jonghun Park, Se-Young Jo, Yuan-Chun Ding, Mehrdad Bakhtiari, Vikas Bansal, Susan L. Neuhausen, Vineet Bafna

Variable Number Tandem Repeats (VNTRs) refer to repeating motifs of size greater than five bp. VNTRs are an important source of genetic variation, and have been associated with multiple Mendelian and complex phenotypes. However, the highly repetitive structures require reads to span the region for accurate genotyping. Pacific Biosciences HiFi sequencing spans large regions and is highly accurate but relatively expensive. Therefore, targeted sequencing approaches coupled with long-read sequencing have been proposed to improve efficiency and throughput. In this paper, we systematically explored the trade-off between targeted and whole genome HiFi sequencing for genotyping VNTRs. We curated a set of 10,787 gene-proximal (G-)VNTRs, and 48 phenotype-associated (P-)VNTRs of interest. Illumina reads only spanned 46% of the G-VNTRs and 71% of P-VNTRs, motivating the use of HiFi sequencing. We performed targeted sequencing with hybridization by designing custom probes for 9,999 VNTRs and sequenced 8 samples using HiFi and Illumina sequencing, followed by adVNTR genotyping. We compared these results against HiFi whole genome sequencing (WGS) data from 28 samples in the Human Pangenome Reference Consortium (HPRC). With the targeted approach only 4,091 (41%) G-VNTRs and only 4 (8%) of P-VNTRs were spanned with at least 15 reads. A smaller subset of 3,579 (36%) G-VNTRs had higher median coverage of at least 63 spanning reads. The spanning behavior was consistent across all 8 samples. Among 5,638 VNTRs with low-coverage (<15), 67% were located within GC-rich regions (>60%). In contrast, the 40X WGS HiFi dataset spanned 98% of all VNTRs and 49 (98%) of P-VNTRs with at least 15 spanning reads, albeit with lower coverage. Spanning reads were sufficient for accurate genotyping in both cases. Our findings demonstrate that targeted sequencing provides consistently high coverage for a small subset of low-GC VNTRs, but WGS is more effective for broad and sufficient sampling of a large number of VNTRs.
bioRxiv  |  2025

Genetic diversity and regulatory features of human-specific NOTCH2NL duplications

Taylor D. Real, Prajna Hebbar, DongAhn Yoo, Francesca Antonacci, Ivana Pačar, Mark Diekhans, Gregory J. Mikol, Oyeronke G. Popoola, Benjamin J. Mallory, Mitchell R. Vollger, Philip C. Dishuck, Xavi Guitart, Allison N. Rozanski, Katherine M. Munson, Kendra Hoekzema, Jane E. Ranchalis, Shane J. Neph, Adriana E. Sedeño-Cortes, Benedict Paten, Sofie R. Salama, Andrew B. Stergachis, Evan E. Eichler

NOTCH2NL (NOTCH2-N-terminus-like) genes arose from incomplete, recent chromosome 1 segmental duplications implicated in human brain cortical expansion. Genetic characterization of these loci and their regulation is complicated by the fact they are embedded in large, nearly identical duplications that predispose to recurrent microdeletion syndromes. Using nearly complete long-read assemblies generated from 67 human and 12 ape haploid genomes, we show independent recurrent duplication among apes with functional copies emerging in humans ∼2.1 million years ago. We distinguish NOTCH2NL paralogs present in every human haplotype (NOTCH2NLA) from copy number variable ones. We also characterize large-scale structural variation, including gene conversion, for 28% of haplotypes leading to a previously undescribed paralog, NOTCH2tv. Finally, we apply Fiber-seq and long-read transcript sequencing to human cortical neurospheres to characterize the regulatory landscape and find that the most fixed paralogs, NOTCH2 and NOTCH2NLA, harbor the greatest number of paralog-specific elements potentially driving their regulation.
Nature  |  2025

Solanum pan-genetics reveals paralogues as contingencies in crop engineering

Benoit, M., Jenike, K.M., Satterlee, J.W. et al.

Pan-genomics and genome-editing technologies are revolutionizing breeding of global crops1,2. A transformative opportunity lies in exchanging genotype-to-phenotype knowledge between major crops (that is, those cultivated globally) and indigenous crops (that is, those locally cultivated within a circumscribed area)3,4,5 to enhance our food system. However, species-specific genetic variants and their interactions with desirable natural or engineered mutations pose barriers to achieving predictable phenotypic effects, even between related crops6,7. Here, by establishing a pan-genome of the crop-rich genus Solanum8 and integrating functional genomics and pan-genetics, we show that gene duplication and subsequent paralogue diversification are major obstacles to genotype-to-phenotype predictability. Despite broad conservation of gene macrosynteny among chromosome-scale references for 22 species, including 13 indigenous crops, thousands of gene duplications, particularly within key domestication gene families, exhibited dynamic trajectories in sequence, expression and function. By augmenting our pan-genome with African eggplant cultivars9 and applying quantitative genetics and genome editing, we dissected an intricate history of paralogue evolution affecting fruit size. The loss of a redundant paralogue of the classical fruit size regulator CLAVATA3 (CLV3)10,11 was compensated by a lineage-specific tandem duplication. Subsequent pseudogenization of the derived copy, followed by a large cultivar-specific deletion, created a single fused CLV3 allele that modulates fruit organ number alongside an enzymatic gene controlling the same trait. Our findings demonstrate that paralogue diversifications over short timescales are underexplored contingencies in trait evolvability. Exposing and navigating these contingencies is crucial for translating genotype-to-phenotype relationships across species.
Nature Genetics  |  2025

Long-read RNA sequencing atlas of human microglia isoforms elucidates disease-associated genetic regulation of splicing

Humphrey, J., Brophy, E., Kosoy, R. et al.

Microglia, the innate immune cells of the central nervous system, have been genetically implicated in multiple neurodegenerative diseases. Mapping the genetics of gene expression in human microglia has identified several loci associated with disease-associated genetic variants in microglia-specific regulatory elements. However, identifying genetic effects on splicing is challenging because of the use of short sequencing reads. Here, we present the isoform-centric microglia genomic atlas (isoMiGA), which leverages long-read RNA sequencing to identify 35,879 novel microglia isoforms. We show that these isoforms are involved in stimulation response and brain region specificity. We then quantified the expression of both known and novel isoforms in a multi-ancestry meta-analysis of 555 human microglia short-read RNA sequencing samples from 391 donors, and found associations with genetic risk loci in Alzheimer’s and Parkinson’s disease. We nominate several loci that may act through complex changes in isoform and splice-site usage.
Liebert Pub  |  2025

Analysis of HIV-1-Based Lentiviral Vector Particle Composition by PacBio Long-Read Nucleic Acid Sequencing

Saqlain Suleman, Mohammad S. Khalifa, Serena Fawaz, Sharmin Alhaque, Yaghoub Chinea, and Michael Themis

Lentivirus (LV) vectors offer permanent delivery of therapeutic genes to the host through an RNA intermediate genome. They are one of the most commonly used vectors for clinical gene therapy of inherited disorders such as immune deficiencies and cancer immunotherapy. One of the most difficult challenges facing their widespread application to patients is the large-scale production of highly pure vector stocks. To improve vector production and downstream purification, there has been a recent investment in the United Kingdom to establish good manufacturing process (GMP)-licensed centers for manufacture and quality control. Other requirements for these vectors include their target cell specificity and tropism, how to regulate gene expression of the therapeutic payload and their potential side effects. Comprehensive detail on the full nucleic acid content of LV is unknown, even though they have entered clinical trials. With potential adverse effects in mind, it is important to identify these contents to assess their safety and purity. In this study, we used highly sensitive PacBio long-distance, next-generation sequencing of reverse-transcribed vector component RNA to investigate the nucleic acid composition of recombinant HIV-1 particles generated by human 293T packaging cells. In this article, we describe our findings of nucleic acids other than the recombinant vector genome that exist, which could potentially be delivered during gene transfer, and suggest that removal of these unwanted components be considered before clinical LV application.
bioRxiv  |  2025

The human immunoglobulin heavy chain constant gene locus is enriched for large complex structural variants and coding polymorphisms that vary in frequency among human populations

Uddalok Jana, Oscar L. Rodriguez, William Lees, Eric Engelbrecht, Zach Vanwinkle, Ayelet Peres, William S. Gibson, Kaitlyn Shields, Steven Schultze, Abdullah Dorgham, Matthew Emery, Gintaras Deikus, Robert Sebra, Evan E. Eichler, Gur Yaari, Melissa L. Smith, Corey T. Watson

The immunoglobulin heavy chain constant (IGHC) domain of antibodies (Ab) is responsible for effector functions critical to Ab mediated immunity. In humans, this domain is encoded by genes within the IGHC locus, where descriptions of genomic diversity remain incomplete. To address this, we utilized long-read genomic datasets to build a high-quality IGHC haplotype/variant catalog from 105 individuals of diverse ancestry, and developed a high-throughput approach for targeted long-read IGHC locus sequencing and assembly. From locally phased assemblies, we discovered previously uncharacterized single nucleotide variants (SNV) and complex structural variants (SVs, n=7), as well as novel genes and alleles. Of the 262 identified IGHC coding alleles, 235 (89.6%) were undocumented. SNV, SV, and gene allele/genotype frequencies revealed significant population differentiation, including; (i) hundreds of SNVs in African and East Asian populations exceeding fixation index (FST) of 0.3, (ii) and an IGHG4 haplotype carrying specific coding variants uniquely enriched in East and South Asian populations. Our results illuminate missing signatures of haplotype diversity in the IGHC locus, including evidence of natural selection, and establish a new foundation for investigating IGHC germline variation and its role in Ab function and disease.
medRxiv  |  2025

Long-read sequencing resolves the clinically relevant CYP21A2 locus, supporting a new clinical test for Congenital Adrenal Hyperplasia

Jean Monlong, Xiao Chen, Hayk Barseghyan, William J Rowell, Shloka Negi, Natalie Nokoff, Lauren Mohnach, Josephine Hirsch, Courtney Finlayson, Catherine E. Keegan, Miguel Almalvez, Seth I. Berger, Ivan de Dios, Brandy McNulty, Alex Robertson, Karen H. Miga, Phyllis W. Speiser, Benedict Paten, Eric Vilain, Emmanuèle C. Délot

Both HiFi-based and nanopore-based whole-genome long-read sequencing datasets could be mined to accurately identify pathogenic single-nucleotide variants, full gene deletions, fusions creating non-functional hybrids between the gene and pseudogene (“30-kb deletion”), as well as count the number of RCCX modules and phase the resulting multimodular haplotypes. On the HiFi dataset of 6 samples, the PacBio Paraphase tool was able to distinguish nine different mono-, bi-, and tri-modular haplotypes, as well as the 30-kb and whole gene deletions. To do the same on the ONT-Nanopore dataset, we designed a tool, Parakit, which creates an enriched local pangenome to represent known haplotype assemblies and map ClinVar pathogenic variants and fusions onto them. With few labels in the region, optical genome mapping was not able to reliably resolve module counts or fusions, although designing a tool to mine the dataset specifically for this region may allow doing so in the future. Both sequencing techniques yielded congruent results, matching clinically identified variants, and offered additional information above the clinical test, including phasing, count of RCCX modules, and status of the other module genes, all of which may be of clinical relevance. Thus, long-read sequencing could be used to identify variants causing multiple forms of CAH in a single test.
Oxford Academics  |  2025

Long and Accurate: How HiFi Sequencing is Transforming Genomics

Bo Wang, Peng Jia, Shenghan Gao, Huanhuan Zhao, Gaoyang Zheng, Linfeng Xu, Kai Ye

Recent developments in PacBio high-fidelity (HiFi) sequencing technologies have transformed genomic research, with circular consensus sequencing now achieving 99.9% accuracy for long (up to 25 kb) single-molecule reads. This method circumvents biases intrinsic to amplification-based approaches, enabling thorough analysis of complex genomic regions [including tandem repeats, segmental duplications, ribosomal DNA (rDNA) arrays, and centromeres] as well as direct detection of base modifications, furnishing both sequence and epigenetic data concurrently. This has streamlined a number of tasks including genome assembly, variant detection, and full-length transcript analysis. This review provides a comprehensive overview of the applications and challenges of HiFi sequencing across various fields, including genomics, transcriptomics, and epigenetics. By delineating the evolving landscape of HiFi sequencing in multi-omics research, we highlight its potential to deepen our understanding of genetic mechanisms and to advance precision medicine.
bioRxiv  |  2025

CiFi: Accurate long-read chromatin conformation capture with low-input requirements

Sean P. McGinty, Gulhan Kaya, Sheina B. Sim, Renée L. Corpuz, Michael A. Quail, Mara K. N. Lawniczak, Scott M. Geib, Jonas Korlach, Megan Y. Dennis

By coupling chromatin conformation capture (3C) with PacBio HiFi long-read sequencing, we have developed a new method (CiFi) that enables analysis of genome interactions across repetitive genomic regions with low-input requirements. CiFi produces multiple interacting concatemer segments per read, facilitating genome assembly and scaffolding. Together, the approach enables genomic analysis of previously recalcitrant low-complexity loci, and of small organisms such as single insect individuals.
Nature  |  2025

A Near Complete Genome Assembly of the Oshima Cherry Cerasus speciosa

Kazumichi Fujiwara, Atsushi Toyoda, Bhim B. Biswa, Takushi Kishida, Momi Tsuruta, Yasukazu Nakamura, Noriko Kimura, Shoko Kawamoto, Yutaka Sato, Toshio Katsuki, Sakura 100 Genome Consortium & Tsuyoshi Koide

The Oshima cherry (Cerasus speciosa), which is endemic to Japan, has significant cultural and horticultural value. In this study, we present a near complete telomere-to-telomere genome assembly for C. speciosa, derived from the old growth “Sakurakkabu” tree on Izu Oshima Island. Using Illumina short-read, PacBio long-read, and Hi-C sequencing, we constructed a 269.3 Mbp genome assembly with a contig N50 of 32.0 Mbp. We examined the distribution of repetitive sequences in the assembled genome and identified regions that appeared to be centromeric. Detailed structural analysis of these putative centromeric regions revealed that the centromeric regions of C. speciosa comprised repetitive sequences with monomer lengths of 166 or 167 bp. Comparative genomic analysis with Prunus sensu lato genome revealed structural variations and conserved syntenic regions. This high-quality reference genome provides a crucial tool for studying the genetic diversity and evolutionary history of Cerasus species, facilitating advancements in horticultural research and the preservation of this iconic species.
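The contig N50 quoted in this abstract is a standard contiguity statistic: the length of the contig at which the running total, taken from the longest contig downward, first reaches half of the total assembly size. The sketch below illustrates that common definition only. The function name and the lengths in the usage note are hypothetical, and the authors' assembly-evaluation tooling may handle ties or scaffolds differently.

```python
def n50(contig_lengths: list[int]) -> int:
    """Return the N50 of a list of contig lengths (in bp).

    Uses the common definition: sort contigs from longest to shortest and
    return the length of the contig at which the cumulative sum first
    reaches at least half of the total assembly length.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty positive lengths")
```

For the made-up lengths `[2, 3, 4, 5, 6, 7, 8, 9, 10]` (total 54), the three longest contigs sum to 27, exactly half the total, so `n50` returns 8.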
Nature  |  2025

Seasonal recurrence and modular assembly of an Arctic pelagic marine microbiome

Taylor Priest, Ellen Oldenburg, Ovidiu Popa, Bledina Dede, Katja Metfies, Wilken-Jon von Appen, Sinhué Torres-Valdés, Christina Bienhold, Bernhard M. Fuchs, Rudolf Amann, Antje Boetius & Matthias Wietz

Deciphering how microbial communities are shaped by environmental variability is fundamental for understanding the structure and function of ocean ecosystems. While seasonal environmental gradients have been shown to structure the taxonomic dynamics of microbiomes over time, little is known about their impact on functional dynamics and the coupling between taxonomy and function. Here, we demonstrate annually recurrent, seasonal structuring of taxonomic and functional dynamics in a pelagic Arctic Ocean microbiome by combining autonomous samplers and in situ sensors with long-read metagenomics and SSU ribosomal metabarcoding. Specifically, we identified five temporal microbiome modules whose succession within each annual cycle represents a transition across different ecological states. For instance, Cand. Nitrosopumilus, Syndiniales, and the machinery to oxidise ammonia and reduce nitrite are signatures of early polar night, while late summer is characterised by Amylibacter and sulfur compound metabolism. Leveraging metatranscriptomes from Tara Oceans, we also demonstrate the consistency in functional dynamics across the wider Arctic Ocean during similar temporal periods. Furthermore, the structuring of genetic diversity within functions over time indicates that environmental selection pressure acts heterogeneously on microbiomes across seasons. By integrating taxonomic, functional and environmental information, our study provides fundamental insights into how microbiomes are structured under pronounced seasonal changes in understudied, yet rapidly changing polar marine ecosystems.
Nature  |  2025

Synchronized long-read genome, methylome, epigenome and transcriptome profiling resolve a Mendelian condition

Mitchell R. Vollger, Jonas Korlach, Kiara C. Eldred, Elliott Swanson, Jason G. Underwood, Stephanie C. Bohaczuk, Yizi Mao, Yong-Han H. Cheng, Jane Ranchalis, Elizabeth E. Blue, Ulrike Schwarze, Katherine M. Munson, Christopher T. Saunders, Aaron M. Wenger, Aimee Allworth, Sirisak Chanprasert, Brittney L. Duerden, Ian Glass, Martha Horike-Pyne, Michelle Kim, Kathleen A. Leppig, Ian J. McLaughlin, Jessica Ogawa, Elisabeth A. Rosenthal, University of Washington Center for Rare Disease Research, Undiagnosed Diseases Network, …Andrew B. Stergachis

Resolving the molecular basis of a Mendelian condition remains challenging owing to the diverse mechanisms by which genetic variants cause disease. To address this, we developed a synchronized long-read genome, methylome, epigenome and transcriptome sequencing approach, which enables accurate single-nucleotide, insertion–deletion and structural variant calling and diploid de novo genome assembly. This permits the simultaneous elucidation of haplotype-resolved CpG methylation, chromatin accessibility and full-length transcript information in a single long-read sequencing run. Application of this approach to an Undiagnosed Diseases Network participant with a chromosome X;13-balanced translocation of uncertain significance revealed that this translocation disrupted the functioning of four separate genes (NBEA, PDK3, MAB21L1 and RB1) previously associated with single-gene Mendelian conditions. Notably, the function of each gene was disrupted via a distinct mechanism that required integration of the four ‘omes’ to resolve. These included fusion transcript formation, enhancer adoption, transcriptional readthrough silencing and inappropriate X-chromosome inactivation of autosomal genes. Overall, this highlights the utility of synchronized long-read multi-omic profiling for mechanistically resolving complex phenotypes.