Scientific publications

Publications featuring PacBio long-read + short-read sequencing data

bioRxiv  |  2025

Genetic diversity and regulatory features of human-specific NOTCH2NL duplications

Taylor D. Real, Prajna Hebbar, DongAhn Yoo, Francesca Antonacci, Ivana Pačar, Mark Diekhans, Gregory J. Mikol, Oyeronke G. Popoola, Benjamin J. Mallory, Mitchell R. Vollger, Philip C. Dishuck, Xavi Guitart, Allison N. Rozanski, Katherine M. Munson, Kendra Hoekzema, Jane E. Ranchalis, Shane J. Neph, Adriana E. Sedeño-Cortes, Benedict Paten, Sofie R. Salama, Andrew B. Stergachis, Evan E. Eichler

NOTCH2NL (NOTCH2-N-terminus-like) genes arose from incomplete, recent chromosome 1 segmental duplications implicated in human brain cortical expansion. Genetic characterization of these loci and their regulation is complicated by the fact they are embedded in large, nearly identical duplications that predispose to recurrent microdeletion syndromes. Using nearly complete long-read assemblies generated from 67 human and 12 ape haploid genomes, we show independent recurrent duplication among apes with functional copies emerging in humans ∼2.1 million years ago. We distinguish NOTCH2NL paralogs present in every human haplotype (NOTCH2NLA) from copy number variable ones. We also characterize large-scale structural variation, including gene conversion, for 28% of haplotypes leading to a previously undescribed paralog, NOTCH2tv. Finally, we apply Fiber-seq and long-read transcript sequencing to human cortical neurospheres to characterize the regulatory landscape and find that the most fixed paralogs, NOTCH2 and NOTCH2NLA, harbor the greatest number of paralog-specific elements potentially driving their regulation.
Nature  |  2025

Solanum pan-genetics reveals paralogues as contingencies in crop engineering

Benoit, M., Jenike, K.M., Satterlee, J.W. et al.

Pan-genomics and genome-editing technologies are revolutionizing breeding of global crops [1,2]. A transformative opportunity lies in exchanging genotype-to-phenotype knowledge between major crops (that is, those cultivated globally) and indigenous crops (that is, those locally cultivated within a circumscribed area) [3,4,5] to enhance our food system. However, species-specific genetic variants and their interactions with desirable natural or engineered mutations pose barriers to achieving predictable phenotypic effects, even between related crops [6,7]. Here, by establishing a pan-genome of the crop-rich genus Solanum [8] and integrating functional genomics and pan-genetics, we show that gene duplication and subsequent paralogue diversification are major obstacles to genotype-to-phenotype predictability. Despite broad conservation of gene macrosynteny among chromosome-scale references for 22 species, including 13 indigenous crops, thousands of gene duplications, particularly within key domestication gene families, exhibited dynamic trajectories in sequence, expression and function. By augmenting our pan-genome with African eggplant cultivars [9] and applying quantitative genetics and genome editing, we dissected an intricate history of paralogue evolution affecting fruit size. The loss of a redundant paralogue of the classical fruit size regulator CLAVATA3 (CLV3) [10,11] was compensated by a lineage-specific tandem duplication. Subsequent pseudogenization of the derived copy, followed by a large cultivar-specific deletion, created a single fused CLV3 allele that modulates fruit organ number alongside an enzymatic gene controlling the same trait. Our findings demonstrate that paralogue diversifications over short timescales are underexplored contingencies in trait evolvability. Exposing and navigating these contingencies is crucial for translating genotype-to-phenotype relationships across species.
Nature Genetics  |  2025

Long-read RNA sequencing atlas of human microglia isoforms elucidates disease-associated genetic regulation of splicing

Humphrey, J., Brophy, E., Kosoy, R. et al.

Microglia, the innate immune cells of the central nervous system, have been genetically implicated in multiple neurodegenerative diseases. Mapping the genetics of gene expression in human microglia has identified several loci associated with disease-associated genetic variants in microglia-specific regulatory elements. However, identifying genetic effects on splicing is challenging because of the use of short sequencing reads. Here, we present the isoform-centric microglia genomic atlas (isoMiGA), which leverages long-read RNA sequencing to identify 35,879 novel microglia isoforms. We show that these isoforms are involved in stimulation response and brain region specificity. We then quantified the expression of both known and novel isoforms in a multi-ancestry meta-analysis of 555 human microglia short-read RNA sequencing samples from 391 donors, and found associations with genetic risk loci in Alzheimer’s and Parkinson’s disease. We nominate several loci that may act through complex changes in isoform and splice-site usage.
Liebert Pub  |  2025

Analysis of HIV-1-Based Lentiviral Vector Particle Composition by PacBio Long-Read Nucleic Acid Sequencing

Saqlain Suleman, Mohammad S. Khalifa, Serena Fawaz, Sharmin Alhaque, Yaghoub Chinea, and Michael Themis

Lentivirus (LV) vectors offer permanent delivery of therapeutic genes to the host through an RNA intermediate genome. They are one of the most commonly used vectors for clinical gene therapy of inherited disorders such as immune deficiencies and cancer immunotherapy. One of the most difficult challenges facing their widespread application to patients is the large-scale production of highly pure vector stocks. To improve vector production and downstream purification, there has been a recent investment in the United Kingdom to establish good manufacturing process (GMP)-licensed centers for manufacture and quality control. Other requirements for these vectors include their target cell specificity and tropism, how to regulate gene expression of the therapeutic payload and their potential side effects. Comprehensive detail on the full nucleic acid content of LV is unknown, even though they have entered clinical trials. With potential adverse effects in mind, it is important to identify these contents to assess their safety and purity. In this study, we used highly sensitive PacBio long-distance, next-generation sequencing of reverse-transcribed vector component RNA to investigate the nucleic acid composition of recombinant HIV-1 particles generated by human 293T packaging cells. In this article, we describe our findings of nucleic acids other than the recombinant vector genome that exist, which could potentially be delivered during gene transfer, and suggest that removal of these unwanted components be considered before clinical LV application.
bioRxiv  |  2025

The human immunoglobulin heavy chain constant gene locus is enriched for large complex structural variants and coding polymorphisms that vary in frequency among human populations

Uddalok Jana, Oscar L. Rodriguez, William Lees, Eric Engelbrecht, Zach Vanwinkle, Ayelet Peres, William S. Gibson, Kaitlyn Shields, Steven Schultze, Abdullah Dorgham, Matthew Emery, Gintaras Deikus, Robert Sebra, Evan E. Eichler, Gur Yaari, Melissa L. Smith, Corey T. Watson

The immunoglobulin heavy chain constant (IGHC) domain of antibodies (Ab) is responsible for effector functions critical to Ab-mediated immunity. In humans, this domain is encoded by genes within the IGHC locus, where descriptions of genomic diversity remain incomplete. To address this, we utilized long-read genomic datasets to build a high-quality IGHC haplotype/variant catalog from 105 individuals of diverse ancestry, and developed a high-throughput approach for targeted long-read IGHC locus sequencing and assembly. From locally phased assemblies, we discovered previously uncharacterized single nucleotide variants (SNVs) and complex structural variants (SVs, n=7), as well as novel genes and alleles. Of the 262 identified IGHC coding alleles, 235 (89.6%) were undocumented. SNV, SV, and gene allele/genotype frequencies revealed significant population differentiation, including (i) hundreds of SNVs in African and East Asian populations exceeding a fixation index (FST) of 0.3, and (ii) an IGHG4 haplotype carrying specific coding variants uniquely enriched in East and South Asian populations. Our results illuminate missing signatures of haplotype diversity in the IGHC locus, including evidence of natural selection, and establish a new foundation for investigating IGHC germline variation and its role in Ab function and disease.
medRxiv  |  2025

Long-read sequencing resolves the clinically relevant CYP21A2 locus, supporting a new clinical test for Congenital Adrenal Hyperplasia

Jean Monlong, Xiao Chen, Hayk Barseghyan, William J Rowell, Shloka Negi, Natalie Nokoff, Lauren Mohnach, Josephine Hirsch, Courtney Finlayson, Catherine E. Keegan, Miguel Almalvez, Seth I. Berger, Ivan de Dios, Brandy McNulty, Alex Robertson, Karen H. Miga, Phyllis W. Speiser, Benedict Paten, Eric Vilain, Emmanuèle C. Délot

Both HiFi-based and nanopore-based whole-genome long-read sequencing datasets could be mined to accurately identify pathogenic single-nucleotide variants, full gene deletions, fusions creating non-functional hybrids between the gene and pseudogene (“30-kb deletion”), as well as count the number of RCCX modules and phase the resulting multimodular haplotypes. On the HiFi dataset of 6 samples, the PacBio Paraphase tool was able to distinguish nine different mono-, bi-, and tri-modular haplotypes, as well as the 30-kb and whole gene deletions. To do the same on the ONT-Nanopore dataset, we designed a tool, Parakit, which creates an enriched local pangenome to represent known haplotype assemblies and map ClinVar pathogenic variants and fusions onto them. With few labels in the region, optical genome mapping was not able to reliably resolve module counts or fusions, although designing a tool to mine the dataset specifically for this region may allow doing so in the future. Both sequencing techniques yielded congruent results, matching clinically identified variants, and offered additional information beyond the clinical test, including phasing, count of RCCX modules, and status of the other module genes, all of which may be of clinical relevance. Thus, long-read sequencing could be used to identify variants causing multiple forms of CAH in a single test.
Oxford Academic  |  2025

Long and Accurate: How HiFi Sequencing is Transforming Genomics

Bo Wang, Peng Jia, Shenghan Gao, Huanhuan Zhao, Gaoyang Zheng, Linfeng Xu, Kai Ye

Recent developments in PacBio high-fidelity (HiFi) sequencing technologies have transformed genomic research, with circular consensus sequencing now achieving 99.9% accuracy for long (up to 25 kb) single-molecule reads. This method circumvents biases intrinsic to amplification-based approaches, enabling thorough analysis of complex genomic regions [including tandem repeats, segmental duplications, ribosomal DNA (rDNA) arrays, and centromeres] as well as direct detection of base modifications, furnishing both sequence and epigenetic data concurrently. This has streamlined a number of tasks including genome assembly, variant detection, and full-length transcript analysis. This review provides a comprehensive overview of the applications and challenges of HiFi sequencing across various fields, including genomics, transcriptomics, and epigenetics. By delineating the evolving landscape of HiFi sequencing in multi-omics research, we highlight its potential to deepen our understanding of genetic mechanisms and to advance precision medicine.
bioRxiv  |  2025

CiFi: Accurate long-read chromatin conformation capture with low-input requirements

Sean P McGinty, Gulhan Kaya, Sheina B. Sim, Renée Lynn Corpuz, Michael A Quail, Mara KN Lawniczak, Scott M Geib, Jonas Korlach, Megan Y Dennis

By coupling chromatin conformation capture (3C) with PacBio HiFi long-read sequencing, we have developed a new method (CiFi) that enables analysis of genome interactions across repetitive genomic regions with low-input requirements. CiFi produces multiple interacting concatemer segments per read, facilitating genome assembly and scaffolding. Together, the approach enables genomic analysis of previously recalcitrant low-complexity loci, and of small organisms such as single insect individuals.
Nature  |  2025

A Near Complete Genome Assembly of the Oshima Cherry Cerasus speciosa

Kazumichi Fujiwara, Atsushi Toyoda, Bhim B. Biswa, Takushi Kishida, Momi Tsuruta, Yasukazu Nakamura, Noriko Kimura, Shoko Kawamoto, Yutaka Sato, Toshio Katsuki, Sakura 100 Genome Consortium & Tsuyoshi Koide

The Oshima cherry (Cerasus speciosa), which is endemic to Japan, has significant cultural and horticultural value. In this study, we present a near complete telomere-to-telomere genome assembly for C. speciosa, derived from the old growth “Sakurakkabu” tree on Izu Oshima Island. Using Illumina short-read, PacBio long-read, and Hi-C sequencing, we constructed a 269.3 Mbp genome assembly with a contig N50 of 32.0 Mbp. We examined the distribution of repetitive sequences in the assembled genome and identified regions that appeared to be centromeric. Detailed structural analysis of these putative centromeric regions revealed that the centromeric regions of C. speciosa comprised repetitive sequences with monomer lengths of 166 or 167 bp. Comparative genomic analysis with Prunus sensu lato genome revealed structural variations and conserved syntenic regions. This high-quality reference genome provides a crucial tool for studying the genetic diversity and evolutionary history of Cerasus species, facilitating advancements in horticultural research and the preservation of this iconic species.
Nature  |  2025

Seasonal recurrence and modular assembly of an Arctic pelagic marine microbiome

Taylor Priest, Ellen Oldenburg, Ovidiu Popa, Bledina Dede, Katja Metfies, Wilken-Jon von Appen, Sinhué Torres-Valdés, Christina Bienhold, Bernhard M. Fuchs, Rudolf Amann, Antje Boetius & Matthias Wietz

Deciphering how microbial communities are shaped by environmental variability is fundamental for understanding the structure and function of ocean ecosystems. While seasonal environmental gradients have been shown to structure the taxonomic dynamics of microbiomes over time, little is known about their impact on functional dynamics and the coupling between taxonomy and function. Here, we demonstrate annually recurrent, seasonal structuring of taxonomic and functional dynamics in a pelagic Arctic Ocean microbiome by combining autonomous samplers and in situ sensors with long-read metagenomics and SSU ribosomal metabarcoding. Specifically, we identified five temporal microbiome modules whose succession within each annual cycle represents a transition across different ecological states. For instance, Cand. Nitrosopumilus, Syndiniales, and the machinery to oxidise ammonia and reduce nitrite are signatures of early polar night, while late summer is characterised by Amylibacter and sulfur compound metabolism. Leveraging metatranscriptomes from Tara Oceans, we also demonstrate the consistency in functional dynamics across the wider Arctic Ocean during similar temporal periods. Furthermore, the structuring of genetic diversity within functions over time indicates that environmental selection pressure acts heterogeneously on microbiomes across seasons. By integrating taxonomic, functional and environmental information, our study provides fundamental insights into how microbiomes are structured under pronounced seasonal changes in understudied, yet rapidly changing polar marine ecosystems.
Nature  |  2025

Synchronized long-read genome, methylome, epigenome and transcriptome profiling resolve a Mendelian condition

Mitchell R. Vollger, Jonas Korlach, Kiara C. Eldred, Elliott Swanson, Jason G. Underwood, Stephanie C. Bohaczuk, Yizi Mao, Yong-Han H. Cheng, Jane Ranchalis, Elizabeth E. Blue, Ulrike Schwarze, Katherine M. Munson, Christopher T. Saunders, Aaron M. Wenger, Aimee Allworth, Sirisak Chanprasert, Brittney L. Duerden, Ian Glass, Martha Horike-Pyne, Michelle Kim, Kathleen A. Leppig, Ian J. McLaughlin, Jessica Ogawa, Elisabeth A. Rosenthal, University of Washington Center for Rare Disease Research, Undiagnosed Diseases Network, …Andrew B. Stergachis

Resolving the molecular basis of a Mendelian condition remains challenging owing to the diverse mechanisms by which genetic variants cause disease. To address this, we developed a synchronized long-read genome, methylome, epigenome and transcriptome sequencing approach, which enables accurate single-nucleotide, insertion–deletion and structural variant calling and diploid de novo genome assembly. This permits the simultaneous elucidation of haplotype-resolved CpG methylation, chromatin accessibility and full-length transcript information in a single long-read sequencing run. Application of this approach to an Undiagnosed Diseases Network participant with a chromosome X;13-balanced translocation of uncertain significance revealed that this translocation disrupted the functioning of four separate genes (NBEA, PDK3, MAB21L1 and RB1) previously associated with single-gene Mendelian conditions. Notably, the function of each gene was disrupted via a distinct mechanism that required integration of the four ‘omes’ to resolve. These included fusion transcript formation, enhancer adoption, transcriptional readthrough silencing and inappropriate X-chromosome inactivation of autosomal genes. Overall, this highlights the utility of synchronized long-read multi-omic profiling for mechanistically resolving complex phenotypes.
Nature  |  2025

Evaluating the efficiency of 16S-ITS-23S operon sequencing for species level resolution in microbial communities

Rapid advancements in long-read sequencing have facilitated species-level microbial profiling through full-length 16S rRNA sequencing (~ 1500 bp), and more notably, by the newer 16S-ITS-23S ribosomal RNA operon (RRN) sequencing (~ 4500 bp). RRN sequencing is emerging as a superior method for species resolution, exceeding the capabilities of short-read and full-length 16S rRNA sequencing. However, being in its early stages of development, RRN sequencing has several underexplored or understudied elements, highlighting the need for a critical and thorough examination of its methodologies. Key areas that require detailed analysis include understanding how primer pairs, sequencing platforms, and classifiers and databases affect the accuracy of species resolution achieved through RRN sequencing. Our study addresses these gaps by evaluating the effect of primer pairs using four RRN primer combinations, and that of sequencing platforms by employing PacBio and Oxford Nanopore Technologies (ONT) systems. Furthermore, two classification methods (Minimap2 and OTU clustering), in combination with four RRN reference databases (MIrROR, rrnDB, and two versions of GROND) were compared to identify consistent and accurate classification methods with RRN sequencing. Here we demonstrate that RRN primer pair choice and sequencing platform do not substantially bias taxonomic profiles for most of the tested mock communities, while classification methods significantly impact the accuracy of species-level assignments. Of the classification methods tested, Minimap2 classifier in combination with the GROND database most consistently provided accurate species-level classification across the communities tested, irrespective of sequencing platform.
Clinical Genetics  |  2025

Haplotype Phasing of Biallelic WNT10B Variants Using Long-Read Sequencing in Split-Hand/Foot Malformation Syndrome

Jelena Pozojevic, Naseebullah Kakar, Henrike L. Sczakiel, Nathalie Kruse, Kristian Händler, Saranya Balachandran, Varun Sreenivasan, Martin A. Mensah, Malte Spielmann

Split-hand/foot malformation syndrome (SHFM) is a congenital limb malformation that is both clinically and genetically heterogeneous. Variants in WNT10B are known to cause an autosomal recessive form of SHFM. Here, we report a patient born to unrelated parents who was found to be a compound heterozygote for missense variants in WNT10B: c.994C>T, p.(Arg332Trp) and c.638T>G, p.(Phe213Cys). The variants were identified using long-read PacBio sequencing, which enabled phasing and confirmed that they were located on different alleles. The maternally inherited variant p.(Arg332Trp) has been previously reported, whereas the paternally inherited variant p.(Phe213Cys) is novel and absent from the gnomAD database. Our findings highlight the utility of long-read haplotype phasing, which provides valuable insights in determining the biallelic nature of variants in recessive disorders when parental DNA samples are unavailable.
bioRxiv  |  2025

Long-Read Low-Pass Sequencing for High-Resolution Trait Mapping

Kendall Lee, Walid Korani, Philip C. Bentz, Sameer Pokhrel, Peggy Ozias-Akins, Alex Harkess, Justin Vaughn, Josh Clevenger

Accelerating crop improvement is critical to meeting food security demands in a changing climate. Long-read sequencing offers advantages over short reads in resolving structural variations (SVs) and aligning to complex genomes, but its high cost has limited adoption in breeding programs. Here we develop a high-throughput, scalable approach for long-read low-pass (LRLP) sequencing and variant analysis with PacBio HiFi reads, and apply it to trait mapping in a complex tetraploid peanut (Arachis hypogaea) genome multi-parent advanced generation intercross. We analyze LRLP using both a single reference genome and a pangraph, using both proprietary and open-source tools to analyze SVs and coverage. An increased number of variants are consistently called for LRLP data compared to short-read data. At 1.63x average depth, LRLP sequencing covered 55% of the genome and 58% of gene space, outperforming 1.68x depth short-read low-pass sequencing, which achieved only 17% and 11%, respectively. Enhanced data retention after filtering for probabilistic misalignment and an ∼8.5x decrease in cost per value further demonstrated LRLP’s efficacy. Our results highlight LRLP sequencing as a scalable, cost-effective tool for high-resolution trait mapping, with transformative potential for plant breeding and broader genomic applications.