Scientific publications

Publications featuring PacBio long-read + short-read sequencing data

Liebert Pub  |  2025

Analysis of HIV-1-Based Lentiviral Vector Particle Composition by PacBio Long-Read Nucleic Acid Sequencing

Saqlain Suleman, Mohammad S. Khalifa, Serena Fawaz, Sharmin Alhaque, Yaghoub Chinea, and Michael Themis

Lentivirus (LV) vectors offer permanent delivery of therapeutic genes to the host through an RNA intermediate genome. They are one of the most commonly used vectors for clinical gene therapy of inherited disorders such as immune deficiencies and cancer immunotherapy. One of the most difficult challenges facing their widespread application to patients is the large-scale production of highly pure vector stocks. To improve vector production and downstream purification, there has been a recent investment in the United Kingdom to establish good manufacturing process (GMP)-licensed centers for manufacture and quality control. Other requirements for these vectors include their target cell specificity and tropism, how to regulate gene expression of the therapeutic payload and their potential side effects. Comprehensive detail on the full nucleic acid content of LV is unknown, even though they have entered clinical trials. With potential adverse effects in mind, it is important to identify these contents to assess their safety and purity. In this study, we used highly sensitive PacBio long-distance, next-generation sequencing of reverse-transcribed vector component RNA to investigate the nucleic acid composition of recombinant HIV-1 particles generated by human 293T packaging cells. In this article, we describe our findings of nucleic acids other than the recombinant vector genome that exist, which could potentially be delivered during gene transfer, and suggest that removal of these unwanted components be considered before clinical LV application.
bioRxiv  |  2025

The human immunoglobulin heavy chain constant gene locus is enriched for large complex structural variants and coding polymorphisms that vary in frequency among human populations

Uddalok Jana, Oscar L. Rodriguez, William Lees, Eric Engelbrecht, Zach Vanwinkle, Ayelet Peres, William S. Gibson, Kaitlyn Shields, Steven Schultze, Abdullah Dorgham, Matthew Emery, Gintaras Deikus, Robert Sebra, Evan E. Eichler, Gur Yaari, Melissa L. Smith, Corey T. Watson

The immunoglobulin heavy chain constant (IGHC) domain of antibodies (Ab) is responsible for effector functions critical to Ab-mediated immunity. In humans, this domain is encoded by genes within the IGHC locus, where descriptions of genomic diversity remain incomplete. To address this, we utilized long-read genomic datasets to build a high-quality IGHC haplotype/variant catalog from 105 individuals of diverse ancestry, and developed a high-throughput approach for targeted long-read IGHC locus sequencing and assembly. From locally phased assemblies, we discovered previously uncharacterized single nucleotide variants (SNVs) and complex structural variants (SVs, n=7), as well as novel genes and alleles. Of the 262 identified IGHC coding alleles, 235 (89.6%) were undocumented. SNV, SV, and gene allele/genotype frequencies revealed significant population differentiation, including (i) hundreds of SNVs in African and East Asian populations exceeding a fixation index (FST) of 0.3, and (ii) an IGHG4 haplotype carrying specific coding variants uniquely enriched in East and South Asian populations. Our results illuminate missing signatures of haplotype diversity in the IGHC locus, including evidence of natural selection, and establish a new foundation for investigating IGHC germline variation and its role in Ab function and disease.
Oxford Academic  |  2025

Long and Accurate: How HiFi Sequencing is Transforming Genomics

Bo Wang, Peng Jia, Shenghan Gao, Huanhuan Zhao, Gaoyang Zheng, Linfeng Xu, Kai Ye

Recent developments in PacBio high-fidelity (HiFi) sequencing technologies have transformed genomic research, with circular consensus sequencing now achieving 99.9% accuracy for long (up to 25 kb) single-molecule reads. This method circumvents biases intrinsic to amplification-based approaches, enabling thorough analysis of complex genomic regions [including tandem repeats, segmental duplications, ribosomal DNA (rDNA) arrays, and centromeres] as well as direct detection of base modifications, furnishing both sequence and epigenetic data concurrently. This has streamlined a number of tasks including genome assembly, variant detection, and full-length transcript analysis. This review provides a comprehensive overview of the applications and challenges of HiFi sequencing across various fields, including genomics, transcriptomics, and epigenetics. By delineating the evolving landscape of HiFi sequencing in multi-omics research, we highlight its potential to deepen our understanding of genetic mechanisms and to advance precision medicine.
bioRxiv  |  2025

CiFi: Accurate long-read chromatin conformation capture with low-input requirements

Sean P. McGinty, Gulhan Kaya, Sheina B. Sim, Renée L. Corpuz, Michael A. Quail, Mara K. N. Lawniczak, Scott M. Geib, Jonas Korlach, Megan Y. Dennis

By coupling chromatin conformation capture (3C) with PacBio HiFi long-read sequencing, we have developed a new method (CiFi) that enables analysis of genome interactions across repetitive genomic regions with low-input requirements. CiFi produces multiple interacting concatemer segments per read, facilitating genome assembly and scaffolding. Together, the approach enables genomic analysis of previously recalcitrant low-complexity loci, and of small organisms such as single insect individuals.
Nature  |  2025

A Near Complete Genome Assembly of the Oshima Cherry Cerasus speciosa

Kazumichi Fujiwara, Atsushi Toyoda, Bhim B. Biswa, Takushi Kishida, Momi Tsuruta, Yasukazu Nakamura, Noriko Kimura, Shoko Kawamoto, Yutaka Sato, Toshio Katsuki, Sakura 100 Genome Consortium & Tsuyoshi Koide

The Oshima cherry (Cerasus speciosa), which is endemic to Japan, has significant cultural and horticultural value. In this study, we present a near complete telomere-to-telomere genome assembly for C. speciosa, derived from the old growth “Sakurakkabu” tree on Izu Oshima Island. Using Illumina short-read, PacBio long-read, and Hi-C sequencing, we constructed a 269.3 Mbp genome assembly with a contig N50 of 32.0 Mbp. We examined the distribution of repetitive sequences in the assembled genome and identified regions that appeared to be centromeric. Detailed structural analysis of these putative centromeric regions revealed that the centromeric regions of C. speciosa comprised repetitive sequences with monomer lengths of 166 or 167 bp. Comparative genomic analysis with Prunus sensu lato genome revealed structural variations and conserved syntenic regions. This high-quality reference genome provides a crucial tool for studying the genetic diversity and evolutionary history of Cerasus species, facilitating advancements in horticultural research and the preservation of this iconic species.
Nature  |  2025

Seasonal recurrence and modular assembly of an Arctic pelagic marine microbiome

Taylor Priest, Ellen Oldenburg, Ovidiu Popa, Bledina Dede, Katja Metfies, Wilken-Jon von Appen, Sinhué Torres-Valdés, Christina Bienhold, Bernhard M. Fuchs, Rudolf Amann, Antje Boetius & Matthias Wietz

Deciphering how microbial communities are shaped by environmental variability is fundamental for understanding the structure and function of ocean ecosystems. While seasonal environmental gradients have been shown to structure the taxonomic dynamics of microbiomes over time, little is known about their impact on functional dynamics and the coupling between taxonomy and function. Here, we demonstrate annually recurrent, seasonal structuring of taxonomic and functional dynamics in a pelagic Arctic Ocean microbiome by combining autonomous samplers and in situ sensors with long-read metagenomics and SSU ribosomal metabarcoding. Specifically, we identified five temporal microbiome modules whose succession within each annual cycle represents a transition across different ecological states. For instance, Cand. Nitrosopumilus, Syndiniales, and the machinery to oxidise ammonia and reduce nitrite are signatures of early polar night, while late summer is characterised by Amylibacter and sulfur compound metabolism. Leveraging metatranscriptomes from Tara Oceans, we also demonstrate the consistency in functional dynamics across the wider Arctic Ocean during similar temporal periods. Furthermore, the structuring of genetic diversity within functions over time indicates that environmental selection pressure acts heterogeneously on microbiomes across seasons. By integrating taxonomic, functional and environmental information, our study provides fundamental insights into how microbiomes are structured under pronounced seasonal changes in understudied, yet rapidly changing polar marine ecosystems.
Nature  |  2025

Synchronized long-read genome, methylome, epigenome and transcriptome profiling resolve a Mendelian condition

Mitchell R. Vollger, Jonas Korlach, Kiara C. Eldred, Elliott Swanson, Jason G. Underwood, Stephanie C. Bohaczuk, Yizi Mao, Yong-Han H. Cheng, Jane Ranchalis, Elizabeth E. Blue, Ulrike Schwarze, Katherine M. Munson, Christopher T. Saunders, Aaron M. Wenger, Aimee Allworth, Sirisak Chanprasert, Brittney L. Duerden, Ian Glass, Martha Horike-Pyne, Michelle Kim, Kathleen A. Leppig, Ian J. McLaughlin, Jessica Ogawa, Elisabeth A. Rosenthal, University of Washington Center for Rare Disease Research, Undiagnosed Diseases Network, …Andrew B. Stergachis

Resolving the molecular basis of a Mendelian condition remains challenging owing to the diverse mechanisms by which genetic variants cause disease. To address this, we developed a synchronized long-read genome, methylome, epigenome and transcriptome sequencing approach, which enables accurate single-nucleotide, insertion–deletion and structural variant calling and diploid de novo genome assembly. This permits the simultaneous elucidation of haplotype-resolved CpG methylation, chromatin accessibility and full-length transcript information in a single long-read sequencing run. Application of this approach to an Undiagnosed Diseases Network participant with a chromosome X;13-balanced translocation of uncertain significance revealed that this translocation disrupted the functioning of four separate genes (NBEA, PDK3, MAB21L1 and RB1) previously associated with single-gene Mendelian conditions. Notably, the function of each gene was disrupted via a distinct mechanism that required integration of the four ‘omes’ to resolve. These included fusion transcript formation, enhancer adoption, transcriptional readthrough silencing and inappropriate X-chromosome inactivation of autosomal genes. Overall, this highlights the utility of synchronized long-read multi-omic profiling for mechanistically resolving complex phenotypes.
Nature  |  2025

Evaluating the efficiency of 16S-ITS-23S operon sequencing for species level resolution in microbial communities

Rapid advancements in long-read sequencing have facilitated species-level microbial profiling through full-length 16S rRNA sequencing (~ 1500 bp), and more notably, by the newer 16S-ITS-23S ribosomal RNA operon (RRN) sequencing (~ 4500 bp). RRN sequencing is emerging as a superior method for species resolution, exceeding the capabilities of short-read and full-length 16S rRNA sequencing. However, being in its early stages of development, RRN sequencing has several underexplored or understudied elements, highlighting the need for a critical and thorough examination of its methodologies. Key areas that require detailed analysis include understanding how primer pairs, sequencing platforms, and classifiers and databases affect the accuracy of species resolution achieved through RRN sequencing. Our study addresses these gaps by evaluating the effect of primer pairs using four RRN primer combinations, and that of sequencing platforms by employing PacBio and Oxford Nanopore Technologies (ONT) systems. Furthermore, two classification methods (Minimap2 and OTU clustering), in combination with four RRN reference databases (MIrROR, rrnDB, and two versions of GROND) were compared to identify consistent and accurate classification methods with RRN sequencing. Here we demonstrate that RRN primer pair choice and sequencing platform do not substantially bias taxonomic profiles for most of the tested mock communities, while classification methods significantly impact the accuracy of species-level assignments. Of the classification methods tested, Minimap2 classifier in combination with the GROND database most consistently provided accurate species-level classification across the communities tested, irrespective of sequencing platform.
Clinical Genetics  |  2025

Haplotype Phasing of Biallelic WNT10B Variants Using Long-Read Sequencing in Split-Hand/Foot Malformation Syndrome

Jelena Pozojevic, Naseebullah Kakar, Henrike L. Sczakiel, Nathalie Kruse, Kristian Händler, Saranya Balachandran, Varun Sreenivasan, Martin A. Mensah, Malte Spielmann

Split-hand/foot malformation syndrome (SHFM) is a congenital limb malformation that is both clinically and genetically heterogeneous. Variants in WNT10B are known to cause an autosomal recessive form of SHFM. Here, we report a patient born to unrelated parents who was found to be a compound heterozygote for missense variants in WNT10B: c.994C>T, p.(Arg332Trp) and c.638T>G, p.(Phe213Cys). The variants were identified using long-read PacBio sequencing, which enabled phasing and confirmed that they were located on different alleles. The maternally inherited variant p.(Arg332Trp) has been previously reported, whereas the paternally inherited variant p.(Phe213Cys) is novel and absent from the gnomAD database. Our findings highlight the utility of long-read haplotype phasing, which provides valuable insights in determining the biallelic nature of variants in recessive disorders when parental DNA samples are unavailable.
bioRxiv  |  2025

Long-Read Low-Pass Sequencing for High-Resolution Trait Mapping

Kendall Lee, Walid Korani, Philip C. Bentz, Sameer Pokhrel, Peggy Ozias-Akins, Alex Harkess, Justin Vaughn, Josh Clevenger

Accelerating crop improvement is critical to meeting food security demands in a changing climate. Long-read sequencing offers advantages over short-reads in resolving structural variations (SVs) and aligning to complex genomes, but its high cost has limited adoption in breeding programs. Here we develop a high-throughput, scalable approach for long-read low-pass (LRLP) sequencing and variant analysis with PacBio HiFi reads, and apply it to trait mapping in a complex tetraploid peanut (Arachis hypogaea) genome multi-parent advanced generation intercross. We analyze LRLP using both a single reference genome and a pangraph, using both proprietary and open-source tools to analyze SVs and coverage. An increased number of variants are consistently called for LRLP data compared to short-read data. At 1.63x average depth, LRLP sequencing covered 55% of the genome and 58% of gene space, outperforming 1.68x depth short-read low-pass sequencing, which achieved only 17% and 11%, respectively. Enhanced data retention after filtering for probabilistic misalignment and an ∼8.5x decrease in cost per value further demonstrated LRLP’s efficacy. Our results highlight LRLP sequencing as a scalable, cost-effective tool for high-resolution trait mapping, with transformative potential for plant breeding and broader genomic applications.
bioRxiv  |  2025

Epigenetic phase variation in the gut microbiome enhances bacterial adaptation

Mi Ni, Yu Fan, Yujie Liu, Yangmei Li, Wanjin Qiao, Lauren E. Davey, Xue-Song Zhang, Magdalena Ksiezarek, Edward Mead, Alan Touracheau, Wenyan Jiang, Martin J. Blaser, Raphael H. Valdivia, Gang Fang

The human gut microbiome within the gastrointestinal tract continuously adapts to variations in diet, medications, and host physiology. A central strategy for genetic adaptation is epigenetic phase variation (ePV) mediated by bacterial DNA methylation, which can regulate gene expression, enhance clonal heterogeneity, and enable a single bacterial strain to exhibit variable phenotypic states. Genome-wide and site-specific ePV have been well characterized in human pathogens’ antigenic variation and virulence factor production. However, the role of ePV in facilitating adaptation within the human microbiome remains poorly understood. Here, we comprehensively cataloged genome-wide and site-specific ePV in human infant and adult gut microbiomes. First, using long-read metagenomic sequencing, we detected genome-wide ePV mediated by complex structural variations of DNA methyltransferases, highlighting the ones associated with antibiotics or fecal microbiota transplantation. Second, we analyzed an extensive collection of public short-read metagenomic sequencing datasets, uncovering a greater prevalence of genome-wide ePV in the human gut microbiome. Third, we quantitatively detected site-specific ePVs using single-molecule methylation analysis to identify dynamic variations associated with antibiotic treatment or probiotic engraftment. Finally, we performed an in-depth assessment of an Akkermansia muciniphila isolate from an infant, highlighting that ePV can regulate gene expression and enhance the bacterial adaptive capacity by employing a bet-hedging strategy to increase tolerance to differing antibiotics. Our findings indicate that epigenetic modifications are a common and broad strategy used by bacteria in the human gut to adapt to their environment.
bioRxiv  |  2025

Detailed tandem repeat allele profiling in 1,027 long-read genomes reveals genome-wide patterns of pathogenicity

Matt C. Danzi, Isaac R. L. Xu, Sarah Fazal, Egor Dolzhenko, David Pellerin, Ben Weisburd, Chloe Reuter, Jacinda Sampson, Chiara Folland, Matthew Wheeler, Anne O’Donnell-Luria, Stefan Wuchty, Gianina Ravenscroft, Michael A. Eberle, All of Us Research Program Long Read Working Group, Stephan Zuchner

Tandem repeats are a highly polymorphic class of genomic variation that play causal roles in rare diseases but are notoriously difficult to sequence using short-read techniques [1,2]. Most previous studies profiling tandem repeats genome-wide have reduced the description of each locus to the singular value of the length of the entire repetitive locus [3,4]. Here we introduce a comprehensive database of 3.6 billion tandem repeat allele sequences from over one thousand individuals using HiFi long-read sequencing. We show that the previously identified pathogenic loci are among the most variable tandem repeat loci in the genome, when incorporating nucleotide resolution sequence content to measure the longest pure motif segment. More broadly, we introduce a novel measure, ‘tandem repeat constraint’, that assists in distinguishing potentially pathogenic from benign loci. Our approach of measuring variation as ‘the length of the longest pure segment’ successfully prioritizes pathogenic repeats within their previously published linkage regions. We also present evidence for two novel pathogenic repeat expansion candidates. In summary, this analysis significantly clarifies the potential for short tandem repeat pathogenicity at over 1.7 million tandem repeat loci and will aid the identification of disease-causing repeat expansions.
Genome Research  |  2024

Evaluation of strategies for evidence-driven genome annotation using long-read RNA-seq

Alejandro Paniagua, Cristina Agustin-García, Francisco J Pardo-Palacios, Thomas Brown, Maite De Maria, Nancy D Denslow, Camila Mazzoni and Ana Conesa

While the production of a draft genome has become more accessible due to long-read sequencing, the annotation of these new genomes has not been developed at the same pace. Long-read RNA sequencing (lrRNA-seq) offers a promising solution for enhancing gene annotation. In this study, we explore how sequencing platforms, Oxford Nanopore R9.4.1 chemistry or PacBio Sequel II CCS, and data processing methods influence evidence-driven genome annotation using long reads. Incorporating PacBio transcripts into our annotation pipeline significantly outperformed traditional methods, such as ab initio predictions and short-read-based annotations. We applied this strategy to a nonmodel species, the Florida manatee, and compared our results to existing short-read-based annotation. At the loci level, both annotations were highly concordant, with 90% agreement. However, at the transcript level, the agreement was only 35%. We identified 4,906 novel loci, represented by 5,707 isoforms, with 64% of these isoforms matching known sequences in other mammalian species. Overall, our findings underscore the importance of using high-quality curated transcript models in combination with ab initio methods for effective genome annotation.
bioRxiv  |  2024

An Emirati pangenome incorporating a diploid telomere-to-telomere reference

Michael Olbrich, Mira Mousa, Inken Wohlers, Amira Al Aamri, Halima Alnaqbi, Aisha Hanaya Alsuwaidi, Hima Vadakkeveettil Manoharan, Nour al-dain Marzouka, Sanjay Erathodi Ramachandran, Anju Annie Thomas, Mohammed Alameri, Guan K Tay, Rifat Hamoudi, Saleh Ibrahim, Noura Al Ghaithi, Habiba Alsafar

Reference data on genomic variation forms the basis of genetics research. Limitations in identifying genetic variation from single reference sequences have recently been addressed through improvements in sequencing technologies, allowing the generation of pangenomic references from multiple accurate chromosome-level de novo assemblies. Nevertheless, global pangenomes to date have yet to include genomes from the populations of the Middle Eastern Region. To address this shortcoming, this study provides an Emirati genome reference. Its core is a diploid assembly with a Quality Value (QV) of 60 that includes ten telomere-to-telomere chromosomes. This assembly is incorporated into a pangenome graph constructed of 52 additional high-quality assemblies, half of which are trio-based. This Emirati pangenome reveals a similar level of genomic variation as the one compiled by the Human Pangenome Reference Consortium, underscoring its utility for the identification of both global and population-centered genomic variation, even in genome regions that have been traditionally challenging to assemble but are covered by the Emirati telomere-to-telomere assembly. As such, the Emirati genome reference significantly contributes to genomic research globally and is an essential resource for genomics-based personalized medicine in the United Arab Emirates and other parts of the Middle East.