In a Nature Methods paper released today, scientists describe new bioinformatics tools for producing diploid genome assemblies from SMRT Sequencing reads. FALCON (Fast ALignment and CONsensus for assembly) and FALCON-Unzip were developed by PacBio scientists in collaboration with researchers at Johns Hopkins University, Cold Spring Harbor Laboratory, the Joint Genome Institute, and other institutions.
“Phased diploid genome assembly with single-molecule real-time sequencing” comes from lead authors Chen-Shan Chin and Paul Peluso, senior author Michael Schatz, and collaborators. In the publication, the team details how FALCON and FALCON-Unzip work and presents data from several validation studies of organisms including Arabidopsis, the Cabernet Sauvignon grape, and the diploid fungus Clavicorona pyxidata.
“Currently available genome assemblies rarely capture the heterozygosity present within a diploid or polyploid species,” Chin et al. write. “Most assemblers output a mosaic genome sequence that arbitrarily alternates between parental alleles.” That leads to a loss of important information about differences between homologous chromosomes. To address this issue, the team developed the diploid-aware FALCON assembler and FALCON-Unzip, a tool for resolving haplotypes. Both tools are open-source.
As the authors describe it, “The FALCON assembler follows the design of the hierarchical genome assembly process (HGAP) but uses more computationally optimized components.” FALCON builds a string graph with bubbles representing differences between paired chromosomes. “FALCON-Unzip identifies read haplotypes using phasing information from heterozygous positions that it identifies,” they add. The phased reads are used to construct contigs for both haplotypes as well as the unique sequence for each chromosome, resulting in a “final diploid assembly with phased single-nucleotide polymorphisms (SNPs) and structural variants (SVs).”
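To make the phasing idea concrete, here is a toy sketch of how reads might be sorted into two haplotype groups based on the alleles they carry at heterozygous positions. It is not the FALCON-Unzip implementation: the function name `phase_reads`, the input format (each read as a mapping from position to a 0/1 allele), and the greedy voting scheme are all assumptions made for illustration only.

```python
from collections import defaultdict


def phase_reads(reads):
    """Greedily split reads into two haplotype groups (toy example).

    reads: list of dicts mapping a heterozygous position to the allele
    (0 or 1) observed in that read. This input format is hypothetical.
    Returns a list with 0 or 1 for each read, in input order.
    """
    # votes[pos][allele] counts the evidence that haplotype 0 carries
    # `allele` at `pos`; haplotype 1 is assumed to carry the other allele.
    votes = defaultdict(lambda: [0, 0])
    assignment = []
    for read in reads:
        # Compare the read with what has been seen so far for haplotype 0.
        agree = sum(votes[pos][allele] for pos, allele in read.items())
        disagree = sum(votes[pos][1 - allele] for pos, allele in read.items())
        hap = 0 if agree >= disagree else 1
        assignment.append(hap)
        for pos, allele in read.items():
            # A read placed on haplotype 1 supports the opposite allele
            # on haplotype 0.
            votes[pos][allele if hap == 0 else 1 - allele] += 1
    return assignment


# Five overlapping reads covering four heterozygous positions.
reads = [
    {100: 0, 200: 1},
    {200: 0, 300: 0},
    {100: 0, 200: 1, 300: 1},
    {300: 0, 400: 1},
    {300: 1, 400: 0},
]
print(phase_reads(reads))
```

Because the sketch is greedy, its result depends on read order and on consecutive reads sharing at least one heterozygous position. A real phasing tool also has to cope with sequencing errors and with regions that lack heterozygous sites.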
The team assembled genomes from a trio of Arabidopsis plants to validate the accuracy of the haplotype separation, then applied the tools to the fungus and wine grape genomes. “In all three genomes that we studied, the FALCON/FALCON-Unzip assembly was two- to three-fold more contiguous than alternative long-read assemblers and 30- to >100-fold more contiguous than state-of-the-art short-read assemblers,” they report. In Arabidopsis, for instance, they were able to resolve haplotype chromosomes for almost the entire genome. In the V. vinifera grape, the diploid assembly revealed high variation rates in homologous regions, and in C. pyxidata it showed long stretches of much lower heterozygosity than expected.
This new view of genomes could have major implications for characterizing methylation, gene expression, and regulatory elements. “More systematic study of phased diploid references will expose the detailed cis-regulatory mechanisms of differential expression in diploid genomes to improve our general understanding of the biology beyond haploid genomes,” the scientists write. “Looking forward, we expect many new opportunities for understanding diploid and polyploid genomic diversity and its impact on genome annotation, gene regulation, and evolution.”
October 17, 2016 | General