Highly mutable RNA viruses such as influenza A virus, human immunodeficiency virus and hepatitis C virus exist in infected hosts as highly heterogeneous populations of closely related genomic variants. The presence of low-frequency variants with few mutations with respect to major strains may result in an immune escape, emergence of drug resistance, and an increase of virulence and infectivity. Next-generation sequencing technologies permit detection of sample intra-host viral population at extremely great depth, thus providing an opportunity to access low-frequency variants. Long read lengths offered by single-molecule sequencing technologies allow all viral variants to be sequenced in a single pass. However, high sequencing error rates limit the ability to study heterogeneous viral populations composed of rare, closely related variants. In this article, we present CliqueSNV, a novel reference-based method for reconstruction of viral variants from NGS data. It efficiently constructs an allele graph based on linkage between single nucleotide variations and identifies true viral variants by merging cliques of that graph using combinatorial optimization techniques. The new method outperforms existing methods in both accuracy and running time on experimental and simulated NGS data for titrated levels of known viral variants. For PacBio reads, it accurately reconstructs variants with frequency as low as 0.1%. For Illumina reads, it fully reconstructs main variants. The open source implementation of CliqueSNV is freely available for download at https://github.com/vyacheslav-tsivina/CliqueSNV
Journal: BioRxiv
DOI: 10.1101/264242
Year: 2018