At Cold Spring Harbor Laboratory, scientists used SMRT® Sequencing to decode one of the most challenging cancer genomes ever encountered. Along the way, they built a portfolio of open-access analysis tools that will help researchers everywhere make structural variation discoveries with long-read sequencing data.
When Mike Schatz realized a few years ago that his PacBio® System had reached the throughput needed to process human genomes, he decided to give it a real challenge: the incredibly complicated, massively rearranged SK-BR-3 breast cancer cell line. The genome consists of 80 chromosomes, and that’s just the tip of the complexity iceberg.
“We were really interested in sequencing a human genome that would be maximally impactful and that was aligned with our research interest in cancer genomes, where it’s been well documented that structural variations play a major role,” says Schatz, now an associate research professor of computer science at Johns Hopkins University and an adjunct associate professor of quantitative biology at Cold Spring Harbor Laboratory, where the analysis took place. He notes that despite its importance, structural variation has not been thoroughly studied because short-read sequencers cannot reliably identify these large genomic elements. “One of the really special properties about the PacBio Sequencer is, in addition to being able to call SNPs or small variants, we also get to look for large variants such as structural variation,” he says.
But as Schatz and his collaborators at Cold Spring Harbor Laboratory and the Ontario Institute for Cancer Research delved into this work, they realized that existing variant callers were tailored to short-read data. To make the most of the large amount of long-read information they were generating, the team wrote a suite of new analysis tools optimized for SMRT Sequencing data. “The tools catering to short-read data just aren’t made to capture the awesome information that we can now take advantage of,” says Maria Nattestad, a graduate student in Schatz’s lab who wrote several of the new algorithms. “Building our own tools was really the only way to go here.”
Those tools, which are especially important for understanding structural variation, are now being publicly released to fuel further SMRT Sequencing studies of human genomes. Also coming out soon is the team’s detailed analysis of the SK-BR-3 genome and transcriptome, which includes a high-quality assembly as well as a new understanding of gene fusions, the evolutionary history of this cell line, and more.
De novo sequencing and assembly were the first steps in making sense of the SK-BR-3 genome. With 72-fold SMRT Sequencing coverage, “we got an outstanding assembly of this genome even though it’s so complicated,” Schatz says, citing a contig N50 size of 2.5 Mb compared to a state-of-the-art short-read assembly with a contig N50 of just 3 kb. “That’s nearly a thousand-fold more contiguous going from short-read to long-read assemblies, and it’s through that improved assembly that the majority of structural variants were detected.”
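For readers unfamiliar with the metric, contig N50 is the contig length at which contigs of that size or longer hold at least half of the assembled bases. The Python sketch below is only an illustration of that standard definition, not the team's pipeline: the function name `contig_n50` and the toy contig lengths are made up for the example, and only the 2.5 Mb and 3 kb figures come from the article.

```python
def contig_n50(lengths):
    """Return the N50 of a collection of contig lengths (in bases).

    N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy assembly (not SK-BR-3 data): 300 bases in six contigs.
toy_contigs = [100, 80, 50, 30, 20, 20]
print(contig_n50(toy_contigs))

# Ratio of the two N50 values quoted in the article (2.5 Mb vs. 3 kb).
fold_improvement = 2_500_000 / 3_000
print(round(fold_improvement))
```

On the toy data, the two longest contigs (100 and 80 bases) already cover more than half of the 300 total bases, so the N50 is 80. The quoted N50 values differ by a factor of roughly 830, which is the basis for the "nearly a thousand-fold" comparison.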
Using custom-built analysis tools, including the variant callers Sniffles, by Schatz lab member Fritz Sedlazeck, and Assemblytics, by Nattestad, the scientists found more than 10,000 structural variants in the SK-BR-3 genome, ranging in size from 50 bases to millions of base pairs. Another major discovery involved meticulously characterizing the complicated process that led to the cell line’s HER2 oncogene amplification.
The team also used the Iso-Seq™ method to analyze the full transcriptome of SK-BR-3, finding as much complexity at the RNA level as they saw in the DNA. “In the Iso-Seq analysis, we see many tens of thousands of novel isoforms,” Schatz says. “That’s a really strong testament to the long reads, which fully capture an isoform in one sequence — unlike short reads, where you have to infer isoform structure.”
To learn more about the project, which included novel findings about gene fusions in cancer, check out the full case study.
April 13, 2016 | General