A new paper out in PNAS details the usefulness of long reads for isoform sequencing. “Characterization of the human ESC transcriptome by hybrid sequencing” comes from lead author Kin Fai Au and senior author Wing Wong at Stanford University as well as a number of collaborators.
The authors detail the problem that they see with current RNA-seq studies: full-length mRNA isoforms, which average about 2,500 bases, cannot be captured with reads of just a few hundred bases. “We are still far from achieving the original goals of RNA-Seq analysis, namely the de novo discovery of genes, the assembly of gene isoforms, and the accurate estimation of transcript abundance at the gene or the isoform level,” Au et al. write. They note that isoform detection or prediction with short reads is even more difficult when the full set of possible isoforms is not known going into the project.
The scientists describe a new approach that combines short-read Illumina® and long-read PacBio® sequence data with a computational tool for predicting isoforms, offering a more comprehensive means of examining transcripts. They tested the method in a well-characterized line of human embryonic stem cells (hESCs) and validated their findings with follow-on qPCR and knockdown studies.
By adding SMRT® Sequencing to the study, the authors report direct detection of more than 8,000 full-length, RefSeq-annotated isoforms, as well as prediction of nearly 5,500 other isoforms using the Isoform Detection and Prediction computational tool. “Over one-third of these are novel isoforms, including 273 RNAs from gene loci that have not previously been identified,” the scientists write. They add that long noncoding RNAs are especially likely to be lost in short-read studies and that consequently there is “significant downward bias in the current strategy for genome-wide discovery” of these genetic elements.
The authors use one particular example to demonstrate the complexity of isoform analysis. “Several long reads with up to four junctions were mapped to the locus chr6:167,641,267–167,660,912 (hg19, the same below), where no annotated genes in RefSeq, Ensembl, UCSC Known Genes, or GENCODE are reported. The long reads indicated complex expression from this locus with at least three different isoforms transcribed,” they write.
“In our approach the error-corrected long reads are ideal for narrowing down the isoforms expressed in a sample, thus enabling much more reliable abundance quantification from [second-generation sequencing] reads,” report Au et al. They note that results from studies of their Isoform Detection and Prediction tool show that it is “effective in using the information from the PacBio long reads to significantly improve isoform identification.” In addition, these results “suggest that gene identification, even in well-characterized cell lines and tissues, is far from complete.”