In eukaryotic organisms, the majority of genes are alternatively spliced to produce multiple transcript isoforms. Gene regulation through alternative splicing can dramatically increase the protein-coding potential of a genome. Therefore, understanding the functional biology of a genome requires knowing the full complement of isoforms. Microarrays and high-throughput cDNA sequencing are useful tools for studying transcriptomes, yet these technologies provide only small snippets of transcripts. Accurately reconstructing complete transcripts to study gene isoforms has been challenging [1, 2].
The Iso-Seq method produces full-length transcripts using Single Molecule, Real-Time (SMRT) sequencing [3]. Long read lengths enable sequencing of full-length transcripts of 10 kb or longer, eliminating the need for transcript assembly or inference. The Iso-Seq bioinformatics pipeline, which is freely available through SMRT Analysis, further processes the data into high-quality consensus transcript sequences that enable accurate isoform annotation and open reading frame prediction [4].
Since it does not require a reference genome or existing annotation, the Iso-Seq method has been widely adopted by the scientific community to analyze a variety of important agricultural crops and animals such as coffee, cotton, maize, rabbit, chicken, and many others. In all cases, the researchers discovered a much more diverse and complex transcriptome than previously understood. For example, Kuo et al. expanded the chicken annotation to ~64,000 transcripts, of which ~21,000 were novel lncRNAs not annotated in Ensembl. In another case, Wang et al. were able to expand and correct the maize B73 genome annotation, including the discovery of 867 novel lncRNA transcripts.
The ability to unambiguously determine the full exonic structure of complex genes, with no assembly required, also makes the Iso-Seq method attractive for the study of human diseases. Kohli et al. characterized androgen receptor (AR) isoforms in castration-resistant prostate cancer and showed that one novel isoform, AR-V9, was co-expressed with AR-V7 and predictive of drug resistance. Tseng et al. discovered novel splice patterns in the FMR1 gene in premutation carriers for Fragile X-associated tremor/ataxia syndrome that were not detected in the control group.
Perhaps surprisingly, after the Iso-Seq dataset for the MCF-7 breast cancer cell line was released to the public [5], this well-studied sample was found to contain additional cancer fusion genes, two new mitochondrial lncRNAs, and novel sample-specific transcripts. In a recently published study, Anvar et al. used this same deep MCF-7 dataset to show that there is widespread coupling of transcript features: more than 7,000 genes were found to have preferential coupling of 5’ start sites, exons, and polyadenylation sites. Such a study would not have been possible without the ability to precisely determine the starts and ends, as well as the splice junctions, of each transcript isoform.
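To illustrate why coupling between distant transcript features needs full-length reads, here is a minimal sketch in Python. The gene, exon names, and helper functions are hypothetical and are not taken from Anvar et al.; the sketch also assumes that a short read observes at most one splice junction at a time, which is a simplification.

```python
def junctions(isoform):
    """Return the set of adjacent-exon pairs (splice junctions) in one isoform."""
    return set(zip(isoform, isoform[1:]))


def junction_evidence(isoforms):
    """Union of junctions over a set of isoforms.

    This stands in for what junction-level (short-read) evidence can show
    under the simplifying assumption that each read spans one junction.
    """
    seen = set()
    for isoform in isoforms:
        seen |= junctions(isoform)
    return seen


# Hypothetical gene: two alternative start exons (S1, S2), two shared
# internal exons (E1, E2), and two alternative polyadenylation exons (P1, P2).
coupled = [("S1", "E1", "E2", "P1"), ("S2", "E1", "E2", "P2")]
uncoupled = [("S1", "E1", "E2", "P2"), ("S2", "E1", "E2", "P1")]

# Both scenarios produce exactly the same junctions...
same_junctions = junction_evidence(coupled) == junction_evidence(uncoupled)
# ...yet the full-length isoforms differ in which start pairs with which end.
same_isoforms = set(coupled) == set(uncoupled)

print("identical junction evidence:", same_junctions)
print("identical full-length isoforms:", same_isoforms)
```

In this toy case the two scenarios are indistinguishable at the junction level, while full-length reads separate them directly because each read carries its start site, exon chain, and polyadenylation site together.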
The Iso-Seq method is not limited to eukaryotes, however. Recently, a new protocol called SMRT-Cappable-seq was developed to sequence the E. coli transcriptome, resulting in a dramatic increase in the number of annotated operons and readthrough events for the bacterium. Similarly, the Iso-Seq method was used to discover new coding and antisense transcripts in the previously poorly annotated human cytomegalovirus.
Since the launch of the Iso-Seq protocol in SMRT Analysis in 2014, the analysis pipeline has seen several improvements. The new Iso-Seq2 protocol, released in SMRT Analysis 5.1 last month, improves both speed and transcript recovery [6]. More importantly, over the past five years the bioinformatics community has embraced the technology, sparking the development of additional tools. IsoCon, IDP, and IDP-denovo are error correction methods that work for targeted genes or hybrid data. Specialized long-read aligners such as minimap2 now support spliced alignment. Cupcake and TAMA are two lightweight tool suites for processing alignments. SQANTI categorizes Iso-Seq transcripts against an existing annotation and combines the results with short-read expression data. A growing list of community tools is maintained at the Iso-Seq wiki.
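As a rough illustration of what categorizing transcripts against an existing annotation involves, here is a minimal Python sketch. It is loosely modeled on the category names used by SQANTI (full-splice match, incomplete-splice match, novel in catalog, novel not in catalog), but it is not SQANTI's implementation: the function name, the input representation (a tuple of donor/acceptor coordinate pairs on a single gene and strand), and the simplified rules are all assumptions made for this example.

```python
def classify(query, reference_chains):
    """Classify one transcript's splice-junction chain against a reference.

    query            -- tuple of (donor, acceptor) pairs, in transcript order
    reference_chains -- list of such tuples, one per annotated isoform
    Hypothetical helper; coordinates are assumed to be on one gene and strand.
    """
    if not query:
        return "mono-exonic"

    # Exact match of the whole junction chain to an annotated isoform.
    if any(query == ref for ref in reference_chains):
        return "full-splice match"

    # Query is a contiguous sub-chain of an annotated isoform.
    n = len(query)
    for ref in reference_chains:
        if any(ref[i:i + n] == query for i in range(len(ref) - n + 1)):
            return "incomplete-splice match"

    # New combination built entirely from annotated splice sites.
    known = {junction for ref in reference_chains for junction in ref}
    known_donors = {donor for donor, _ in known}
    known_acceptors = {acceptor for _, acceptor in known}
    if all(d in known_donors and a in known_acceptors for d, a in query):
        return "novel in catalog"

    # At least one splice site is absent from the annotation.
    return "novel not in catalog"


# Hypothetical annotation with a single four-exon isoform.
reference = [((100, 200), (300, 400), (500, 600))]

for chain in [
    ((100, 200), (300, 400), (500, 600)),  # same chain as the annotation
    ((300, 400), (500, 600)),              # partial chain
    ((100, 400), (500, 600)),              # skips an exon using known sites
    ((100, 250), (300, 400)),              # uses an unannotated acceptor
]:
    print(chain, "->", classify(chain, reference))
```

A real workflow would also need to handle strand, multiple genes, fuzzy transcript ends, and fusion or antisense cases, which this sketch deliberately leaves out.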
We encourage our users to continue finding new ways to utilize full-length transcript sequencing with PacBio and contribute to exciting biological discoveries!
Select publications:
- Long-read sequencing of the coffee bean transcriptome reveals the diversity of full-length transcripts. GigaScience 1–13 (2017). doi:10.1093/gigascience/gix086
- Wang, M. et al. A global survey of alternative splicing in allopolyploid cotton: landscape, complexity and regulation. New Phytol 217, 163–178 (2017).
- Wang, B. et al. Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing. Nat Comms 7, 11708 (2016).
- Chen, S.-Y., Deng, F., Jia, X., Li, C. & Lai, S.-J. A transcriptome atlas of rabbit revealed by PacBio single-molecule long-read sequencing. Sci. Rep. 7, 1–10 (2017).
- Kuo, R. I. et al. Normalized long read RNA sequencing in chicken reveals transcriptome complexity similar to human. BMC Genomics 18, 1–19 (2017).
- Kohli, M. et al. Androgen receptor variant AR-V9 is coexpressed with AR-V7 in prostate cancer metastases and predicts abiraterone resistance. Clin Cancer Res 23, 1–13 (2017).
- Tseng, E., Tang, H.-T., AlOlaby, R. R., Hickey, L. & Tassone, F. Altered expression of the FMR1 splicing variants landscape in premutation carriers. BBA – Gene Regulatory Mechanisms 1860, 1117–1126 (2017).
- Weirather, J. L. et al. Characterization of fusion genes and the significantly expressed fusion isoforms in breast cancer by hybrid sequencing. Nucleic Acids Research 43, e116–e116 (2015).
- Gao, S. et al. Two novel lncRNAs discovered in human mitochondrial DNA using PacBio full-length transcriptome data. Mitochondrion 38, 41–47 (2018).
- Chakraborty, S. MCF-7 breast cancer cell line PacBio generated transcriptome has ~300 novel transcribed regions, un-annotated in both RefSeq and GENCODE, and absent in the liver, heart and brain transcriptomes. 1–8 (2017). doi:10.1101/100974
- Anvar, S. Y. et al. Full-length mRNA sequencing uncovers a widespread coupling between transcription initiation and mRNA processing. Genome Biol. 19, 1–18 (2018).
- Yan, B., Boitano, M., Clark, T. & Ettwiller, L. SMRT-Cappable-seq reveals complex operon variants in bacteria. bioRxiv 1–34 (2018). doi:10.1101/262964
- Balazs, Z. et al. Long-read sequencing of human cytomegalovirus transcriptome reveals RNA isoforms carrying distinct coding potentials. Sci. Rep. 1–9 (2017). doi:10.1038/s41598-017-16262-z
References and resources:
[1] Steijger, T. et al. Assessment of transcript reconstruction methods for RNA-seq. Nat Meth 10, 1177–1184 (2013).
[2] Angelini, C., Canditiis, D. & Feis, I. Computational approaches for isoform detection and estimation: good and bad news. BMC Bioinformatics 15, 135–43 (2014).
[3] Iso-Seq template preparation for Sequel systems
[4] Gordon, S. P. et al. Widespread polycistronic transcripts in fungi revealed by single-molecule mRNA sequencing. PLoS ONE 10, e0132628 (2015).
[5] PacBio MCF-7 blogpost: /blog/data-release-human-mcf-7-transcriptome/
[6] PacBio Iso-Seq GitHub: https://github.com/PacificBiosciences/IsoSeq_SA3nUP/