Scientists from the University of North Carolina at Chapel Hill, Duke University, and other institutions have teamed up to sequence an important region of the human genome that has until now proven impenetrable.
In a paper entitled “Genome Reference and Sequence Variation in the Large Repetitive Central Exon of Human MUC5AC,” published in the American Journal of Respiratory Cell and Molecular Biology, corresponding authors Wanda O’Neal and Judith Voynow along with their collaborators describe the use of Single Molecule, Real-Time (SMRT®) Sequencing to characterize a complex mucin exon.
MUC5AC, located on the short (p) arm of chromosome 11, encodes one of the two major secreted mucins found in our airways and contributes to gastrointestinal homeostasis through its activity in the gut. The gene is thought to be involved in a range of diseases, including cystic fibrosis, asthma, inflammatory bowel disease, tumor development, and more.
Despite its clearly important role, MUC5AC has been represented as a gap in both the Genome Reference Consortium (GRCh37) and human genome reference (hg19) sequences. The central exon was known to be complex, large, and highly repetitive. Previous efforts to characterize the region with Sanger or short-read sequencing technologies were unsuccessful; similar problems exist for other MUC genes. Without the accurate sequence of the exon, mucins have not been well represented in genome-wide, exome, or gene expression studies that require prior knowledge of a genetic region.
In this paper, the scientists report using the PacBio® sequencer along with high-fidelity, long PCR to obtain the first high-quality sequence data covering the entire span of the MUC5AC central exon. The region was sequenced from the DNA of four people: one healthy African-American male and three Caucasian females homozygous for cystic fibrosis. The sequence length ranged from ~10 to 12 kb across the four genomes, making the MUC5AC central exon one of the longest exons found in the human genome. The authors report that SMRT Sequencing “yielded long sequence reads and robust coverage that allowed for de novo sequence assembly spanning the entire repetitive region.” The assembly was validated by reduced representation SMRT Sequencing of native genomic DNA enriched for this region.
The characterization of the MUC5AC long exon structure from the reduced representation library supported the structure of the CysD domain and the variable number tandem repeat (VNTR) regions that were observed in the PCR amplicons, validating the fidelity of the assay. “The results demonstrated the presence of segmental duplications of CysD domains, insertions/deletions (indels) of tandem repeats, and single nucleotide variants,” they note. “Taken together, these data illustrate the successful utility of SMRT Sequencing long reads for de novo assembly of large repetitive sequences to fill the gaps in the human genome.” Extrapolating from this experience with the PacBio platform, the authors note that the approach they used to sequence MUC5AC for the first time could be useful for filling other gaps, particularly those with highly repetitive regions, in human and other genomes.
For more on structural findings in the human genome, don’t miss our ASHG workshop on Thursday, October 24th.