A new Nature Communications paper shows how scientists are using long-read PacBio sequencing technology to resolve some of the most complex regions of the human genome. In this case, lead author PingHsun Hsieh (@phhBenson), senior author Evan Eichler, and collaborators at the University of Washington resolved the TCAF gene locus and identified more than 100 kb of sequence that had been missing from the human reference genome.
Since the publication comes from the Eichler lab, it’s no surprise that the target genes in this project sit in a segmental duplication (SD) region. The TCAF genes, which encode TRP channel-associated factors related to thermal sensing in a type of neuron, “originated from an ancient gene duplication event at the basal of mammalian phylogeny and remained single-copy genes throughout much of their evolution,” the scientists report. In humans, duplications of this region in the past 1.7 million years have produced additional copies of TCAF1 and TCAF2.
Until now, this locus of the genome has remained intractable. “In the human reference genome GRCh38, TCAF1 and TCAF2 are embedded and span within a complex region of large, highly identical SDs (>99.5%) consisting of >250 thousand base pairs (kbp) in sequence and an annotated gap at chromosome 7q35,” the authors note.
Filling A Pesky Gap
In this project, the team paired PacBio sequencing with large-insert bacterial artificial chromosome clones to resolve the entire locus in eight humans as well as in chimpanzee, gorilla, and rhesus macaque. The work generated 15 haplotypes of the region, which the team also compared with ancient human genomes. They also used the Iso-Seq method to analyze gene expression in seven different tissues.
“We systematically explore the haplotype structure of the TCAF locus in order to study its diversity, annotate the genes, and infer its evolutionary history in the context of selection,” the scientists report.
“This study is one of the detailed genetic investigations of human-specific SDs shedding potential new insights into structural adaptations important in thermal regulation.”
Sequencing results revealed that TCAF paralogs were more than 99.7% identical, with sizes ranging from 10 kb to 60 kb. The team also filled that pesky gap in the human reference genome by identifying the missing 103,616 bp. They then turned to the haplotype structure of the region. While the non-human primates each had just one copy of TCAF1 and one of TCAF2, the 12 resolved human haplotypes varied considerably. “We identify five distinct haplogroups that carry one to three copies for the SD cassette, which range from 145–406 kbp in length,” they write.
Isoform Diversity Sheds Light on Ancestral Diversity
These haplotypes allowed the team to annotate the genes and analyze isoform diversity. Using the Iso-Seq method, they produced more than 480,000 full-length, non-chimeric transcripts from six human tissues and more than 50,000 from a chimpanzee cell line. In humans, they found considerably more isoform diversity for TCAF2 than for TCAF1.
Perhaps most strikingly, though, the scientists found evidence of contrary patterns of selection.
“Our data support a model of two distinct forces of natural selection possibly operating on the same locus over the last half million years of hominin evolution,” they report.
“We propose that diversifying or balancing selection is likely acting in at least some human populations, particularly out-of-African populations such as Native Americans, to maintain and expand haplotype and structural diversity.”
The ancient human samples told a different story.
“In contrast, Neanderthal and Denisovan show a paucity of genetic variation, and while the sample size is still limited, this observation is unlikely to change with the sequencing of additional archaic genomes,” the scientists add. “We hypothesize that positive selection has reduced genetic diversity at the TCAF locus in these archaic hominin lineages.”