We describe sawfish, a structural variant (SV) caller for mapped high-quality long reads. The method emphasizes assembly of local SV haplotypes and their use in downstream sample merging and genotyping steps, improving accuracy over variant-focused approaches in both individual and joint-genotyping contexts.
Assessed against the GIAB draft SV benchmark based on the T2T-HG002-Q100 diploid assembly, sawfish shows substantial accuracy gains over pbsv and Sniffles2 on 33x HiFi WGS input, with an F1 score of 0.971 compared to 0.930 and 0.935 for pbsv and Sniffles2, respectively. This accuracy gain persists at lower depth: at 10x, the sawfish F1 score is 0.937, compared to 0.857 and 0.882 for pbsv and Sniffles2, respectively. For SVs in the GIAB Challenging Medically Relevant Genes benchmark, sawfish has a combined false positive and false negative count of 4, compared to 19 and 15 for pbsv and Sniffles2, respectively.
Sawfish also has higher genotype concordance in the Platinum Pedigree (CEPH-1463). Joint-genotyping accuracy was assessed on 10 HiFi WGS samples comprising the 2nd and 3rd pedigree generations, where the known inheritance pattern enables assessment of genotype accuracy. Among calls with high genotype quality, sawfish yields 27,811 concordant and 4,414 discordant SV alleles (86.3% concordance), with the concordant alleles representing 7.8 Mb of deleted and 11.9 Mb of inserted sequence. This substantially improves concordant allele count, concordant sequence length, and percent concordance over the next most concordant method, Sniffles2, which yields 20,519 concordant and 7,645 discordant alleles (72.9% concordance), with the concordant alleles representing 4.2 Mb of deleted and 5.6 Mb of inserted sequence.
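The concordance percentages above follow directly from the reported allele counts. As a minimal sketch, assuming percent concordance is defined as concordant alleles divided by the sum of concordant and discordant alleles (the helper name `concordance_pct` is ours, for illustration only, and is not part of sawfish), the arithmetic can be checked as follows:

```python
def concordance_pct(concordant: int, discordant: int) -> float:
    """Percent of assessed SV alleles that match the pedigree inheritance pattern.

    Assumed definition: concordant / (concordant + discordant) * 100.
    """
    total = concordant + discordant
    if total == 0:
        raise ValueError("no alleles to assess")
    return 100.0 * concordant / total


# Allele counts as reported in the text for high genotype-quality calls.
sawfish = concordance_pct(27_811, 4_414)
sniffles2 = concordance_pct(20_519, 7_645)

print(f"sawfish:   {sawfish:.1f}%")    # 86.3%
print(f"Sniffles2: {sniffles2:.1f}%")  # 72.9%
```

Under this definition, both reported percentages are reproduced from the counts to one decimal place.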
As additional improvements, our assembly-focused approach allows all calls to be made with single-base precision, enabling breakpoint insertion and homology annotation for all SV types. Sawfish also evaluates whether the read depth across large deletions and duplications is consistent with the depth expected under its GC-corrected depth model, improving precision for these large SV types. By combining high genotyping accuracy, detailed breakpoint modeling, and joint assessment of breakpoint evidence with read depth, sawfish offers improved options for WGS sample analysis with high-quality long reads.