cuteSV Inconsistent force calling with original discovery calling

Hi,

I am using latest cute SV (v2.0.0) and want to apply cuteSV to PacBio CLR data for population-scale analysis. I first run cuteSV on each sample and then do force calling on the merged SV set. However, I found that the force calling can be significantly different from the original discovery calling.

For example, the original discovery calling by cuteSV is:

chr1    180703  cuteSV.DEL.0    ACCCTAACCCTACCCTAACCCAACCCTAACCCAACCCTAACTCTAGCCCTAACCCTAACCCTAACCCTAACCCTAACCC A       37.4    PASS    PRECISE;SVTYPE=DEL;SVLEN=-78;END=180781;CIPOS=-64,64;CILEN=-12,12;RE=28;RNAMES=NULL;AF=0.3111;STRAND=+- GT:DR:DV:PL:GQ   0/1:62:28:37,0,362:37

While the force calling result is:

chr1    180703  cuteSV.DEL.0    ACCCTAACCCTACCCTAACCCAACCCTAACCCAACCCTAACTCTAGCCCTAACCCTAACCCTAACCCTAACCCTAACCC A       0       q5      IMPRECISE;SVTYPE=DEL;SVLEN=-78;END=180781;CIPOS=-84,84;CILEN=-6,6;RE=15;RNAMES=NULL;STRAND=+-;AF=0.0658 GT:DR:DV:PL:GQ   0/0:213:15:0,188,821:188

I did a global search and ~50% SVs have inconsistent DR + DV, while ~30% SVs have inconsistent DV. Please see the following link for a reproducible example on a small region of HG002: https://www.dropbox.com/s/t7qrvq4pqcmbbio/cuteSV.zip?dl=0

Code I use to generate the data is:

cuteSV \
--max_cluster_bias_INS 100 \
--diff_ratio_merging_INS 0.3 \
--max_cluster_bias_DEL 200 \
--diff_ratio_merging_DEL 0.5 \
--genotype \
-t 40 \
HG002.minimap2.bam \
GRCh38.fa \
discovery.vcf \
./

cuteSV \
--max_cluster_bias_INS 100 \
--diff_ratio_merging_INS 0.3 \
--max_cluster_bias_DEL 200 \
--diff_ratio_merging_DEL 0.5 \
--genotype \
-t 40 \
-Ivcf discovery.vcf \
HG002.minimap2.bam \
GRCh38.fa \
force_calling.vcf \
./

Thanks, Han

Sep 05 '22 11:09 Han-Cao

Hello, Han! @Han-Cao

Thanks for using cuteSV and giving such precise description and datasets. The inconsistency between force calling and discovery calling does exist in cuteSV. I separated the inconsistency into two parts.

Inconsistent DV.
The inconsistency on DV is due to different strategy between force calling and variant discovery calling. In force calling, the SV coordinate and length is available. And cuteSV compares the SV (in Ivcf) with SV signatures (in bam file) in a more strict way to extract the real signatures that support the SV. On the other hand, in discovery calling, we don't know these information for SVs and cuteSV clusters the corresponding signatures to generate the detected SV. Therefore, to make the cluster perform best, the strategy implemented is different from that in force calling, which would cluster a little more (or less) signatures to generate the SV. This would cause the difference between force calling and discovery calling. Also, the difference only caused by several signatures which are similar but not very alike with the SV, so we thought it would not affect much on our result.
Inconsistent DR+DV We have known that DR+DV represents the total alignment reads overlapped with the SV. In the origin discovery calling, we search in the bam file to count for the reads, which would cost much time. So we implement an early termination strategy to stop the searching in bam file under the premise that it wouldn't influence much on the genotype. But in fact, it would influence it to some extent. And in the new version of force calling (cuteSV 2.0.0), we improve the algorithm of counting alignment reads, which will report the precise number without the early termination. So the result in force calling would be more precise. Further, we are planning to implement the new strategy to the origin discovery calling. It is under development now, and I will mention you when it comes to release.

Thanks very much for giving such great example. I hope it will help.

Best, Shuqi

Sep 07 '22 05:09 Meltpinkg

Hi Shuqi (@Meltpinkg),

Thanks for your explanation. Now I understand why and how this happen, but I have some additional questions on it:

Actually I want to use INFO and FORMAT output from cuteSV in my downstream analysis. As some of the SVs have significant difference between the discovery calling and force calling (e.g., the SV I showed has AF=0.3 in discovery and AF=0.065 in force calling), the SV calling result from which method do you suggest to use? Based on your explanation, should I use force calling results as it is more strict and precise when counting DR and DV?
As some SVs can be filtered after force calling (like the one I showed), can I use force calling as a filtering strategy after the discovery calling before any downstream analysis?
For a more complicated question which may be out of the scope of cuteSV. I also used other SV callers to generate a merged call set and then applied force calling on it. If a SV called by cuteSV and other callers merged into a merged SV, the merged SV may have slightly different start position and length from cuteSV output after considering the results from all SV callers. As different callers use different algorithm to capture the signature and perform clustering, I guess the start position and length of merged SV may not that appropriate for cuteSV. In this scenario, whether I should use the results from the discovery calling or force calling?

By the way, cuteSV is really fast for me and I think it would be very helpful if I can have an option to use the new strategy. Thanks a lot for your great work!

Best, Han

Sep 07 '22 15:09 Han-Cao

Hello Han @Han-Cao

For the given example, the IGV around the deletion breakpoint is shown below. From the alignment, there are about 20 reads that has a deletion signature with the similar length around here. But there are over 200 reads in total around the breakpoint. So in my opinion, this SV has few reads supporting it compared with the total reads, and its genotype should be 0/0 and can be filtered.

And then, to test the influence on a ground truth, we benchmark the discovery result and force calling result on HG002 HiFi datasets from GiaB. The benchmark via Truvari considering high confidence region shows below:

# discovery
"precision": 0.9552,
"recall": 0.9289492791204232,
"f1": 0.9418917717921601,
"gt_concordance": 0.9802345058626466
# force calling
"precision": 0.9542942176870748,
"recall": 0.9313349237630951,
"f1": 0.9426747953138176,
"gt_concordance": 0.9809534417464915

From the benchmark, discovery and force calling doesn't vary much on the result.

Therefore, when it comes to the questions:

Yes, I suggest to use the result from force calling. Because force calling doesn't apply the early termination, it can count up the total DR. And AF is also calculated from DV and DR, so filtering with force calling results may be better.
Yes, you can. It is similar as applying the SV calling for two rounds, first round achieving the breakpoint, and second round achieving the genotype.
After merging SVs from different tools, the start position has high probability changing from the origin discovery calling. In my opinion, the ensemble calling has better perfomance than the discovery calling of a single caller. It can be seen as a revision for SV calling, integrating the advantages from various callers. So we usually use the ensemble result with the revised breakpoint to finish force calling, and use the results from force calling.

Best, Shuqi

Sep 08 '22 06:09 Meltpinkg

Hi Shuqi (@Meltpinkg),

Thank you very much for your nice explanation and benchmark, which totally solve my problem! I will use the force calling results now.

Best, Han

Sep 08 '22 13:09 Han-Cao

Hi Han (@Han-Cao),

We have updated the new version of cuteSV (v2.0.1). The DR calculating of discovery calling is also updated. You can achieve the newest version via git clone or conda. If there is any confusion about the new version, please feel free to contact me.

Best, Shuqi

Sep 15 '22 03:09 Meltpinkg

Hi Shuqi,

That's great! Thank you so much!

Best, Han

Sep 15 '22 06:09 Han-Cao

cuteSV cuteSV copied to clipboard

Inconsistent force calling with original discovery calling

cuteSV
cuteSV copied to clipboard