Clair3 icon indicating copy to clipboard operation
Clair3 copied to clipboard

No variant calling close to the edge of reads?

Open SebastianMeyer1989 opened this issue 2 years ago • 6 comments

Hello,

I am using clair3 for variant calling on Nanopore data (running with default parameters, excpt for --include_all_ctgs). I observed, that clair3 seems to not call possible variants within up to ~10-15 bases on the beginning/end of reads. I am sorry, if I overlooked something obvious, but is this generally done for all reads (because the quality normally gets worse towards the beginning/end of the reads)? Or are these actually quality-filtered and this only happens in reads with bad quality?

In either case, is it possible to turn this „filtering“ off (I did not find a parameter for that)?

As an example I attached an IGV Screenshot of such a case. Here you see three tracks of .vcf files (Barcode 11, 18 and 19) and three tracks of the corresponding .bam files (also barcode 11, 18 and 19) In the .bam file of barcode 18 there is an obvious variant at position 11 (C>G) with 95% Gs, which is not called in the .vcf file. Coverage and quality should be sufficient for a calling and I saw identical cases in my data in different positions all somewhere within ~15 bases of the beginning or end of reads.

IGV_clair3_not_calling_near_the_edge_v1

SebastianMeyer1989 avatar Sep 02 '22 09:09 SebastianMeyer1989

We need more details to conclude. Would it be possible for you to provide us a bam of this case?

aquaskyline avatar Sep 02 '22 13:09 aquaskyline

Thanks for the quick reply. I would like to, but the file seems to be too big (1gb) to upload it here. Can I send it to you another way? So you do not know of any standard filter, that excludes the first/last X bases of a read from calling by default?

SebastianMeyer1989 avatar Sep 02 '22 14:09 SebastianMeyer1989

It could be filters or just a decision of the calling network. Could you generate a small bam with a subset of reads aligned only to the position of interest. samtools view filename.bam EFAU004_0264:1-81 should do the job. We are interested in having a look at the reads to find out if it is a false negative by Clair3.

aquaskyline avatar Sep 02 '22 15:09 aquaskyline

Thanks. That works and the output is now small enough.

VRE_BC_18-new_FLAGs_EFAU004_00264_1-81.txt

If this is not significant enough, I can provide a few more cases.

Thanks again for looking into this.

SebastianMeyer1989 avatar Sep 05 '22 09:09 SebastianMeyer1989

It took us a while to do some experiments. The conclusion is that Clair3's framework cannot call variants in the head or tail 16bp of a sequence as reliable as those not in the head or tail 16bp of a sequence. We did experiments on four randomly chosen variants and padded either side of the input with zero at variable lengths to simulate forcing Clair3 to call variants in the head or tail 16bp of a sequence. The results are shown in the figure below. Significant variant quality score decay or failure was observed in all four cases. Having said that, it is not impossible for Clair3 to also identify possible variant signals in the head or tail 16bp of a sequence, but it'd be unreliable, especially with the observation that both ends of a sequence are usually with a depth coverage lower than the average (with an exception in amplicon sequencing). At the moment, I tend not to provide an option to force calling variants from the head or tail 16bp of a sequence. But a hack to this is simply adding 16bp of 'N' to the head and tail of a sequence (before sequence alignment) to push the whole sequence into the scope of Clair3.

Screen Shot 2022-09-14 at 12 41 47

aquaskyline avatar Sep 14 '22 04:09 aquaskyline

Thank you for your extensive analysis of the problem. This helps a lot! And also thanks for the suggestions how to bypass this. I will have to look into my data now and think about, how I want to solve this. :)

SebastianMeyer1989 avatar Sep 14 '22 07:09 SebastianMeyer1989