Clair3
Clair3 copied to clipboard
No variant calling close to the edge of reads?
Hello,
I am using clair3 for variant calling on Nanopore data (running with default parameters, excpt for --include_all_ctgs). I observed, that clair3 seems to not call possible variants within up to ~10-15 bases on the beginning/end of reads. I am sorry, if I overlooked something obvious, but is this generally done for all reads (because the quality normally gets worse towards the beginning/end of the reads)? Or are these actually quality-filtered and this only happens in reads with bad quality?
In either case, is it possible to turn this „filtering“ off (I did not find a parameter for that)?
As an example I attached an IGV Screenshot of such a case. Here you see three tracks of .vcf files (Barcode 11, 18 and 19) and three tracks of the corresponding .bam files (also barcode 11, 18 and 19) In the .bam file of barcode 18 there is an obvious variant at position 11 (C>G) with 95% Gs, which is not called in the .vcf file. Coverage and quality should be sufficient for a calling and I saw identical cases in my data in different positions all somewhere within ~15 bases of the beginning or end of reads.
We need more details to conclude. Would it be possible for you to provide us a bam of this case?
Thanks for the quick reply. I would like to, but the file seems to be too big (1gb) to upload it here. Can I send it to you another way? So you do not know of any standard filter, that excludes the first/last X bases of a read from calling by default?
It could be filters or just a decision of the calling network. Could you generate a small bam with a subset of reads aligned only to the position of interest. samtools view filename.bam EFAU004_0264:1-81
should do the job. We are interested in having a look at the reads to find out if it is a false negative by Clair3.
Thanks. That works and the output is now small enough.
VRE_BC_18-new_FLAGs_EFAU004_00264_1-81.txt
If this is not significant enough, I can provide a few more cases.
Thanks again for looking into this.
It took us a while to do some experiments. The conclusion is that Clair3's framework cannot call variants in the head or tail 16bp of a sequence as reliable as those not in the head or tail 16bp of a sequence. We did experiments on four randomly chosen variants and padded either side of the input with zero at variable lengths to simulate forcing Clair3 to call variants in the head or tail 16bp of a sequence. The results are shown in the figure below. Significant variant quality score decay or failure was observed in all four cases. Having said that, it is not impossible for Clair3 to also identify possible variant signals in the head or tail 16bp of a sequence, but it'd be unreliable, especially with the observation that both ends of a sequence are usually with a depth coverage lower than the average (with an exception in amplicon sequencing). At the moment, I tend not to provide an option to force calling variants from the head or tail 16bp of a sequence. But a hack to this is simply adding 16bp of 'N' to the head and tail of a sequence (before sequence alignment) to push the whole sequence into the scope of Clair3.
data:image/s3,"s3://crabby-images/153fd/153fde8924db186425da8095e2abcf2a7bf1a7fd" alt="Screen Shot 2022-09-14 at 12 41 47"
Thank you for your extensive analysis of the problem. This helps a lot! And also thanks for the suggestions how to bypass this. I will have to look into my data now and think about, how I want to solve this. :)