Cortex calling regions with no reads as indels
Hi Zam, I believe you have previously been in contact with Phil Lobb from the UKHSA TB detection team surrounding our implementation of clockwork. Thank you for helping us understand the choice to use diploid settings for Cortex and BCFTools mpileup. Unfortunately, we are now having a different issue with cortex as part of our clockwork implementation. To give some background, our minos vcf contains variants which have an AD value of 0, which cause downstream processing errors with tb-profiler. This is an example from the minos vcf for one sample, where two G-->C variants appear with AD=0:
NC_000962.3 1414935 . G C . PASS . GT:AD:COV:DP:DPF:FRS:GT_CONF:GT_CONF_PERCENTILE 1/1:0,0:0,105:105:1.3202:1.0:674.53:77.66 NC_000962.3 1414937 . G C . PASS . GT:AD:COV:DP:DPF:FRS:GT_CONF:GT_CONF_PERCENTILE 1/1:0,0:0,105:105:1.3202:1.0:674.53:77.66
And here is the downstream error caused by this malformed vcf, which enables us to catch these instances:
2025-02-18 16:02:01,703 - DEBUG - {1414935: ['C'], 1414936: 'G', 1414937: ['C']} 2025-02-18 16:02:01,703 - DEBUG - CGC 2025-02-18 16:02:01,703 - DEBUG - <pysam.libcbcf.VariantRecordSample object at 0x2adbf5f5c3d0> 2025-02-18 16:02:01,703 - DEBUG - Reached AD 2025-02-18 16:02:01,703 - DEBUG - {'GGG': 0, 'CGC': 0} 2025-02-18 16:02:01,703 - DEBUG - Counter({'GGG': 0, 'CGC': 0}) 2025-02-18 16:02:01,703 - DEBUG - dp value 0
variant.info.update({'AF':count/dp})
~~~~~^~~
ZeroDivisionError: division by zero
These variants originate from the cortex vcf prior to minos adjudication, where a large indel is called by cortex at position 1414787, an indel that spans these variants. However, upon inspection of the bam file, we see that there are no reads at this position, so we are confused as to how cortex could make this call – I’ve attached the IGV plot of this position for the sample below.
Indeed, our analysis of 49 Mycobacterium africanum samples shows 32 samples have this type of AD=0 error in the minos vcf, and by further investigating these samples, it seems to consistently be the same cortex issue where cortex calls an indel instead of a deletion at a position with no reads. These problematic indels/deletions are found at different positions in the genome for all failing samples, and consistently not in regions characterised as divergent from Mycobacterium tuberculosis (https://pmc.ncbi.nlm.nih.gov/articles/PMC497617/). I’ve attached a couple more examples where this issue is arising below – each time, the blue bar represents an indel from cortex.
Do you have any ideas what could be causing cortex to call an indel in these positions where there are no reads? Thanks for all your help with our clockwork implementation. Kind regards, Matt
Hi there, coukd you show me the cortex call associated with those last 2 pileups? Are you saying cortex calls an indel which is not the deletion one would infer from that image?
Hi Zam, Thank you for your prompt response. Yes, in those two examples, cortex calls an indel instead of a deletion as you would predict. I’ve added the cortex call to the two IGV plots below:
I've also attached the cortex file for these two samples. These indels create variants with depths of zero in the minos vcf, which cause the downstream tb-profiler to crash out as below:
For sample 2025-02-18 16:03:10,681 - DEBUG - [GenomePosition(chrom='NC_000962.3', pos=2219760), GenomePosition(chrom='NC_000962.3', pos=2219761)] 2025-02-18 16:03:10,681 - DEBUG - ---------------------- 2025-02-18 16:03:10,681 - DEBUG - GG 2025-02-18 16:03:10,681 - DEBUG - adkasjdaosdjaoisd 2025-02-18 16:03:10,681 - DEBUG - ds----- 2025-02-18 16:03:10,681 - DEBUG - defaultdict(<class 'list'>, {2219760: ['G'], 2219761: ['G']}) 2025-02-18 16:03:10,681 - DEBUG - GG 2025-02-18 16:03:10,681 - DEBUG - <pysam.libcbcf.VariantRecordSample object at 0x2acc1bbb6020> 2025-02-18 16:03:10,681 - DEBUG - Reached AD 2025-02-18 16:03:10,681 - DEBUG - {'CA': 0, 'GG': 0} 2025-02-18 16:03:10,681 - DEBUG - Counter({'CA': 0, 'GG': 0}) 2025-02-18 16:03:10,681 - DEBUG - dp value 0 variant.info.update({'AF':count/dp}) ~~~~~^~~ ZeroDivisionError: division by zero
2025-02-18 16:02:12,341 - DEBUG - [GenomePosition(chrom='NC_000962.3', pos=2338464), GenomePosition(chrom='NC_000962.3', pos=2338466)] 2025-02-18 16:02:12,341 - DEBUG - ---------------------- 2025-02-18 16:02:12,341 - DEBUG - GG 2025-02-18 16:02:12,341 - DEBUG - adkasjdaosdjaoisd 2025-02-18 16:02:12,341 - DEBUG - ds----- 2025-02-18 16:02:12,341 - DEBUG - {2338464: ['G'], 2338465: 'C', 2338466: ['G']} 2025-02-18 16:02:12,341 - DEBUG - GCG 2025-02-18 16:02:12,341 - DEBUG - <pysam.libcbcf.VariantRecordSample object at 0x2b1a81ec8ca0> 2025-02-18 16:02:12,341 - DEBUG - Reached AD 2025-02-18 16:02:12,341 - DEBUG - {'CCA': 0, 'GCG': 0} 2025-02-18 16:02:12,341 - DEBUG - Counter({'CCA': 0, 'GCG': 0}) 2025-02-18 16:02:12,341 - DEBUG - dp value 0 variant.info.update({'AF':count/dp}) ~~~~~^~~ ZeroDivisionError: division by zero
Thanks for your help with this, Matt
15bb7215-9f35-450c-81d9-c9b34136272b.R.cortex.vcf.txt 17f2cfa8-a67f-4766-b038-244a8043f7b3.cortex.vcf.txt
Hi Zam,
Just following up on my previous question to see if you have any additional comments or thoughts about this issue.
Thank you again for your support with this.
Kind regards, Matt
Sorry for slow reply, will discuss with @martinghunt on Monday
It looks like cortex is calling a long alt allele, which it must have assembled from the reads. And then the reads map well to that alt allele, but not to the ref because it's too different from the reads.
What happens if you put that alt allele into the mapping ref in place of the ref allele, and then map the reads?
yes agree! Cortex will have assembled a long allele and the reads don't map to the reference so it looks like a deletion from the bam. Martin's idea of switching in the alt allele and mapping will confirm (or deny) this idea
Hi both, thank you for your suggestion.
I've replaced the original sequence with the long alternate allele called by cortex, and we now see reads spanning this position in the bam.
However, we're now confused by the nature of the alternative sequence called by clockwork. Aligning the consensus fastas from our pipelines against both the original and modified references shows no obvious pattern in the base called by clockwork. Sometimes, the consensus contains a base from the original reference, other times from the alternative allele, and often the assigned sequence appears random.
I've attached an image of our alignment of the two consensus fastas produced by our tb-pipelines (rows 1 and 2) against the original and modified references (rows 3 and 4 respectively). The alternative allele is inserted from position 2222075 (where sequences begin to diverge).
Please let us know if you have any thoughts about why clockwork is calling this seemingly random alternative sequence from this position.
Thanks, Matt
Hi there
A few things
- When you replace the ref-allele with the alt-allele for that long variant, and then map reads, how good does the depth look? does it all line up and look consistent with the left and right flanks?
- I think your fundamental question is - why is cortex calling this long allele that is nothing like the reference? Is that right? And following that - why does it look nothing like the reference? Well - cortex is an assembler, so it has constructed it by assembling reads, effectively into a contig that fits there in that context of the reference. As for why it doesn't look at all similar - this is what diverged regions look like I guess. It happens in many species - we started looking into a region like this ion P falciparum, where 1kb of the genome had 2 very different haplotypes/alleles, and when you aligned them, there was on average one SNP every 3bp for 1kb - but basically some subregions looked very different. If the region comes from elsewhere (eg by gene conversion, which we know does happen in Mtb) or for some reason there's no reason why it should look too similar.
I think the first thing to do is convince ourselves what the true sequence is in that region, and when we can interpret afterwards. Would you be able to share the sample and the coords with me and @martinghunt ?
Right - we've had some conversations offline also, let me summarise my understanding.
- The cortex calls look correct - these are regions which are highly diverged from the reference, so you get a long reference allele, and a long alternate allele, and if you were to align those two, there would be loads of SNPs in a small space. In this case, the two alleles are slightly different lengths.
- Minos splits long ref/alt alleles into sets of SNPs and indels. As far as we know this was all correct too - but there is an argument that the Cortex output is more "correct" and more useful. It all depends on what you're doing, whether you prefer the two long alleles, which makes it clear this is a diverged locus, or loads of SNPs and indels.
- However, clockwork can also outputs a couple of other things purely for building a phylogeny/find Neighbour. These are a) a gvcf, which covers every base in the genome. This takes records from the minos VCF where they exist, and for other places, uses the samtools VCF . b) an inferred/consensus fasta, which discards indels, is always the same length, and is purely for the phylogeny.
The bottom line becomes
- If you want to do resistance prediction, definitely do not use that inferred fasta, as it ignores indels.
- If you are worried about detecting SNPs in the diverged region, I sympathise, this is why we have developed graph-based tools. But for TB, there are (afaik) no resistance mutations in such regions. This is a major issue for all other bacteria, as it's essentially the pangenome problem.