clockwork icon indicating copy to clipboard operation
clockwork copied to clipboard

Cortex calling regions with no reads as indels

Open matthewgaskinsukhsa opened this issue 10 months ago • 9 comments

Hi Zam, I believe you have previously been in contact with Phil Lobb from the UKHSA TB detection team surrounding our implementation of clockwork. Thank you for helping us understand the choice to use diploid settings for Cortex and BCFTools mpileup. Unfortunately, we are now having a different issue with cortex as part of our clockwork implementation. To give some background, our minos vcf contains variants which have an AD value of 0, which cause downstream processing errors with tb-profiler. This is an example from the minos vcf for one sample, where two G-->C variants appear with AD=0:

NC_000962.3 1414935 . G C . PASS . GT:AD:COV:DP:DPF:FRS:GT_CONF:GT_CONF_PERCENTILE 1/1:0,0:0,105:105:1.3202:1.0:674.53:77.66 NC_000962.3 1414937 . G C . PASS . GT:AD:COV:DP:DPF:FRS:GT_CONF:GT_CONF_PERCENTILE 1/1:0,0:0,105:105:1.3202:1.0:674.53:77.66

And here is the downstream error caused by this malformed vcf, which enables us to catch these instances:

2025-02-18 16:02:01,703 - DEBUG - {1414935: ['C'], 1414936: 'G', 1414937: ['C']} 2025-02-18 16:02:01,703 - DEBUG - CGC 2025-02-18 16:02:01,703 - DEBUG - <pysam.libcbcf.VariantRecordSample object at 0x2adbf5f5c3d0> 2025-02-18 16:02:01,703 - DEBUG - Reached AD 2025-02-18 16:02:01,703 - DEBUG - {'GGG': 0, 'CGC': 0} 2025-02-18 16:02:01,703 - DEBUG - Counter({'GGG': 0, 'CGC': 0}) 2025-02-18 16:02:01,703 - DEBUG - dp value 0

variant.info.update({'AF':count/dp})
                          ~~~~~^~~

ZeroDivisionError: division by zero

These variants originate from the cortex vcf prior to minos adjudication, where a large indel is called by cortex at position 1414787, an indel that spans these variants. However, upon inspection of the bam file, we see that there are no reads at this position, so we are confused as to how cortex could make this call – I’ve attached the IGV plot of this position for the sample below.

Image

Indeed, our analysis of 49 Mycobacterium africanum samples shows 32 samples have this type of AD=0 error in the minos vcf, and by further investigating these samples, it seems to consistently be the same cortex issue where cortex calls an indel instead of a deletion at a position with no reads. These problematic indels/deletions are found at different positions in the genome for all failing samples, and consistently not in regions characterised as divergent from Mycobacterium tuberculosis (https://pmc.ncbi.nlm.nih.gov/articles/PMC497617/). I’ve attached a couple more examples where this issue is arising below – each time, the blue bar represents an indel from cortex.

Image

Image

Do you have any ideas what could be causing cortex to call an indel in these positions where there are no reads? Thanks for all your help with our clockwork implementation. Kind regards, Matt

matthewgaskinsukhsa avatar Feb 20 '25 14:02 matthewgaskinsukhsa

Hi there, coukd you show me the cortex call associated with those last 2 pileups? Are you saying cortex calls an indel which is not the deletion one would infer from that image?

iqbal-lab avatar Feb 20 '25 17:02 iqbal-lab

Hi Zam, Thank you for your prompt response. Yes, in those two examples, cortex calls an indel instead of a deletion as you would predict. I’ve added the cortex call to the two IGV plots below:

Image

Image

I've also attached the cortex file for these two samples. These indels create variants with depths of zero in the minos vcf, which cause the downstream tb-profiler to crash out as below:

For sample 2025-02-18 16:03:10,681 - DEBUG - [GenomePosition(chrom='NC_000962.3', pos=2219760), GenomePosition(chrom='NC_000962.3', pos=2219761)] 2025-02-18 16:03:10,681 - DEBUG - ---------------------- 2025-02-18 16:03:10,681 - DEBUG - GG 2025-02-18 16:03:10,681 - DEBUG - adkasjdaosdjaoisd 2025-02-18 16:03:10,681 - DEBUG - ds----- 2025-02-18 16:03:10,681 - DEBUG - defaultdict(<class 'list'>, {2219760: ['G'], 2219761: ['G']}) 2025-02-18 16:03:10,681 - DEBUG - GG 2025-02-18 16:03:10,681 - DEBUG - <pysam.libcbcf.VariantRecordSample object at 0x2acc1bbb6020> 2025-02-18 16:03:10,681 - DEBUG - Reached AD 2025-02-18 16:03:10,681 - DEBUG - {'CA': 0, 'GG': 0} 2025-02-18 16:03:10,681 - DEBUG - Counter({'CA': 0, 'GG': 0}) 2025-02-18 16:03:10,681 - DEBUG - dp value 0 variant.info.update({'AF':count/dp}) ~~~~~^~~ ZeroDivisionError: division by zero

2025-02-18 16:02:12,341 - DEBUG - [GenomePosition(chrom='NC_000962.3', pos=2338464), GenomePosition(chrom='NC_000962.3', pos=2338466)] 2025-02-18 16:02:12,341 - DEBUG - ---------------------- 2025-02-18 16:02:12,341 - DEBUG - GG 2025-02-18 16:02:12,341 - DEBUG - adkasjdaosdjaoisd 2025-02-18 16:02:12,341 - DEBUG - ds----- 2025-02-18 16:02:12,341 - DEBUG - {2338464: ['G'], 2338465: 'C', 2338466: ['G']} 2025-02-18 16:02:12,341 - DEBUG - GCG 2025-02-18 16:02:12,341 - DEBUG - <pysam.libcbcf.VariantRecordSample object at 0x2b1a81ec8ca0> 2025-02-18 16:02:12,341 - DEBUG - Reached AD 2025-02-18 16:02:12,341 - DEBUG - {'CCA': 0, 'GCG': 0} 2025-02-18 16:02:12,341 - DEBUG - Counter({'CCA': 0, 'GCG': 0}) 2025-02-18 16:02:12,341 - DEBUG - dp value 0 variant.info.update({'AF':count/dp}) ~~~~~^~~ ZeroDivisionError: division by zero

Thanks for your help with this, Matt

15bb7215-9f35-450c-81d9-c9b34136272b.R.cortex.vcf.txt 17f2cfa8-a67f-4766-b038-244a8043f7b3.cortex.vcf.txt

matthewgaskinsukhsa avatar Feb 21 '25 09:02 matthewgaskinsukhsa

Hi Zam,

Just following up on my previous question to see if you have any additional comments or thoughts about this issue.

Thank you again for your support with this.

Kind regards, Matt

matthewgaskinsukhsa avatar Feb 26 '25 10:02 matthewgaskinsukhsa

Sorry for slow reply, will discuss with @martinghunt on Monday

iqbal-lab avatar Mar 01 '25 13:03 iqbal-lab

It looks like cortex is calling a long alt allele, which it must have assembled from the reads. And then the reads map well to that alt allele, but not to the ref because it's too different from the reads.

What happens if you put that alt allele into the mapping ref in place of the ref allele, and then map the reads?

martinghunt avatar Mar 03 '25 13:03 martinghunt

yes agree! Cortex will have assembled a long allele and the reads don't map to the reference so it looks like a deletion from the bam. Martin's idea of switching in the alt allele and mapping will confirm (or deny) this idea

iqbal-lab avatar Mar 03 '25 13:03 iqbal-lab

Hi both, thank you for your suggestion.

I've replaced the original sequence with the long alternate allele called by cortex, and we now see reads spanning this position in the bam.

However, we're now confused by the nature of the alternative sequence called by clockwork. Aligning the consensus fastas from our pipelines against both the original and modified references shows no obvious pattern in the base called by clockwork. Sometimes, the consensus contains a base from the original reference, other times from the alternative allele, and often the assigned sequence appears random.

I've attached an image of our alignment of the two consensus fastas produced by our tb-pipelines (rows 1 and 2) against the original and modified references (rows 3 and 4 respectively). The alternative allele is inserted from position 2222075 (where sequences begin to diverge).

Please let us know if you have any thoughts about why clockwork is calling this seemingly random alternative sequence from this position.

Thanks, Matt

Image

matthewgaskinsukhsa avatar Mar 10 '25 12:03 matthewgaskinsukhsa

Hi there

A few things

  1. When you replace the ref-allele with the alt-allele for that long variant, and then map reads, how good does the depth look? does it all line up and look consistent with the left and right flanks?
  2. I think your fundamental question is - why is cortex calling this long allele that is nothing like the reference? Is that right? And following that - why does it look nothing like the reference? Well - cortex is an assembler, so it has constructed it by assembling reads, effectively into a contig that fits there in that context of the reference. As for why it doesn't look at all similar - this is what diverged regions look like I guess. It happens in many species - we started looking into a region like this ion P falciparum, where 1kb of the genome had 2 very different haplotypes/alleles, and when you aligned them, there was on average one SNP every 3bp for 1kb - but basically some subregions looked very different. If the region comes from elsewhere (eg by gene conversion, which we know does happen in Mtb) or for some reason there's no reason why it should look too similar.

I think the first thing to do is convince ourselves what the true sequence is in that region, and when we can interpret afterwards. Would you be able to share the sample and the coords with me and @martinghunt ?

iqbal-lab avatar Mar 10 '25 13:03 iqbal-lab

Right - we've had some conversations offline also, let me summarise my understanding.

  1. The cortex calls look correct - these are regions which are highly diverged from the reference, so you get a long reference allele, and a long alternate allele, and if you were to align those two, there would be loads of SNPs in a small space. In this case, the two alleles are slightly different lengths.
  2. Minos splits long ref/alt alleles into sets of SNPs and indels. As far as we know this was all correct too - but there is an argument that the Cortex output is more "correct" and more useful. It all depends on what you're doing, whether you prefer the two long alleles, which makes it clear this is a diverged locus, or loads of SNPs and indels.
  3. However, clockwork can also outputs a couple of other things purely for building a phylogeny/find Neighbour. These are a) a gvcf, which covers every base in the genome. This takes records from the minos VCF where they exist, and for other places, uses the samtools VCF . b) an inferred/consensus fasta, which discards indels, is always the same length, and is purely for the phylogeny.

The bottom line becomes

  • If you want to do resistance prediction, definitely do not use that inferred fasta, as it ignores indels.
  • If you are worried about detecting SNPs in the diverged region, I sympathise, this is why we have developed graph-based tools. But for TB, there are (afaik) no resistance mutations in such regions. This is a major issue for all other bacteria, as it's essentially the pangenome problem.

iqbal-lab avatar Apr 07 '25 13:04 iqbal-lab