Make genotype compatible issue with long alleles

Open leoisl opened this issue 6 years ago • 5 comments

In one gene we have the following records, one of them with very long alleles. This is a real gene, but I am showing a simplified view of it:

pos, ref, alt, gt, gt_conf
1, <long_ref_spanning_1500_bps>, <long_alt_spanning_1500_bps>, 1, 10
100, A, C, 0, 50

The first record makes a statement about 1500 bp and calls the alt. The second record makes a statement about 1 bp and calls the ref. These records conflict, so we resolve the conflict to make the VCF compatible. Record2 wins because its call has the higher gt_conf, so we change the gt of Record1 from 1 to 0. That changes our statement about the whole 1500 bp: some bases are identical between ref and alt, but for every base that differs we are now asserting something else. So we only needed to change our statement about 1 bp (the one at position 100), yet we end up changing it for the other 1499 bp as well, none of which conflict with the base at position 100.
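To make the effect concrete, here is a minimal Python sketch of the whole-record resolution described above. It is not pandora's actual code: the `Record` class, `overlaps`, and `resolve_whole_record` are hypothetical names, and the rule "overlapping records with different gt conflict, and the loser takes the winner's gt" is an assumption that simply mirrors the two-record example.

```python
from dataclasses import dataclass


@dataclass
class Record:
    pos: int        # 1-based start position on the reference
    ref: str
    alt: str
    gt: int         # 0 = ref called, 1 = alt called
    gt_conf: float

    @property
    def end(self) -> int:
        # Last reference position covered by this record (inclusive).
        return self.pos + len(self.ref) - 1


def overlaps(a: Record, b: Record) -> bool:
    return a.pos <= b.end and b.pos <= a.end


def resolve_whole_record(records: list[Record]) -> None:
    """Assumed rule: for each overlapping pair with different gt,
    the record with lower gt_conf takes the winner's gt."""
    for i, a in enumerate(records):
        for b in records[i + 1:]:
            if overlaps(a, b) and a.gt != b.gt:
                winner, loser = (a, b) if a.gt_conf >= b.gt_conf else (b, a)
                loser.gt = winner.gt


# Toy stand-ins for the 1500 bp alleles (made-up sequences).
record1 = Record(1, "A" * 1500, "C" * 1500, gt=1, gt_conf=10)
record2 = Record(100, "A", "C", gt=0, gt_conf=50)
resolve_whole_record([record1, record2])
print(record1.gt, record2.gt)  # prints: 0 0
```

In this toy case a single higher-confidence 1 bp record is enough to flip the call for the whole 1500 bp record.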

I am wondering if we should make only one base change in this case, i.e. if Record1 says that at position 100 we should have a C, and Record2 says it is an A (and with higher likelihood), we should just change this base only, not the whole call.
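A rough sketch of what this single-base change could look like, in Python. This is only an illustration of the idea, not something pandora implements: `patch_called_allele` is a hypothetical helper, and it assumes substitution-only alleles, so that an offset into the called allele maps directly to a reference position.

```python
def patch_called_allele(long_pos: int, long_called: str,
                        short_pos: int, short_called: str) -> str:
    """Overwrite only the bases of the long called allele that are
    covered by the higher-confidence short record.

    Positions are 1-based reference coordinates. Assumes no indels,
    so allele offsets and reference offsets coincide.
    """
    offset = short_pos - long_pos
    if offset < 0 or offset + len(short_called) > len(long_called):
        raise ValueError("short record is not contained in the long record")
    return (long_called[:offset]
            + short_called
            + long_called[offset + len(short_called):])


# Toy example: a 10 bp alt called at position 1, and a 1 bp record at
# position 4 whose ref "A" wins the conflict.
patched = patch_called_allele(1, "CCCCCCCCCC", 4, "A")
print(patched)  # prints: CCCACCCCCC
```

Note that the patched sequence need not be one of the alleles listed in the long record, so it could not be expressed just by changing that record's gt.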

The top 5 genes with the highest recall difference between no_denovo and denovo show this issue. With such long alleles, it is almost certain that we will change our statement about all the called bases, because the long record overlaps dozens of other records, and all it takes is a single one of them having a higher gt_conf.

leoisl avatar Nov 19 '19 13:11 leoisl

We had a long and very good discussion about this. We should have a call with @rmcolq.

iqbal-lab avatar Nov 19 '19 15:11 iqbal-lab

Yeah. It would be reassuring if @rmcolq could confirm that this can indeed happen, though; it might still be my misunderstanding of pandora.

leoisl avatar Nov 19 '19 15:11 leoisl

I think it can happen like you describe. Although I would have thought that the updated genotype for record1 would be '.' not '0' (as we don't know about the other bases?). The only downside I see is that the resulting VCF will no longer reflect the PRG. This (changing specific letters to allow us to keep the longer allele called) could get more complicated when there are multiple samples

rmcolq avatar Nov 19 '19 15:11 rmcolq

> I think it can happen like you describe. Although I would have thought that the updated genotype for record1 would be '.' not '0' (as we don't know about the other bases?).

I guess when we solve conflicts, and the winner genotypes towards the ref, we change the genotype of the loser towards the ref also?

> The only downside I see is that the resulting VCF will no longer reflect the PRG. This (changing specific letters to allow us to keep the longer allele called) could get more complicated when there are multiple samples

Indeed!

leoisl avatar Nov 19 '19 15:11 leoisl

We had some real gene examples showing that this is a non-trivial problem that deserves more careful thinking. The maximum likelihood path finding algorithm is more precise now, and it actually produces better results than the regenotyping/make gt compatible procedure (at least in the 4-way analysis). We are now regenotyping only the records in the ML path. This issue should be revisited later, though...

leoisl avatar Nov 20 '19 17:11 leoisl