
Bug on gap computation if samples have different depths

Open leoisl opened this issue 5 years ago • 9 comments

The minimum kmer coverage parameter defines a threshold below which a kmer is considered absent. Minimizer kmers with coverage below this threshold are treated as absent, and these absent kmers are counted as coverage gaps when computing the likelihood of an allele. This threshold is set once for all samples, at 0.1 times the expected kmer depth of the first sample:

https://github.com/rmcolq/pandora/blob/955e1c0c7bcd127361d66a74342a8ac19d651a2a/src/compare_main.cpp#L420-L421

This means that the minimum kmer coverage threshold for all samples is set based on the kmer depth of the first sample. This threshold is later used to compute the gaps:

https://github.com/rmcolq/pandora/blob/955e1c0c7bcd127361d66a74342a8ac19d651a2a/src/sampleinfo.cpp#L73-L76

when computing the likelihood of all alleles:

https://github.com/rmcolq/pandora/blob/955e1c0c7bcd127361d66a74342a8ac19d651a2a/src/sampleinfo.cpp#L102

thus affecting genotyping and genotype confidence, but I don't actually know what the impact of this bug on our evaluation would be.

The current code works fine only if all samples have a similar expected coverage depth.

I am currently working on https://github.com/rmcolq/pandora/issues/216, which will enable us to compare genotype confidences of samples with different depths by comparing their percentiles. An implicit assumption there is that pandora genotyping should no longer assume all samples have similar coverage.

This bug is not hard to fix, so the solution will be implemented very soon. If you remember anything else that might be affected by dropping the assumption that all samples have similar coverage, please let me know.

leoisl avatar Jun 05 '20 16:06 leoisl

good spot! It used to be true that the error rate was calculated for the first sample and then applied to all the others; is that bug gone, or does it remain?

iqbal-lab avatar Jun 05 '20 16:06 iqbal-lab

The error rate update bug is a convoluted integration bug, I think:

  1. The first sample is mapped using the default error rate or the one given as an argument: https://github.com/rmcolq/pandora/blob/955e1c0c7bcd127361d66a74342a8ac19d651a2a/src/compare_main.cpp#L400-L402

  2. The error rate is re-estimated after the mapping: https://github.com/rmcolq/pandora/blob/955e1c0c7bcd127361d66a74342a8ac19d651a2a/src/compare_main.cpp#L416-L417

  3. The re-estimated error rate of sample i-1 is then used when mapping sample i: https://github.com/rmcolq/pandora/blob/955e1c0c7bcd127361d66a74342a8ac19d651a2a/src/compare_main.cpp#L400-L402

I guess it is a catch-22: we can't estimate the error rate without mapping, but to map we need an error rate. The current solution works by assuming that all datasets have a similar error rate, which is a fair assumption, to be honest. This does not, however, affect any of our evaluation. The error rate is only used in the binomial model: https://github.com/rmcolq/pandora/blob/955e1c0c7bcd127361d66a74342a8ac19d651a2a/src/estimate_parameters.cpp#L292

and is only updated if we are using the binomial model: https://github.com/rmcolq/pandora/blob/955e1c0c7bcd127361d66a74342a8ac19d651a2a/src/estimate_parameters.cpp#L271-L276

As we always use the negative binomial kmer coverage model (the default), this bug does not affect our analyses.

Actually, I don't think it is really a bug. I think the assumption that all datasets given to compare at a given time have similar error rates is fair. What do you think?

leoisl avatar Jun 05 '20 16:06 leoisl

I think it's fine to map with a default rate and then re-estimate a better one. I don't think it's a great assumption that all samples given to compare have the same error rate, but I can live with it for now; I mean, if we make a per-sample estimate, it makes sense to use it per sample. After the paper and release come out, we should properly assess whether to keep the binomial model in the code if it's not being used.

iqbal-lab avatar Jun 05 '20 16:06 iqbal-lab

Note that when using the negative binomial model, we never update or re-estimate the error rate; we always use the default or the one given as an argument. This impacts the mapping, which uses the error rate to define the size of the hit clusters:

https://github.com/rmcolq/pandora/blob/955e1c0c7bcd127361d66a74342a8ac19d651a2a/src/utils.cpp#L452

I wonder why the error rate is not re-estimated when using the negative binomial model; it seems like a very nice idea!

leoisl avatar Jun 05 '20 16:06 leoisl

makes sense - after the paper!

iqbal-lab avatar Jun 05 '20 16:06 iqbal-lab

Hmmm, just came across this again. I'm assuming this fix isn't present on the dev branch? I'm wondering if this is having an impact on our compare results for the head-to-head, @iqbal-lab?

mbhall88 avatar Mar 11 '21 12:03 mbhall88

Hello, no, this fix isn't present on dev...

leoisl avatar Mar 11 '21 12:03 leoisl

AARGH i thought this was fixed

iqbal-lab avatar Mar 11 '21 15:03 iqbal-lab

This is needed for scalability; added label.

leoisl avatar Oct 22 '21 08:10 leoisl