Bug on gap computation if samples have different depths
The minimum kmer coverage parameter defines the threshold at which a kmer is considered present or absent. Minimizer kmers with coverage below this threshold are considered absent, and these absent kmers count as coverage gaps when computing the likelihood of an allele. This threshold is set once for all samples, and equals 0.1 times the expected kmer depth of the first sample:
https://github.com/rmcolq/pandora/blob/955e1c0c7bcd127361d66a74342a8ac19d651a2a/src/compare_main.cpp#L420-L421
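To make the bug concrete, here is a minimal C++ sketch (all names are hypothetical, not pandora's actual API): the current behaviour fixes one threshold from the first sample's depth, whereas a per-sample threshold would use each sample's own depth.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical names, not pandora's actual API.
struct Sample {
    double expected_kmer_depth;
};

// Buggy behaviour: one threshold for every sample, derived from sample 0 only.
uint32_t global_min_kmer_covg(const std::vector<Sample>& samples) {
    return static_cast<uint32_t>(0.1 * samples[0].expected_kmer_depth);
}

// Fix: each sample gets a threshold from its own expected depth.
uint32_t per_sample_min_kmer_covg(const Sample& sample) {
    return static_cast<uint32_t>(0.1 * sample.expected_kmer_depth);
}
```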
This means that the minimum kmer coverage threshold for all samples is set based on the kmer depth of the first sample. This threshold is later used to compute the gaps:
https://github.com/rmcolq/pandora/blob/955e1c0c7bcd127361d66a74342a8ac19d651a2a/src/sampleinfo.cpp#L73-L76
when computing the likelihood of all alleles:
https://github.com/rmcolq/pandora/blob/955e1c0c7bcd127361d66a74342a8ac19d651a2a/src/sampleinfo.cpp#L102
thus affecting genotyping and genotype confidence, although I don't actually know what the impact of this bug would be on our evaluation.
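To illustrate the gap computation itself, a minimal sketch (hypothetical names, not pandora's actual code): the gap fraction is the share of an allele's minimizer kmers whose coverage falls below the threshold, so a global threshold derived from a deep first sample wrongly flags kmers of a shallower sample as gaps.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: the fraction of minimizer kmers on an allele whose
// coverage is below min_kmer_covg is treated as the gap fraction, which
// penalizes that allele's likelihood.
double gap_fraction(const std::vector<uint32_t>& kmer_coverages,
                    uint32_t min_kmer_covg) {
    if (kmer_coverages.empty()) {
        return 0.0;
    }
    std::size_t gaps = 0;
    for (const uint32_t covg : kmer_coverages) {
        if (covg < min_kmer_covg) {
            ++gaps;  // this kmer is considered absent
        }
    }
    return static_cast<double>(gaps) / static_cast<double>(kmer_coverages.size());
}
```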
The current code works fine if all samples have a similar expected depth of coverage.
I am currently working on https://github.com/rmcolq/pandora/issues/216, which will enable us to compare genotype confidences of samples with different depths, because we will be comparing their percentiles. An implicit assumption there is that pandora genotyping should no longer assume that all samples have similar coverage.
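As a rough illustration of the percentile idea (a sketch only, not the #216 implementation):

```cpp
#include <algorithm>
#include <vector>

// Sketch only: the empirical percentile of one genotype confidence within its
// own sample's confidence distribution, which makes confidences comparable
// across samples with different depths.
double confidence_percentile(std::vector<double> confidences, double value) {
    if (confidences.empty()) {
        return 0.0;
    }
    std::sort(confidences.begin(), confidences.end());
    const auto below =
        std::lower_bound(confidences.begin(), confidences.end(), value);
    return 100.0 * static_cast<double>(below - confidences.begin())
        / static_cast<double>(confidences.size());
}
```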
This bug is not hard to solve, so the fix will be implemented very soon. If you remember anything else that might be affected by no longer assuming samples have similar coverage, please tell me.
good spot! it used to be true that error rate was calculated for the first sample and then applied to all - is that bug gone or does it remain?
The error rate update bug is a convoluted integration bug, I think...
- The first sample is mapped using the default or argument error rate: https://github.com/rmcolq/pandora/blob/955e1c0c7bcd127361d66a74342a8ac19d651a2a/src/compare_main.cpp#L400-L402
- The error rate is re-estimated after the mapping: https://github.com/rmcolq/pandora/blob/955e1c0c7bcd127361d66a74342a8ac19d651a2a/src/compare_main.cpp#L416-L417
- The re-estimated error rate of sample i-1 is then used in the mapping of sample i (see the sketch after this list): https://github.com/rmcolq/pandora/blob/955e1c0c7bcd127361d66a74342a8ac19d651a2a/src/compare_main.cpp#L400-L402
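To make the chaining explicit, a minimal sketch with hypothetical function names (not pandora's actual code):

```cpp
#include <string>
#include <vector>

// Hypothetical stub: map a sample with the given error rate and return a
// rate re-estimated from its alignments (placeholder body here).
double map_and_reestimate(const std::string& /*sample*/, double error_rate) {
    return error_rate;  // placeholder re-estimation
}

void compare_samples(const std::vector<std::string>& samples,
                     double default_error_rate) {
    double error_rate = default_error_rate;  // sample 0 uses the default/CLI value
    for (const auto& sample : samples) {
        // sample i is mapped with the rate re-estimated from sample i-1
        error_rate = map_and_reestimate(sample, error_rate);
    }
}
```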
I guess it is a catch-22: we can't estimate the error rate without mapping, but we need an error rate to map. This solution works by assuming that all datasets have a similar error rate, which is a fair assumption, to be honest. This, however, does not affect any of our evaluation. The error rate is only used in the binomial model: https://github.com/rmcolq/pandora/blob/955e1c0c7bcd127361d66a74342a8ac19d651a2a/src/estimate_parameters.cpp#L292
and is only updated if we are using the binomial model: https://github.com/rmcolq/pandora/blob/955e1c0c7bcd127361d66a74342a8ac19d651a2a/src/estimate_parameters.cpp#L271-L276
As we always use the negative binomial kmer coverage model (the default), this bug does not affect our analyses.
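For intuition, here is one common way an error rate enters a binomial kmer coverage model; this is an illustrative sketch, not pandora's exact formula:

```cpp
#include <cmath>

// Illustration only, not pandora's exact model: with per-base error rate e,
// a k-mer is read error-free with probability p = (1 - e)^k. Under a binomial
// model, the log-likelihood of c error-free observations out of d reads
// (requires c <= d) is log C(d, c) + c*log(p) + (d - c)*log(1 - p).
double binomial_log_likelihood(unsigned d, unsigned c, double e, unsigned k) {
    const double p = std::pow(1.0 - e, k);
    return std::lgamma(d + 1.0) - std::lgamma(c + 1.0) - std::lgamma(d - c + 1.0)
        + c * std::log(p) + (d - c) * std::log1p(-p);
}
```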
Actually, I don't think it is really a bug. I think the assumption that all datasets given to compare at a given time have similar error rates is fair. What do you think?
I think it's fine to map with a default rate and then re-estimate better. I don't think it's a great assumption that all samples given to compare have the same error rate, but I can live with it for now - I mean, if we make a per-sample estimate, it makes sense to use it per sample. I think we'd better properly assess whether we keep the binomial model in the code if it's not being used, after the paper and release come out.
Note that when using the negative binomial model, we never update or re-estimate the error rate; we always use the default or the one given as an argument. This impacts the mapping, which uses the error rate to define the size of the hit clusters:
https://github.com/rmcolq/pandora/blob/955e1c0c7bcd127361d66a74342a8ac19d651a2a/src/utils.cpp#L452
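If I understand the intent, the error rate bounds how many of a read's minimizing kmers are expected to survive sequencing errors, which in turn bounds a plausible hit-cluster size. A hedged sketch of that idea (not the actual formula in utils.cpp):

```cpp
#include <cmath>
#include <cstdint>

// Sketch of the idea only: with per-base error rate e, a k-mer survives
// error-free with probability ~(1 - e)^k, so the expected number of true hits
// from n minimizing kmers is n * (1 - e)^k. A minimum cluster size can then
// be set as some fraction of that expectation.
uint32_t min_cluster_size(uint32_t n_minikmers, double error_rate,
                          uint32_t k, double fraction = 0.5) {
    const double expected_hits = n_minikmers * std::pow(1.0 - error_rate, k);
    return static_cast<uint32_t>(fraction * expected_hits);
}
```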
I wonder why the error rate is not re-estimated when using the negative binomial model; re-estimating it seems like a very nice idea!
makes sense - after the paper!
Hmmm, just came across this again. I'm assuming this fix isn't present on the dev branch? I'm wondering if this is having an impact on our compare results for the head-to-head @iqbal-lab ?
Hello,
no, this fix isn't present on dev...
AARGH i thought this was fixed
This is needed for scalability, added label