biohansel
biohansel copied to clipboard
Automate determining coverage minimum and maximum coverage
Instead of asking end user for these values, we should determine genome coverage based on the reads themselves.
Current Solution (working):
-
Estimate the expected k-mer coverage depth through finding the maxima of the k-mer depth coverage values.
-
Calculate the error rate using
num_kmers_appear_once / num_unique_kmers
We do it this way, as we use the assumption that almost every k-mer that appears exactly once is an error.
-
Multiply error rate, and k-mer coverage depth to get the (slightly underestimated) expected k-mer coverage depth of errors.
-
Use a Poisson distribution with expected coverage value, and k-mer coverage depth of errors to pull out the minimum depth to be confident that the observation is not entirely caused by errors.
- Which is min_kmer_coverage.
We decided we are not going to implement a MAX kmer coverage auto as we don't have a good reason to.
Since there is a plan to remove the dependency of Jellyfish we can no longer calculate a minimum k-mer value. Leaving this issue up in the mean time, but no more progress is being made on this.
We can leave the Jellyfish dependency for now and merge the auto min kmer threshold system into bio_hansel with the caveat that in order to do the auto-kmer threshold, the user will need to run Jellyfish and be okay with the analyses taking longer and using more computational resources.
Automatically determining the min coverage depth could be useful for other applications like setting min coverage for some de novo assemblers, setting min freq for kmers when running Mash with reads, as well as for setting the min k-mer threshold for bio_hansel.
So the code you've written could be extracted into a separate package and implemented as a generic tool with a wrapper for Galaxy if there's a good use case for it, which I think there is.