biohansel Automate determining coverage minimum and maximum coverage

Automate determining coverage minimum and maximum coverage

Open Takadonet opened this issue 7 years ago • 4 comments

Instead of asking end user for these values, we should determine genome coverage based on the reads themselves.

Oct 23 '17 16:10 Takadonet

Current Solution (working):

Estimate the expected k-mer coverage depth through finding the maxima of the k-mer depth coverage values.
Calculate the error rate using

num_kmers_appear_once / num_unique_kmers

We do it this way, as we use the assumption that almost every k-mer that appears exactly once is an error.

Multiply error rate, and k-mer coverage depth to get the (slightly underestimated) expected k-mer coverage depth of errors.
Use a Poisson distribution with expected coverage value, and k-mer coverage depth of errors to pull out the minimum depth to be confident that the observation is not entirely caused by errors.

Which is min_kmer_coverage.

Nov 08 '17 15:11 mgopez

We decided we are not going to implement a MAX kmer coverage auto as we don't have a good reason to.

Nov 08 '17 15:11 mgopez

Since there is a plan to remove the dependency of Jellyfish we can no longer calculate a minimum k-mer value. Leaving this issue up in the mean time, but no more progress is being made on this.

Dec 19 '17 14:12 mgopez

We can leave the Jellyfish dependency for now and merge the auto min kmer threshold system into bio_hansel with the caveat that in order to do the auto-kmer threshold, the user will need to run Jellyfish and be okay with the analyses taking longer and using more computational resources.

Automatically determining the min coverage depth could be useful for other applications like setting min coverage for some de novo assemblers, setting min freq for kmers when running Mash with reads, as well as for setting the min k-mer threshold for bio_hansel.

So the code you've written could be extracted into a separate package and implemented as a generic tool with a wrapper for Galaxy if there's a good use case for it, which I think there is.

Dec 19 '17 15:12 peterk87

biohansel biohansel copied to clipboard

Automate determining coverage minimum and maximum coverage

biohansel
biohansel copied to clipboard