
Fixing maximum kmer count kmc flag

Open oxygen311 opened this issue 5 months ago • 2 comments

Dear CoLoRd developers team!

Hope you are doing well.

Issue:

While working with CoLoRd in reference-based mode, we noticed strange behaviour when the reads are much shorter than the reference (length of reads << length of reference). Our investigation traced it to huge nodes in the similarity graph, which are far more frequent than expected.

The only reason it usually works is this line: https://github.com/refresh-bio/colord/blob/25b28600d0716805beffab6941eb2c6b5f77014a/src/colord/reads_sim_graph.cpp#L390. However, this condition is supposed to be true anyway when filtering works properly.

We noticed this in reference-based mode because there is no such condition when a node is added to the graph from pseudo-reads.

Proposed Fix:

The fix is simply to use the correct kmc flag: -cx instead of -cs. Here is the relevant excerpt from the kmc help:

>  -ci<value> - exclude k-mers occurring less than <value> times (default: 2)
>  -cs<value> - maximal value of a counter (default: 255)
>  -cx<value> - exclude k-mers occurring more of than <value> times (default: 1e9)

This should also fix the compression logic, since k-mers are supposed to be chosen based on count as well as hash.
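
To make the difference between the two flags concrete, here is a toy Python model of the semantics described in the help text above. It is only a sketch, not kmc's or CoLoRd's actual code: the helper names (`count_kmers`, `apply_cs`, `apply_cx`) and the sample reads are made up for illustration.

```python
from collections import Counter


def count_kmers(seqs, k):
    """Count every k-mer occurrence across all sequences (toy counter)."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def apply_cs(counts, cs):
    """Model of -cs: counters saturate at cs, but no k-mer is removed."""
    return {kmer: min(c, cs) for kmer, c in counts.items()}


def apply_cx(counts, cx):
    """Model of -cx: k-mers occurring more than cx times are excluded."""
    return {kmer: c for kmer, c in counts.items() if c <= cx}


# Hypothetical reads: "ACG" occurs 4 times, "CGT" twice, the rest once.
reads = ["ACGTAC", "ACGTTT", "ACGAAA", "ACGCCC"]
counts = count_kmers(reads, 3)

with_cs = apply_cs(counts, 2)
with_cx = apply_cx(counts, 2)

print("ACG" in with_cs, with_cs["ACG"])  # kept, counter capped at 2
print("ACG" in with_cx)                  # removed as too frequent
```

In this model the over-represented k-mer survives `-cs` with a capped counter, so it can still produce a huge node, while `-cx` removes it before the graph is built.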

Testing:

We have performed thorough testing, including specific scenarios with short reads and long references. Feel free to test it yourself.

Acknowledgments:

Special thanks to @iam28th for their assistance in tracing the problem back to the kmc flag.

We hope this change is helpful and improves the results.

Best regards, Alexey

oxygen311 avatar Feb 03 '24 00:02 oxygen311