minimizer/kmer string compression

Open jianshu93 opened this issue 3 years ago • 5 comments

Hello Chirag,

Does fastANI compress k-mer/minimizer strings by default? I did not see it after checking the code. I noticed that the k-mer counting code in Heng Li's repo (based on kseq.h) (https://github.com/lh3/kmer-cnt/blob/master/kc-c1.c) encodes AGCT as 0, 1, 2, 3, and so on. We could actually do better and represent each base using only 2 bits of memory (00, 01, 10, 11). Since fastANI consumes a lot of memory when running all versus all, I am wondering whether this could save a lot of memory. There are several Rust libraries that compress k-mers into 2 bits per base and save a lot of memory (https://github.com/jean-pierreBoth/kmerutils/blob/master/src/base/alphabet.rs). I noticed there is also one for C++: https://github.com/dassencio/dna-compression

Thanks,

Jianshu

jianshu93 avatar Oct 17 '22 16:10 jianshu93

Hi, ATCG is being represented only using 2 bits (00 is 0, 01 is 1, 10 is 2 and 11 is 3) https://github.com/lh3/kmer-cnt/blob/e2574719cfb784915d80eb5828e78dfae4cfdd7b/kc-c1.c#L36
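
For readers following along, here is a minimal sketch of what 2-bit k-mer packing looks like. It is not taken from FastANI or kc-c1.c; the function names `base_code`, `pack_kmer`, and `unpack_kmer` are hypothetical, and the mapping A=0, C=1, G=2, T=3 is an assumption chosen for illustration. It assumes k is at most 32 so that a k-mer fits in one 64-bit integer.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: map a nucleotide to a 2-bit code
// (assumed mapping A=0, C=1, G=2, T=3); returns -1 for any other character.
inline int base_code(char c) {
    switch (c) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return -1;
    }
}

// Pack a k-mer (1 <= k <= 32) into a 64-bit integer, 2 bits per base.
// Returns false if the k-mer is empty, too long, or contains a non-ACGT base.
inline bool pack_kmer(const std::string& kmer, std::uint64_t& out) {
    if (kmer.empty() || kmer.size() > 32) return false;
    std::uint64_t x = 0;
    for (char c : kmer) {
        const int code = base_code(c);
        if (code < 0) return false;
        x = (x << 2) | static_cast<std::uint64_t>(code);
    }
    out = x;
    return true;
}

// Inverse of pack_kmer: recover the k-base string from the packed value.
inline std::string unpack_kmer(std::uint64_t x, std::size_t k) {
    static const char alphabet[] = "ACGT";
    std::string s(k, 'A');
    for (std::size_t i = 0; i < k; ++i) {
        s[k - 1 - i] = alphabet[x & 3u];  // lowest 2 bits hold the last base
        x >>= 2;
    }
    return s;
}
```

With this layout a 16-mer takes 4 bytes of payload instead of 16 bytes as a character string, and `unpack_kmer(x, k)` round-trips any value produced by `pack_kmer`.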

cjain7 avatar Oct 18 '22 05:10 cjain7

Thanks Chirag, I also noticed this in kc-c1.c.

Why does all versus all consume so much memory? Is there any way to reduce it if the DNA strings are already compressed?

Thanks

Jianshu

jianshu93 avatar Oct 18 '22 05:10 jianshu93

Or do we need to implement compression for fastANI?

Thanks.

Jianshu

jianshu93 avatar Oct 19 '22 09:10 jianshu93

Hello Chirag,

If there is no need to do string compression for fastANI, I will close this issue.

Thanks,

Jianshu

jianshu93 avatar Oct 26 '22 00:10 jianshu93

Sorry Jianshu, I am not clear what string compression means in this context. FastANI maintains a k-mer database extracted from all genomes, which is subsequently queried during the mapping stage.
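
To make the idea of a k-mer database that is built once and then queried more concrete, here is a rough sketch. It is not FastANI's actual data structure: the `KmerIndex` class, the `Hit` struct, and the helper `code_of` are hypothetical names, it indexes every k-mer rather than minimizers, and it assumes 1 <= k <= 32 with the mapping A=0, C=1, G=2, T=3.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical record: where a k-mer occurs (genome and 0-based start position).
struct Hit {
    std::uint32_t genome_id;
    std::uint32_t pos;
};

// Assumed 2-bit mapping A=0, C=1, G=2, T=3; -1 for any other character.
inline int code_of(char c) {
    switch (c) {
        case 'A': case 'a': return 0;
        case 'C': case 'c': return 1;
        case 'G': case 'g': return 2;
        case 'T': case 't': return 3;
        default: return -1;
    }
}

// Illustrative index: packed k-mer -> list of occurrences across all genomes.
class KmerIndex {
public:
    // Assumes 1 <= k <= 32 so a k-mer fits in 64 bits.
    explicit KmerIndex(std::size_t k)
        : k_(k),
          mask_(k >= 32 ? ~std::uint64_t{0}
                        : ((std::uint64_t{1} << (2 * k)) - 1)) {}

    // Slide a window over seq, updating the packed k-mer one base at a time.
    void add_genome(std::uint32_t genome_id, const std::string& seq) {
        std::uint64_t x = 0;
        std::size_t valid = 0;  // number of consecutive ACGT bases seen
        for (std::size_t i = 0; i < seq.size(); ++i) {
            const int c = code_of(seq[i]);
            if (c < 0) { valid = 0; x = 0; continue; }  // restart after N etc.
            x = ((x << 2) | static_cast<std::uint64_t>(c)) & mask_;
            if (++valid >= k_) {
                table_[x].push_back(
                    {genome_id, static_cast<std::uint32_t>(i + 1 - k_)});
            }
        }
    }

    // Return all occurrences of kmer; empty if absent or malformed.
    std::vector<Hit> query(const std::string& kmer) const {
        if (kmer.size() != k_) return {};
        std::uint64_t x = 0;
        for (char ch : kmer) {
            const int c = code_of(ch);
            if (c < 0) return {};
            x = (x << 2) | static_cast<std::uint64_t>(c);
        }
        const auto it = table_.find(x);
        return it == table_.end() ? std::vector<Hit>{} : it->second;
    }

    // Number of distinct k-mers stored.
    std::size_t size() const { return table_.size(); }

private:
    std::size_t k_;
    std::uint64_t mask_;
    std::unordered_map<std::uint64_t, std::vector<Hit>> table_;
};
```

In a layout like this one, the keys are already packed integers, so memory grows with the number of stored k-mers and occurrences across all genomes rather than with how the bases are encoded. Whether the same holds for FastANI's own index would need to be checked against its source.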

cjain7 avatar Oct 27 '22 07:10 cjain7