KMC icon indicating copy to clipboard operation
KMC copied to clipboard

kmc_tools simple intersect & union, not considering reverse complement?

Open zhaotao1987 opened this issue 5 months ago • 3 comments

Hello, in my project I would like to just calculate the kmer counts for one side of the reads, read_1.fq.gz, for example. And then afterwards, I would like to compare kmer.dump across samples, I would like to calculate shared kmers relative to total unique kmers, to this end, considering reverse complement is important, but I found kmc_tools intersect does not consider reverse complemented kmers between samples? Or, I have to do kmer calculation for both reads for each sample, then compare, that is the only way? Thanks very much.

zhaotao1987 avatar Jan 13 '24 10:01 zhaotao1987

Hi,

I am not fully understanding. In general, kmc_tools is unaware of strands (at least as far as I remember). It just takes k-mers as they are in the kmc database. Now, when you run kmc, you may count either:

  • k-mers, regardless of their reverse complement or
  • canonical k-mers, i.e., each k-mer is transformed to reverse complement, and the lower of these two (original and rev. comply) is chosen. By default, canonical k-mers are used—if you want the first option, use the -b switch for kmc run. If you want all the k-mers in the reverse form, it's not currently supported. We may consider adding this. For now, you could reverse all input reads and use this as a kmc input and run kmc with -b. Let me know if that is helpful, and maybe try to be more specific about what you want to achieve (maybe with some trivial example, a couple of samples, input files, sequences, and k-mers you want).

marekkokot avatar Jan 13 '24 11:01 marekkokot

Hi Marek,

Thank you for your prompt response; much appreciated. Great to learn that by default, KMC already considers reverse complement (RC).

I'm currently working on improvise something and would like to share my approach with you, seeking your input on its viability. I'm utilizing resequencing data from 300 samples, encompassing various species/subspecies within Malus. My goal is to explore the phylogenetic relationships among these samples based on kmer similarities, first to generate a kmer similarity matrix for all samples. The steps involve kmer counting for each sample and kmer comparison for every pair of samples ( kmc -> kmc_dump -> kmer_comparison.py ). To optimize efficiency in terms of time and space, I'm contemplating using only the R1 read for each sample.

So learned from what you said, if I used the -b mode, RC kmers from different samples might be treated as distinct kmers, thus introducing bias into the results. To clarify, for each comparison, I'm calculating shared (unique) kmers relative to total (unique) kmers. I'm pleased to learn that the software inherently considers both original and RC simultaneously by default. (oh, maybe still problematic... since the lower of these two was chosen , it may varied across samples.. ), I seems to me that kmc_tools simple intersection function does not consider RC?

Moving on to another aspect of my project, I have a question about the appropriate values for -ci and -cx when comparing shared kmers between samples. I think -ci should be set at 2 to keep all shared ones and discard error-prone kmers occurring only once. As for -cx, I've used 1000, but I'd appreciate your thoughts on this choice. The kmer size is 21.

Thank you once again for your assistance. BTW, I am not using kmc_tools intersection and union for calculating the ratio, which generates unnecessary big files. The python script is what I used for this purpose.

zhaotao1987 avatar Jan 14 '24 15:01 zhaotao1987