
option to store min or max position of kmer in input sequences, instead of the kmer count

Open notestaff opened this issue 6 years ago • 0 comments

It'd be great to add an option to store for each kmer, instead of its count, the minimum or maximum position of that kmer in the input sequences. This would help with two distinct use cases:

  1. Validating kmers: when the input has many duplicate reads (e.g. PCR duplicates), many occurrences of a kmer may all come from copies of the same read, so the kmer's count is not a good indicator that it is solid (represents real sequence). As an additional filter, one could require that a kmer appear across a range of positions in the input reads to be considered solid. To do so, one could build one database of kmers' min positions and another of their max positions, use 'kmc_tools simple counters_subtract' to get the range (max minus min), and filter on that value.

  2. Knowing which region of a genome each kmer comes from: when the input consists of genome sequences of different strains, which are all of roughly the same length, the min and max values give the approximate range of genome location(s) at which the kmer is found. One could then extract from the database kmers occurring in given genomic regions. Or, when filtering reads, one could get reads that contain kmers from a given genomic region.
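To make the request concrete, here is a minimal Python sketch of the bookkeeping both use cases rely on. It is not KMC code: the function names `kmer_position_ranges` and `solid_kmers` and the `min_spread` threshold are hypothetical, positions are 0-based start offsets, and kmers are not canonicalized (no reverse-complement merging), which a real implementation would probably need to handle.

```python
def kmer_position_ranges(reads, k):
    """Map each kmer to (min_pos, max_pos) of its start offset over all reads.

    Hypothetical helper for illustration; positions are 0-based and
    kmers are taken as-is (no reverse-complement canonicalization).
    """
    ranges = {}
    for read in reads:
        for pos in range(len(read) - k + 1):
            kmer = read[pos:pos + k]
            lo, hi = ranges.get(kmer, (pos, pos))
            ranges[kmer] = (min(lo, pos), max(hi, pos))
    return ranges


def solid_kmers(ranges, min_spread):
    """Keep kmers whose position range (max - min) is at least min_spread."""
    return {kmer for kmer, (lo, hi) in ranges.items() if hi - lo >= min_spread}


if __name__ == "__main__":
    # Two identical reads (e.g. PCR duplicates) plus one overlapping read.
    reads = ["ACGTAC", "ACGTAC", "TTACGT"]
    ranges = kmer_position_ranges(reads, k=3)
    print(ranges)
    print(sorted(solid_kmers(ranges, min_spread=1)))
```

In this toy input, a kmer seen only in the duplicated read keeps a range of zero however many copies there are, while kmers that also occur at a different offset in the third read pass the filter.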

For the second use case, it would also be good to accept a multiple alignment in FASTA form: the '-' characters in each sequence would be ignored when forming kmers, but the min and max position recorded for each kmer would be its position in the alignment rather than in the individual sequence.
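A rough Python sketch of the alignment variant, under the same caveats: `aligned_kmer_ranges` is a hypothetical name, the input is assumed to be a list of already-parsed alignment rows of equal length, and a kmer's position is taken to be the 0-based alignment column of its first base.

```python
def aligned_kmer_ranges(alignment, k):
    """Map each kmer to (min_col, max_col) in alignment coordinates.

    Gap characters ('-') are skipped when forming kmers, but the recorded
    position is the alignment column of the kmer's first base.
    Hypothetical helper for illustration, not part of KMC.
    """
    ranges = {}
    for row in alignment:
        # Keep (column, base) pairs for non-gap characters only.
        residues = [(col, base) for col, base in enumerate(row) if base != "-"]
        for i in range(len(residues) - k + 1):
            window = residues[i:i + k]
            kmer = "".join(base for _, base in window)
            col = window[0][0]
            lo, hi = ranges.get(kmer, (col, col))
            ranges[kmer] = (min(lo, col), max(hi, col))
    return ranges


if __name__ == "__main__":
    # Two aligned rows; the kmer ACG spans a gap in the first row.
    print(aligned_kmer_ranges(["AC-GT", "-ACGT"], k=3))
```

Because the columns are shared across rows, the resulting ranges are comparable between strains even when the ungapped sequences differ in length.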

@marekkokot

notestaff, Jun 28 '18 19:06