KMC
kmc_tools filter does not accept large FASTA input
Hello, first of all, thank you for giving us KMC and kmc_tools, which I use frequently. I am trying to use kmc_tools filter (ver. 3.2.1, 2022-01-04) to retrieve the contigs of a genome assembly that contain k-mers from a database. The input to kmc_tools filter is therefore in FASTA format. The file contains multiple FASTA records (hundreds to thousands), and each sequence is on a single line, not "wrapped" across multiple lines. Some sequences are longer than 10 or even 100 megabases, and the entire FASTA file is >1 Gb in size. Neither the input file parameter -fa nor the undocumented -fm behaves as the help message suggests; I always get an
"Error: Wrong input file!"
Edit: this seems to be specific to very long sequences, in both FASTA and FASTQ format; the command succeeds when the sequences are only tens of kb long. Converting my genome contigs to FASTQ format does not help.
Many thanks and best regards, Mathias
Hi, thank you for using KMC and for reporting this issue. I suspect something is wrong with how kmc_tools handles long sequences. I will try to take a look. It would be really helpful if you could share some of the input files that cause this.
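If the original assembly cannot be shared, a synthetic file may be enough to reproduce the error. Below is a minimal Python sketch, assuming the failure depends only on record length and not on sequence content (this has not been confirmed in this thread). The function names, the output path, and the record lengths are hypothetical and chosen only for illustration.

```python
import random


def write_synthetic_fasta(path, lengths, seed=0, alphabet="ACGT"):
    """Write one single-line (unwrapped) FASTA record per entry in `lengths`.

    Assumption: random sequence is good enough to trigger the problem;
    only the record length is meant to mimic the real input.
    """
    rng = random.Random(seed)
    chunk = 1_000_000  # build each sequence in pieces to bound memory use
    with open(path, "w") as out:
        for i, length in enumerate(lengths, start=1):
            out.write(f">synthetic_contig_{i} length={length}\n")
            remaining = length
            while remaining > 0:
                n = min(chunk, remaining)
                out.write("".join(rng.choices(alphabet, k=n)))
                remaining -= n
            out.write("\n")  # end the record; the sequence stays on one line


def record_lengths(path):
    """Return the sequence length of every record in a FASTA file."""
    lengths = []
    with open(path) as fh:
        for line in fh:
            if line.startswith(">"):
                lengths.append(0)  # start counting a new record
            elif lengths:
                lengths[-1] += len(line.rstrip("\n"))
    return lengths


if __name__ == "__main__":
    # Hypothetical sizes: one short record and one record above 10 megabases.
    write_synthetic_fasta("synthetic_long.fa", [50_000, 15_000_000])
    print(record_lengths("synthetic_long.fa"))
```

Running the same kmc_tools filter command on such a file with a few different maximum lengths could help narrow down the record length at which "Error: Wrong input file!" starts to appear, and the small script can be attached to the issue instead of a multi-gigabyte assembly.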