How to run KMC on an assembled genome (FASTA)

Open Aannaw opened this issue 3 years ago • 21 comments

Hello, I am confused about the command for counting k-mers with KMC on an assembled genome (FASTA). I could not find an example. My command is kmc -k21 -ci0 -t40 -m20 -fa a.fasta ./tmp. No error is reported, but the program crashes. Looking forward to your reply. Thanks very much.

Aannaw avatar Dec 14 '21 07:12 Aannaw

Hi,

there is also the -fm switch for multi-FASTA input (FASTA where a sequence may span multiple lines). Let me know if it helps.

marekkokot avatar Dec 14 '21 07:12 marekkokot

I only have one assembled genome. I actually want to assess my assembly after running purge_dups, and also to compare the k-mer counts of the assembly with those of Illumina short reads. I ran -fm with only one FASTA file, and it did not seem to help.

Aannaw avatar Dec 14 '21 08:12 Aannaw

I don't know the purge_dups tool. You may count k-mers in multiple files. Assume you have a bunch of multi-FASTA files. Create a file files.txt in which each line stores the path to one of the multi-FASTA files. For example:

file1.fa
file2.fa

You may run kmc as follows:

kmc -k21 -ci1 -t40 -fm @files.txt 21mers .

Does it help?
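
As a minimal sketch of the list-file approach described above: the kmc_demo directory, the two toy FASTA files, and the kmc_tmp working directory are made-up names for illustration, and only the kmc options come from this thread. The kmc call is skipped when kmc is not found on PATH.

```shell
# Illustrative sketch; kmc_demo/, the toy sequences and kmc_tmp are assumptions.
mkdir -p kmc_demo/kmc_tmp

# Two toy multi-FASTA files (each sequence is split across two lines).
printf '>seq1\nACGTACGTACGTACG\nTACGTACGTACGTAC\n' > kmc_demo/file1.fa
printf '>seq2\nTTGACCATTGACCAT\nTGACCATTGACCATT\n' > kmc_demo/file2.fa

# The list file holds one input path per line.
printf '%s\n' kmc_demo/file1.fa kmc_demo/file2.fa > kmc_demo/files.txt

# Argument order: <@input_list> <output_prefix> <working_directory>.
if command -v kmc >/dev/null 2>&1; then
    kmc -k21 -ci1 -fm @kmc_demo/files.txt kmc_demo/21mers kmc_demo/kmc_tmp
else
    echo "kmc not found; skipping the counting step"
fi
```

The working directory has to exist before kmc starts, which is why it is created first.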

marekkokot avatar Dec 14 '21 08:12 marekkokot

I created a file a.txt that lists only one FASTA file, a.fasta. Then I ran:

kmc -k21 -ci1 -t40 -fm @a.txt tmp

The standard output is:

K-Mer Counter (KMC) ver. 3.1.0 (2018-05-10)
Usage:
kmc [options] <input_file_name> <output_file_name> <working_directory>
kmc [options] <@input_file_names> <output_file_name> <working_directory>
Parameters:
input_file_list - file name with list of input files in specified (-f switch) format (gziped or not)
Options:
-v - verbose mode (shows all parameter settings); default: false
-k - k-mer length (k from 1 to 256; default: 25)
-m - max amount of RAM in GB (from 1 to 1024); default: 12
-d - trimmed-off bases; default: 0
-sm - use strict memory mode (memory limit from -m switch will not be exceeded)
-p - signature length (5, 6, 7, 8, 9, 10, 11); default: 9
-f<a/q/m/bam> - input in FASTA format (-fa), FASTQ format (-fq), multi FASTA (-fm) or BAM (-fbam); default: FASTQ
-ci - exclude k-mers occurring less than <value> times (default: 2)
-cs - maximal value of a counter (default: 255)
-cx - exclude k-mers occurring more of than <value> times (default: 1e9)
-b - turn off transformation of k-mers into canonical form
-r - turn on RAM-only mode
-n - number of bins
-t - total number of threads (default: no. of CPU cores)
-sf - number of FASTQ reading threads
-sp - number of splitting threads
-sr - number of threads for 2nd stage
-j<file_name> - file name with execution summary in JSON format
-w - without output
Example:
kmc -k27 -m24 files.lst NA.res /data/kmc_tmp_dir/

No file is created and no error message is shown.

Aannaw avatar Dec 14 '21 09:12 Aannaw

You got this message:

Usage:
kmc [options] <input_file_name> <output_file_name> <working_directory>
kmc [options] <@input_file_names> <output_file_name> <working_directory>

You are missing the output_file_name in your command line. Use:

kmc -k21 -ci1 -t40 -fm @a.txt output tmp
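
The mistake above is easy to make because kmc prints only its usage text when a positional argument is missing. A hedged sketch of a pre-flight check follows: check_kmc_args is a hypothetical helper, not part of KMC, and it assumes that every option starts with "-" and carries its value attached (as in -k21 or -j<file_name> in the help text), so anything else is positional.

```shell
# check_kmc_args: hypothetical helper, not part of KMC.
# Succeeds only if exactly three positional arguments are present:
# <input or @list> <output_file_name> <working_directory>.
check_kmc_args() {
    positional=0
    for arg in "$@"; do
        case "$arg" in
            -*) ;;                                   # option, value attached
            *)  positional=$((positional + 1)) ;;    # positional argument
        esac
    done
    if [ "$positional" -ne 3 ]; then
        echo "expected <input> <output_file_name> <working_directory>, got $positional positional argument(s)" >&2
        return 1
    fi
}

# The failing command from this thread has only two positional arguments.
check_kmc_args -k21 -ci1 -t40 -fm @a.txt tmp || echo "rejected"

# The corrected command has three.
check_kmc_args -k21 -ci1 -t40 -fm @a.txt output tmp && echo "accepted"
```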

marekkokot avatar Dec 14 '21 09:12 marekkokot

It works! Thanks very much. May I ask another question? For Illumina paired-end short reads (a.1.fq, a.2.fq), should I count k-mers by creating a file a.fq.txt that lists a.1.fq and a.2.fq, and then running "kmc -k21 -ci1 -t40 -fq @a.fq.txt out tmp"? Does it output the k-mers common to the two paired read files?

Aannaw avatar Dec 14 '21 09:12 Aannaw

It will count each k-mer present in at least one of the input files. For sequencing reads, one should probably set some reasonable cutoff (-ci) to remove erroneous k-mers.
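
To contrast the two cases, here is a sketch under stated assumptions: it assumes kmc_tools from the same KMC release is available, and reads1, reads2 and common are hypothetical output prefixes. The -ci2 value is simply the default shown in the help text, not a recommendation for any particular data set. Commands are only printed unless DRY_RUN=0 is set.

```shell
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 to actually run them.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

run mkdir -p tmp

# Union: list both read files, so a k-mer is counted if it occurs in either file.
# -ci2 drops k-mers whose total count is below 2.
printf '%s\n' a.1.fq a.2.fq > a.fq.txt
run kmc -k21 -ci2 -t40 -fq @a.fq.txt out tmp

# Intersection: count each file separately, then keep k-mers present in both.
run kmc -k21 -ci2 -t40 -fq a.1.fq reads1 tmp
run kmc -k21 -ci2 -t40 -fq a.2.fq reads2 tmp
run kmc_tools simple reads1 reads2 intersect common
```

In the union case the cutoff applies to the combined count across both files, while in the second case it applies to each file on its own before the intersection is taken.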

marekkokot avatar Dec 14 '21 10:12 marekkokot

That is very helpful! Thanks very much.

Aannaw avatar Dec 14 '21 10:12 Aannaw

No problem. I'm closing this issue. You may reopen if needed.

marekkokot avatar Dec 14 '21 11:12 marekkokot

Hi @marekkokot, I have the very same issue. No matter what combination of parameters I use, I always get a segfault. For example:

./kmc -v -fm -k31 -ci0 -m2 -t1 -sm ecoli1.fasta ecoli1.kmc kmc_tmp_dir

Why?

jermp avatar Dec 16 '21 08:12 jermp

Hi,

I don't think it is the very same issue. It looks much worse. Did you download kmc from the release page, install it from bioconda, or compile it on your own? Let me know. Also, could you please send me your input file, i.e. ecoli1.fasta?

marekkokot avatar Dec 16 '21 09:12 marekkokot

Hi, I cloned the repo from here (GitHub) and then compiled it on my machine. Compilation works fine. Here is the file attached (it is a tiny file).

ecoli1.fasta.gz

jermp avatar Dec 16 '21 09:12 jermp

These are my commands:

./kmc -v -fm -k31 -ci0 -t1 ecoli1.fasta ecoli1.kmc kmc_tmp_dir
./kmc -v -fm -k31 -ci0 -t1 @list.txt ecoli1.kmc kmc_tmp_dir/

where list.txt currently contains only the file path of that ecoli1.fasta file.

jermp avatar Dec 16 '21 09:12 jermp

It works on my machine. What is your operating system and compiler? And maybe what is your hardware? Just to be sure, do you have kmc_tmp_dir created?

marekkokot avatar Dec 16 '21 09:12 marekkokot

I'm running gcc on Ubuntu: gcc version 11.2.0 (Ubuntu 11.2.0-7ubuntu2). I've also tried the release commit (b7de846829f7d8cfd18a3d1285deba6ee8ceffc2) but nothing changes. Of course, I have the tmp directory created.

jermp avatar Dec 16 '21 09:12 jermp

Ok, this is weird :( Could you please try the precompiled release? You may also try removing the -static flag from the makefile, and also the -Wl,--whole-archive and -Wl,--no-whole-archive flags.

marekkokot avatar Dec 16 '21 09:12 marekkokot

I tried another machine of mine (Ubuntu again with gcc) and actually it worked. Very strange indeed. Everything else works correctly on the previous machine.

jermp avatar Dec 16 '21 09:12 jermp

It may be hard for me to track the cause when I am not able to reproduce the error. If you have some time maybe try to run kmc under gdb (some changes in makefile may be needed) to see where it crashes. Maybe, for some strange reason, kmc cannot allocate memory? How much memory does your machine have?
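
One possible way to capture where the crash happens, sketched under assumptions: kmc_backtrace is a hypothetical wrapper, and the binary is assumed to have been rebuilt with debug symbols (for example by adding -g to the compiler flags in the makefile, which is a guess about the needed change). It prints the gdb command by default and runs it only with DRY_RUN=0.

```shell
# kmc_backtrace: hypothetical wrapper, not part of KMC.
# Dry run by default; set DRY_RUN=0 to execute.
kmc_backtrace() {
    # -batch makes gdb exit when done; "run" starts the program and "bt"
    # prints the stack if it stopped on a signal such as SIGSEGV.
    set -- gdb -batch -ex run -ex bt --args "$@"
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# The command line reported earlier in this thread.
kmc_backtrace ./kmc -v -fm -k31 -ci0 -t1 ecoli1.fasta ecoli1.kmc kmc_tmp_dir
```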

marekkokot avatar Dec 16 '21 09:12 marekkokot

My machines have 128GB of RAM :) Also, why not include some examples in the readme? I see a lot of people got confused or have no idea about how to run this tool. For example, I now have these two files:

ecoli1.kmc.kmc_pre
ecoli1.kmc.kmc_suf

which one should I use?

jermp avatar Dec 16 '21 09:12 jermp

Ok, so this is not out of memory :) Strange :( Thanks for the suggestion. We indeed need to improve the readme. Some examples are given in the command line help. I didn't realize a lot of people got confused. This is bad. I thought the opposite is true.

Regarding the kmc_pre and kmc_suf files: you should use both, because the KMC output is split into two files. Alternatively, you could set the output format to KFF, which would be a single file, but probably a larger one.
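
In practice, downstream tools are given the common prefix rather than either file. A minimal sketch, assuming kmc_tools is installed alongside kmc; kmc_db_complete is a hypothetical helper and ecoli1.kmc.txt is a made-up name for the text dump:

```shell
# kmc_db_complete: hypothetical helper; succeeds only if both halves exist.
kmc_db_complete() {
    [ -f "$1.kmc_pre" ] && [ -f "$1.kmc_suf" ]
}

# The prefix is what was passed as <output_file_name> on the kmc command line.
db=ecoli1.kmc

if kmc_db_complete "$db"; then
    # Dump the database as text: each line holds a k-mer and its count.
    kmc_tools transform "$db" dump "$db.txt"
else
    echo "incomplete database: need both $db.kmc_pre and $db.kmc_suf" >&2
fi
```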

marekkokot avatar Dec 16 '21 09:12 marekkokot

Ok thanks!

jermp avatar Dec 16 '21 10:12 jermp