easy way to dump kmers with certain prefixes?

Open chunlinxiao opened this issue 6 years ago • 6 comments

Hi,

is there any fast way that kmc just dump kmers with certain prefixes, other than looking through the whole kmc database?

thanks

chunlin

chunlinxiao avatar Sep 14 '18 17:09 chunlinxiao

Hi,

It would be possible, but our software does not currently support such an option. You could implement it yourself, although it is not trivial. K-mers in a KMC database (*.kmc_pre and *.kmc_suf files) are distributed among a number of bins (512 in the default mode). Each bin contains its k-mers in sorted order (they are stored with an additional lookup table (LUT) to save space and speed up searching), so a binary search can be used to find the boundaries of the k-mers with a specified prefix in each bin. In some cases, however, k-mers are not distributed to bins: when the database is the output of most kmc_tools operations, or when kmc was run with a small k (about 14 or less). A value stored in the database indicates which case applies.
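To illustrate the general idea of a prefix lookup table, here is a minimal sketch. It is not KMC's actual on-disk layout (the linked format documents describe that); the function names `encode_prefix` and `build_lut` and the use of plain strings are assumptions made for illustration only. For each possible prefix of length p, the table stores the index of the first k-mer in a sorted list that starts with that prefix, so the k-mers with prefix code c occupy the half-open range [lut[c], lut[c + 1]).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: encode the first p bases of s with 2 bits per base
// (A=0, C=1, G=2, T=3). Assumes s has at least p characters, all in ACGT.
std::uint32_t encode_prefix(const std::string& s, std::size_t p) {
    std::uint32_t code = 0;
    for (std::size_t i = 0; i < p; ++i) {
        std::uint32_t v = 0;
        switch (s[i]) {
            case 'A': v = 0; break;
            case 'C': v = 1; break;
            case 'G': v = 2; break;
            case 'T': v = 3; break;
        }
        code = (code << 2) | v;
    }
    return code;
}

// Hypothetical helper: build a table with 4^p + 1 entries over k-mers that
// are already sorted lexicographically. lut[c] is the index of the first
// k-mer whose prefix code is c; lut[4^p] is the total number of k-mers.
std::vector<std::size_t> build_lut(const std::vector<std::string>& sorted_kmers,
                                   std::size_t p) {
    const std::size_t n_prefixes = std::size_t(1) << (2 * p);
    std::vector<std::size_t> lut(n_prefixes + 1, 0);
    // Count the k-mers per prefix, shifted by one slot.
    for (const auto& kmer : sorted_kmers)
        ++lut[encode_prefix(kmer, p) + 1];
    // Turn the counts into starting offsets with a running sum.
    for (std::size_t i = 1; i <= n_prefixes; ++i)
        lut[i] += lut[i - 1];
    return lut;
}
```

With such a table, looking up a prefix of exactly p bases needs no search at all, and a longer prefix only requires a binary search inside the range that the table returns.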

You can always convert the more complex case (k-mers distributed to bins) to the simpler one, in which all k-mers are sorted, using the kmc_tools sort operation.
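Once all k-mers are in one sorted sequence, the boundaries of a prefix can be found with two binary searches. The following is only a sketch under simplifying assumptions: the k-mers are held in memory as a lexicographically sorted vector of strings rather than read from the KMC binary format, and `prefix_range` is a hypothetical name, not part of kmc_api.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: return the half-open index range [first, last) of the
// k-mers in sorted_kmers that start with prefix. The range is empty
// (first == last) when no k-mer has that prefix.
std::pair<std::size_t, std::size_t>
prefix_range(const std::vector<std::string>& sorted_kmers,
             const std::string& prefix) {
    const std::size_t p = prefix.size();
    // First k-mer whose leading p characters are not less than the prefix.
    auto lo = std::lower_bound(
        sorted_kmers.begin(), sorted_kmers.end(), prefix,
        [p](const std::string& kmer, const std::string& pre) {
            return kmer.compare(0, p, pre) < 0;
        });
    // First k-mer whose leading p characters are greater than the prefix.
    auto hi = std::upper_bound(
        lo, sorted_kmers.end(), prefix,
        [p](const std::string& pre, const std::string& kmer) {
            return kmer.compare(0, p, pre) > 0;
        });
    const auto first = static_cast<std::size_t>(lo - sorted_kmers.begin());
    const auto last = static_cast<std::size_t>(hi - sorted_kmers.begin());
    return std::make_pair(first, last);
}
```

Both searches take O(log n) comparisons, so only the matching k-mers need to be read and printed afterwards. In a database that is still split into bins, the same search would have to be repeated in every bin.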

The binary format of the database is described in this document: https://github.com/refresh-bio/KMC/blob/master/API.pdf In the case when all k-mers are sorted and not distributed to bins the database format is described here: https://static-content.springer.com/esm/art%3A10.1186%2F1471-2105-14-160/MediaObjects/12859_2012_5911_MOESM1_ESM.pdf

If you decide to implement it, you may use the kmc_dump source code as a reference, because it supports both database formats. kmc_dump is built on kmc_api, whose source code is available here: https://github.com/refresh-bio/KMC/tree/master/kmc_api

You may also use the kmc_tools source code as a reference. It is more complex and will take more time to work through, but the kmc_tools dump operation is a little faster than kmc_dump.

If you have any questions, do not hesitate to ask.

We will consider adding support for a dump operation with a specified prefix, but unfortunately, even if we decide to implement it, it will certainly not be in the near future.

marekkokot avatar Sep 18 '18 09:09 marekkokot

thank you marekkokot very much for the detailed reply.

I just installed your kmc3.1.1 successfully from source. But when I tried to compile kmc_dump_sample.cpp in the kmc_dump_sample directory using g++ (5.4.0), I got the following error:

g++ kmc_dump_sample.cpp

In file included from kmc_dump_sample.cpp:16:0:
../kmc_api/kmc_file.h:90:8: error: expected nested-name-specifier before ‘super_kmers_t’
  using super_kmers_t = std::vector<std::tuple<uint32, uint32, uint32>>;//start_pos, len, bin_n
        ^
../kmc_api/kmc_file.h:91:58: error: ‘super_kmers_t’ has not been declared
  void GetSuperKmers(const std::string& transformed_read, super_kmers_t& super_kmers);

did I miss anything?

thanks

chunlinxiao avatar Sep 18 '18 14:09 chunlinxiao

Hi, you probably also need to specify -std=c++11, but there may still be some missing references. Try replacing the content of the makefile in the main directory with:

all: kmc
	
KMC_BIN_DIR = bin
KMC_MAIN_DIR = kmer_counter
KMC_API_DIR = kmc_api
KMC_DUMP_DIR = kmc_dump
KMC_DUMP_SAMPLE_DIR = kmc_dump_sample
KMC_TOOLS_DIR = kmc_tools

CC 	= g++
CFLAGS	= -Wall -O3 -m64 -static -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -std=c++11 
CLINK	= -lm -static -O3 -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -std=c++11 

KMC_TOOLS_CFLAGS	= -Wall -O3 -m64 -static -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -std=c++14
KMC_TOOLS_CLINK	= -lm -static -O3 -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -std=c++14

DISABLE_ASMLIB = false

KMC_OBJS = \
$(KMC_MAIN_DIR)/kmer_counter.o \
$(KMC_MAIN_DIR)/mmer.o \
$(KMC_MAIN_DIR)/mem_disk_file.o \
$(KMC_MAIN_DIR)/rev_byte.o \
$(KMC_MAIN_DIR)/bkb_writer.o \
$(KMC_MAIN_DIR)/cpu_info.o \
$(KMC_MAIN_DIR)/bkb_reader.o \
$(KMC_MAIN_DIR)/fastq_reader.o \
$(KMC_MAIN_DIR)/timer.o \
$(KMC_MAIN_DIR)/develop.o \
$(KMC_MAIN_DIR)/kb_completer.o \
$(KMC_MAIN_DIR)/kb_storer.o \
$(KMC_MAIN_DIR)/kmer.o \
$(KMC_MAIN_DIR)/prob_qual.o
RADULS_OBJS = \
$(KMC_MAIN_DIR)/raduls_sse2.o \
$(KMC_MAIN_DIR)/raduls_sse41.o \
$(KMC_MAIN_DIR)/raduls_avx2.o \
$(KMC_MAIN_DIR)/raduls_avx.o 

KMC_LIBS = \
$(KMC_MAIN_DIR)/libs/libz.a \
$(KMC_MAIN_DIR)/libs/libbz2.a

KMC_DUMP_OBJS = \
$(KMC_DUMP_DIR)/nc_utils.o \
$(KMC_DUMP_DIR)/kmc_dump.o 

KMC_DUMP_SAMPLE_OBJS = \
$(KMC_DUMP_SAMPLE_DIR)/kmc_dump_sample.o

KMC_API_OBJS = \
$(KMC_API_DIR)/mmer.o \
$(KMC_API_DIR)/kmc_file.o \
$(KMC_API_DIR)/kmer_api.o

KMC_TOOLS_OBJS = \
$(KMC_TOOLS_DIR)/kmc_header.o \
$(KMC_TOOLS_DIR)/kmc_tools.o \
$(KMC_TOOLS_DIR)/nc_utils.o \
$(KMC_TOOLS_DIR)/parameters_parser.o \
$(KMC_TOOLS_DIR)/parser.o \
$(KMC_TOOLS_DIR)/tokenizer.o \
$(KMC_TOOLS_DIR)/fastq_filter.o \
$(KMC_TOOLS_DIR)/fastq_reader.o \
$(KMC_TOOLS_DIR)/fastq_writer.o \
$(KMC_TOOLS_DIR)/percent_progress.o

KMC_TOOLS_LIBS = \
$(KMC_TOOLS_DIR)/libs/libz.a \
$(KMC_TOOLS_DIR)/libs/libbz2.a 

ifeq ($(DISABLE_ASMLIB),true)
	CFLAGS += -DDISABLE_ASMLIB
	KMC_TOOLS_CFLAGS += -DDISABLE_ASMLIB
else
	KMC_LIBS += \
	$(KMC_MAIN_DIR)/libs/libaelf64.a 
	KMC_TOOLS_LIBS += \
	$(KMC_TOOLS_DIR)/libs/libaelf64.a 
endif 	

$(KMC_OBJS) $(KMC_DUMP_OBJS) $(KMC_API_OBJS) $(KMC_DUMP_SAMPLE_OBJS): %.o: %.cpp
	$(CC) $(CFLAGS) -c $< -o $@

$(KMC_TOOLS_OBJS): %.o: %.cpp
	$(CC) $(KMC_TOOLS_CFLAGS) -c $< -o $@

$(KMC_MAIN_DIR)/raduls_sse2.o: $(KMC_MAIN_DIR)/raduls_sse2.cpp
	$(CC) $(CFLAGS) -msse2 -c $< -o $@
$(KMC_MAIN_DIR)/raduls_sse41.o: $(KMC_MAIN_DIR)/raduls_sse41.cpp
	$(CC) $(CFLAGS) -msse4.1 -c $< -o $@
$(KMC_MAIN_DIR)/raduls_avx.o: $(KMC_MAIN_DIR)/raduls_avx.cpp
	$(CC) $(CFLAGS) -mavx -fabi-version=0 -c $< -o $@
$(KMC_MAIN_DIR)/raduls_avx2.o: $(KMC_MAIN_DIR)/raduls_avx2.cpp
	$(CC) $(CFLAGS) -mavx2 -mfma -fabi-version=0 -c $< -o $@
$(KMC_MAIN_DIR)/instrset_detect.o: $(KMC_MAIN_DIR)/libs/vectorclass/instrset_detect.cpp
	$(CC) $(CFLAGS) -c $< -o $@
	
kmc: $(KMC_OBJS) $(RADULS_OBJS) $(KMC_MAIN_DIR)/instrset_detect.o 
	-mkdir -p $(KMC_BIN_DIR)
	$(CC) $(CLINK) -o $(KMC_BIN_DIR)/$@ $^ $(KMC_LIBS)
	
kmc_dump: $(KMC_DUMP_OBJS) $(KMC_API_OBJS)
	-mkdir -p $(KMC_BIN_DIR)
	$(CC) $(CLINK) -o $(KMC_BIN_DIR)/$@ $^

kmc_dump_sample: $(KMC_DUMP_SAMPLE_OBJS) $(KMC_API_OBJS)
	-mkdir -p $(KMC_BIN_DIR)
	$(CC) $(CLINK) -o $(KMC_BIN_DIR)/$@ $^
	
kmc_tools: $(KMC_TOOLS_OBJS) $(KMC_API_OBJS)
	-mkdir -p $(KMC_BIN_DIR)
	$(CC) $(KMC_TOOLS_CLINK) -o $(KMC_BIN_DIR)/$@ $^ $(KMC_TOOLS_LIBS)
	
clean:
	-rm $(KMC_MAIN_DIR)/*.o
	-rm $(KMC_API_DIR)/*.o
	-rm $(KMC_DUMP_DIR)/*.o
	-rm $(KMC_TOOLS_DIR)/*.o
	-rm $(KMC_DUMP_SAMPLE_DIR)/*.o
	-rm -rf bin
	


all: kmc kmc_dump kmc_tools kmc_dump_sample

and then run:

make kmc_dump_sample

Let me know if it helps.

marekkokot avatar Sep 18 '18 18:09 marekkokot

thank you very much - your new makefile works !

Before this, I did try $ g++ kmc_dump_sample.cpp -std=c++11

but it failed with "error: ld returned 1 exit status".

I also tried the following with additional options (from your makefile) and got similar errors, shown below. Do you have any suggestions for compiling from the command line?

$ g++ kmc_dump_sample.cpp -lm -static -O3 -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -std=c++11 -o test

kmc_dump_sample.cpp:(.text.startup+0x69): undefined reference to `CKMCFile::CKMCFile()'
kmc_dump_sample.cpp:(.text.startup+0x18c): undefined reference to `CKMCFile::~CKMCFile()'
kmc_dump_sample.cpp:(.text.startup+0x265): undefined reference to `CKMCFile::OpenForListing(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)'
kmc_dump_sample.cpp:(.text.startup+0x2b5): undefined reference to `CKMCFile::Info(unsigned int&, unsigned int&, unsigned int&, unsigned int&, unsigned int&, unsigned int&, unsigned long long&, unsigned long long&)'
kmc_dump_sample.cpp:(.text.startup+0x39e): undefined reference to `CKMCFile::ReadNextKmer(CKmerAPI&, float&)'
kmc_dump_sample.cpp:(.text.startup+0x436): undefined reference to `CKmerAPI::char_codes'
kmc_dump_sample.cpp:(.text.startup+0x479): undefined reference to `CKmerAPI::char_codes'
kmc_dump_sample.cpp:(.text.startup+0x49c): undefined reference to `CKmerAPI::char_codes'
kmc_dump_sample.cpp:(.text.startup+0x4c0): undefined reference to `CKmerAPI::char_codes'
kmc_dump_sample.cpp:(.text.startup+0x539): undefined reference to `CKmerAPI::char_codes'
/tmp/ccwLOWWX.o:kmc_dump_sample.cpp:(.text.startup+0x55c): more undefined references to `CKmerAPI::char_codes' follow
/tmp/ccwLOWWX.o: In function `main':
kmc_dump_sample.cpp:(.text.startup+0x5df): undefined reference to `CKMCFile::ReadNextKmer(CKmerAPI&, unsigned int&)'
kmc_dump_sample.cpp:(.text.startup+0x66e): undefined reference to `CKmerAPI::char_codes'
kmc_dump_sample.cpp:(.text.startup+0x6f8): undefined reference to `CKMCFile::Close()'
kmc_dump_sample.cpp:(.text.startup+0x740): undefined reference to `CKMCFile::SetMaxCount(unsigned int)'
kmc_dump_sample.cpp:(.text.startup+0x774): undefined reference to `CKMCFile::SetMinCount(unsigned int)'
kmc_dump_sample.cpp:(.text.startup+0x7c5): undefined reference to `CKMCFile::~CKMCFile()'
collect2: error: ld returned 1 exit status

chunlinxiao avatar Sep 18 '18 19:09 chunlinxiao

This is because your command line does not compile all the necessary cpp files, so the linker cannot find the kmc_api definitions. If you really need to compile this way (which I do not recommend), use:

g++ -O3 -std=c++14 kmc_dump_sample.cpp ../kmc_api/*.cpp

Let me know if it works.

marekkokot avatar Sep 18 '18 19:09 marekkokot

yes it works fine - thank you very much Marek !

chunlinxiao avatar Sep 18 '18 19:09 chunlinxiao