Far fewer k-mers loaded from database during new2all

Open nekokoe opened this issue 2 years ago • 2 comments

Hi, I was running kmer-db with a relatively large database, and I ran into a problem that greatly reduced the number of shared k-mers.

The database was first constructed with KMC from all bacterial chromosomes from NCBI, like this:

kmc -k26 -m384 -fm -ci3 -t32 ncbi_complete.no_plasmid.500k.fna ncbi_complete.k26.kmc-db ./kmc_tmp_dir


Stage 1: 100%
Stage 2: 100%
1st stage: 316.529s
2nd stage: 189.729s
Total    : 506.258s
Tmp size : 129684MB

Stats:
   No. of k-mers below min. threshold :  24841702410
   No. of k-mers above max. threshold :            0
   No. of unique k-mers               :  29445663938
   No. of unique counted k-mers       :   4603961528
   Total no. of k-mers                : 128186981553
   Total no. of sequences             :        33304
   Total no. of super-k-mers          :  12865138640

and then passed to kmer-db with the "-from-kmers" option, like this:

kmer-db build -from-kmers k26.sample.list ncbi_complete.k26.kmer-db

Kmer-db version 1.9.2 (16.08.2021)
S. Deorowicz, A. Gudys, M. Dlugosz, M. Kokot, and A. Danek (c) 2018

Analysis started at Tue Jun 7 12:42:10 2022

Database building mode (from k-mers)
Processing samples...
1/1

EXECUTION TIMES
Total: 110.854
Kmer sorting/unique time: 12.2476
Database update time:8.48539
Hashatable processing (parallel): 5.42932
Resize: 3.53735
Find'n'add: 1.74812
Sort time (parallel): 0.96247
Pattern extension time (parallel): 1.98086

STATISTICS
Number of samples: 1
Number of patterns: 2 (48 B)
Number of k-mers: 308,994,232
K-mer length: 26
Minhash fraction: 1
Workers count: 256

Serializing database...
Storing k-mer hashtables (raw)...
1048576/1048576 hashtables stored in 6.41024 s
Storing patterns...
2/2 patterns stored in 4.67e-06 s

Releasing memory...OK (0.320644 seconds)

Analysis finished at Tue Jun 7 12:44:08 2022

Then I ran a test with one of the sequences used to build the database (so the test sequence should, in theory, be fully contained in it):

kmer-db new2all -t 32 -multisample-fasta ncbi_complete.k26.kmer-db sample.list.head sample.list.head.kmer-db.common_table

Kmer-db version 1.9.2 (16.08.2021)
S. Deorowicz, A. Gudys, M. Dlugosz, M. Kokot, and A. Danek (c) 2018

Analysis started at Tue Jun 7 12:47:36 2022

Set of new samples (from fasta genomes) versus entire database comparison
Loading k-mer database ncbi_complete.k26.kmer-db...
Loading k-mer hashtables (raw)...
1048576/1048576 hashtables loaded in 29.5659 s
Loading patterns...
2/2 patterns loaded in 1.0741e-05 s
OK (31.3595 seconds)
Number of samples: 1
Number of patterns: 2 (0 B)
Number of k-mers: 308,994,232
K-mer length: 26
Minhash fraction: 1
Workers count: 32

Storing matrix of common k-mers in sample.list.head.kmer-db.common_table
Loading queries...
Processing queries... 2060...

EXECUTION TIMES Total: 5.74091

Analysis finished at Tue Jun 7 12:48:14 2022

Finally, I got this in common_table:

kmer-length: 26 fraction: 1 ,db-samples ,ncbi_complete.k26.kmc-db,
query-samples,total-kmers,4603961528,
CP069828.1_sliding:1-200000,198481,10408,
CP069828.1_sliding:50001-250000,198906,11280,
CP069828.1_sliding:100001-300000,198905,11184,
CP069828.1_sliding:150001-350000,197010,12038,
CP069828.1_sliding:200001-400000,198079,12318,
CP069828.1_sliding:250001-450000,199966,11886,
CP069828.1_sliding:300001-500000,198072,11995,
......

The result is evidently incorrect, since the shared fraction is expected to be 100%. I don't understand why kmer-db used so many fewer k-mers (308,994,232), while the full total (4,603,961,528) still shows up in the common_table output.

Is there a workaround for this?
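To put a number on how far the result is from 100%, here is a minimal Python sketch that computes the shared fraction for a few rows copied from the table above. It assumes that the second column is the number of k-mers in the query and the third is the number shared with the database sample; the function name `shared_fractions` is made up for this illustration.

```python
# Rows copied from the common_table output in this post.
# Assumed column meaning: query name, k-mers in query, k-mers shared with the db sample.
ROWS = """\
CP069828.1_sliding:1-200000,198481,10408,
CP069828.1_sliding:50001-250000,198906,11280,
CP069828.1_sliding:100001-300000,198905,11184,
"""


def shared_fractions(text):
    """Return {query_name: shared / query_kmers} for every parsable data row."""
    result = {}
    for line in text.splitlines():
        # Drop the empty field produced by the trailing comma.
        fields = [f for f in line.strip().split(",") if f]
        if len(fields) != 3:
            continue
        try:
            query_kmers, shared = int(fields[1]), int(fields[2])
        except ValueError:
            continue  # header rows carry text in these columns
        if query_kmers > 0:
            result[fields[0]] = shared / query_kmers
    return result


if __name__ == "__main__":
    for name, frac in shared_fractions(ROWS).items():
        print(f"{name}\t{frac:.1%}")
```

Under that reading of the columns, each window shares only a few percent of its k-mers with the database instead of all of them.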

Many thanks!

nekokoe, Jun 07 '22 04:06

@nekokoe Thanks for reporting. I'll have a look at that!

agudys, Jun 08 '22 06:06

@nekokoe Sorry for the delay. As far as I can see, you built a single KMC database from all the genomes. When given to kmer-db, this database is treated as a single sample with 4,603,961,528 k-mers. However, the implementation limits the number of k-mers in a sample to 2^32 (4,294,967,296), which caused the incorrect result you obtained.

Since kmer-db currently does not support such large samples, I suggest splitting your input into parts. In particular, you could create a separate KMC database for each genome, so that you get the number of common k-mers for each sample separately (maybe that was your intention from the start?).
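As a quick arithmetic check of the numbers in this thread, the small Python sketch below compares the KMC count with the count kmer-db reported. It only shows that the two figures differ by exactly 2^32, which would be consistent with a 32-bit counter wrapping around; that mechanism is an assumption here, not something the thread confirms. The helper name `wrapped` is invented for the example.

```python
LIMIT = 2 ** 32  # per-sample k-mer limit quoted in this thread

kmc_counted = 4_603_961_528    # "No. of unique counted k-mers" in the KMC log
kmerdb_reported = 308_994_232  # "Number of k-mers" in the kmer-db build log


def wrapped(count, limit=LIMIT):
    """Value left over if `count` wraps around a counter of size `limit` (assumed behaviour)."""
    return count % limit


if __name__ == "__main__":
    print(kmc_counted > LIMIT)                      # True
    print(wrapped(kmc_counted))                     # 308994232
    print(wrapped(kmc_counted) == kmerdb_reported)  # True
```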
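One possible way to prepare the split input is sketched below in Python: it writes every record of a multi-FASTA file to its own file, so that each can be counted separately. This is only an illustration under a few assumptions: one FASTA record is treated as one genome (which may not hold for genomes with several chromosomes), the function name `split_multifasta` and the `.fna` output naming are invented here, and records whose headers share the same first token would overwrite each other.

```python
import os
import re


def split_multifasta(fasta_path, out_dir):
    """Write each record of a multi-FASTA file to its own file; return the new paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    try:
        with open(fasta_path) as src:
            for line in src:
                if line.startswith(">"):
                    if out is not None:
                        out.close()
                    # Use the first header token as a file-safe name (hypothetical scheme).
                    tokens = line[1:].split()
                    raw = tokens[0] if tokens else f"record_{len(paths) + 1}"
                    name = re.sub(r"[^A-Za-z0-9_.-]", "_", raw)
                    path = os.path.join(out_dir, name + ".fna")
                    out = open(path, "w")
                    paths.append(path)
                if out is not None:
                    out.write(line)  # lines before the first header are skipped
    finally:
        if out is not None:
            out.close()
    return paths
```

Each resulting file could then be given to KMC on its own, and the per-genome databases listed as separate samples for kmer-db. Note that KMC options chosen for the pooled input (such as the -ci3 minimum count) may need revisiting for single genomes.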

Regards, Adam

agudys, Jul 18 '22 11:07