
Can the database be loaded into memory once to handle multiple paired reads files?

Open maruiqi0710 opened this issue 1 year ago • 21 comments

I need to detect the source of paired reads for a batch of sequencing data, and each time I need to reload the database into memory, the loading speed is very slow. Can the database be loaded into memory once to handle multiple paired reads files?

maruiqi0710 avatar May 19 '23 10:05 maruiqi0710

> I need to detect the source of paired reads for a batch of sequencing data, and each time I need to reload the database into memory, the loading speed is very slow. Can the database be loaded into memory once to handle multiple paired reads files?

Kraken can't do this for you, but if you are on a system that isn't busy, the DB will remain in memory until it's kicked out - so the 2nd and subsequent batches will go much faster. If you're sharing a system with lots of others, though, it might not work that way.
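A minimal way to lean on this page-cache behavior is to stream the DB files through the cache once before the first batch; this is only a sketch, and the DB path is a placeholder:

```shell
#!/bin/sh
# Warm the OS page cache by reading every file in the DB directory once.
# Point the argument at your own Kraken2 DB directory (placeholder below).
warm_cache() {
    db_dir=$1
    for f in "$db_dir"/*; do
        # Reading to /dev/null pulls the file's pages into the page cache;
        # later kraken2 runs then hit RAM instead of disk (until eviction).
        [ -f "$f" ] && cat "$f" > /dev/null
    done
}

# Example (hypothetical path):
# warm_cache /home/user/k2_pluspfp
```

On a busy or memory-constrained system the kernel may still evict these pages, which is the caveat above.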

salzberg avatar May 19 '23 15:05 salzberg

I own this server myself. How can I keep the database from being kicked out of memory?

kraken2=/Software/Bioinfo_Software/kraken2/kraken2
DBNAME=/home/XXX/Software_Database/kraken2/k2_pluspfp
#########################################################################
input_dir=/media/clean_data
out_dir=../1_result
file_name=$(find $input_dir -maxdepth 2 -iname "*.clean.R1.fq.gz")

parallel --jobs 1 --header : --colsep ',' '
    file_pre=$(basename -s .clean.R1.fq.gz {file_name})
    read_1={input_dir}/"$file_pre"/"$file_pre".clean.R1.fq.gz
    read_2={input_dir}/"$file_pre"/"$file_pre".clean.R2.fq.gz

    sub_out_dir={out_dir}/"$file_pre"
    [ -d "$sub_out_dir" ] || mkdir -p "$sub_out_dir"

    out_file="$sub_out_dir"/"$file_pre"_kraken2.txt
    report_file="$sub_out_dir"/"$file_pre"_kraken2_report.txt

    {kraken2} \
        --db {DBNAME} \
        --threads 30 \
        --output "$out_file" \
        --report "$report_file" \
        --use-names \
        --paired \
        "$read_1" "$read_2"
' \
    ::: kraken2 $kraken2 \
    ::: DBNAME $DBNAME \
    ::: input_dir $input_dir \
    ::: out_dir $out_dir \
    ::: file_name $file_name

maruiqi0710 avatar May 19 '23 16:05 maruiqi0710

Use --preload to make sure it all goes into memory the first time. It won't get kicked out.

salzberg avatar May 19 '23 16:05 salzberg

I want to confirm the options again. I added the option --preload:

$kraken2 --db $DBNAME --threads 60 --output "$out_file" \
    --report "$report_file" --use-names --preload $input_file

but it reports "Unknown option: preload".

maruiqi0710 avatar Jun 09 '23 11:06 maruiqi0710

If your system memory is twice as big as your kraken2 db, you can copy the db into /dev/shm and run subsequent kraken2 calls against it. This is the manual way of copying the database to memory.
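A sketch of this manual copy-to-RAM workflow; the on-disk DB path is a placeholder, and the assumption (confirmed later in this thread) is that only the three *.k2d files are needed for classification:

```shell
#!/bin/sh
# Stage a Kraken2 DB into the tmpfs at /dev/shm once, then point every
# subsequent run at the in-RAM copy. Paths below are placeholders.
DB=/path/to/k2_db            # on-disk DB (placeholder)
SHM_DB=/dev/shm/k2_db        # in-RAM copy

stage_db() {
    mkdir -p "$2"
    # Copy only the .k2d files (hash.k2d, opts.k2d, taxo.k2d),
    # which are what kraken2 reads at classification time.
    cp "$1"/*.k2d "$2"/
}

# stage_db "$DB" "$SHM_DB"
# kraken2 --db "$SHM_DB" --threads 30 --report sample.report.txt sample.fastq
```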

slw287r avatar Jun 09 '23 11:06 slw287r

> I want to confirm the options again. I added the option --preload:
>
> $kraken2 --db $DBNAME --threads 60 --output "$out_file" \
>     --report "$report_file" --use-names --preload $input_file
>
> but it reports "Unknown option: preload".

Sorry, that's an option for krakenuniq but not for kraken2. I wasn't clear before.

salzberg avatar Jun 09 '23 11:06 salzberg

Late to the party, but if you copy your DB into /dev/shm/ and run kraken2 with --memory-mapping, then you only need enough RAM to fit your DB, not twice as much.

Make sure to delete the ramdisk DB once done :D
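A sketch of the full batch pattern described above, including the cleanup step; the kraken2 location, the /dev/shm DB path, and the read-file naming are all assumptions:

```shell
#!/bin/sh
# Batch-classify many paired samples against a DB held in /dev/shm, using
# --memory-mapping so kraken2 maps the DB files instead of loading a copy
# into its own memory. All paths and file-name patterns are placeholders.
KRAKEN2=${KRAKEN2:-kraken2}
SHM_DB=${SHM_DB:-/dev/shm/k2_db}

classify_all() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for r1 in "$in_dir"/*.R1.fq.gz; do
        [ -f "$r1" ] || continue
        r2=${r1%.R1.fq.gz}.R2.fq.gz          # mate file (assumed naming)
        base=$(basename "$r1" .R1.fq.gz)
        "$KRAKEN2" --db "$SHM_DB" --memory-mapping --paired \
            --output "$out_dir/$base.kraken2.txt" \
            --report "$out_dir/$base.report.txt" \
            "$r1" "$r2"
    done
}

# classify_all /media/clean_data results
# rm -rf "$SHM_DB"   # free the ramdisk once the whole batch is done
```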

thomcuddihy avatar Jul 12 '23 00:07 thomcuddihy

Is there a reason why --preload isn't available as an option for kraken2?

MADscientist314 avatar Aug 02 '23 20:08 MADscientist314

If your system memory is twice as big as your kraken2 db, you can copy the db into /dev/shm and run subsequent kraken2 calls with it. This is the copy to memory the manual way.

It's such an amazing trick and has saved me tons of time. Thanks!!! I hope this tip can be added to the official manual.

HuoJnx avatar Jan 22 '24 04:01 HuoJnx

I also need to find a way to preload the database(s) into memory or swap. Whatever I do, kraken2 always loads the database into memory first. --memory-mapping takes a very long time, so I can't afford to use it. I am using the 72 GB standard DB.

VM specs: 4 CPUs, 8 GB RAM, 300 + 100 + 100 GB SSD space.

Things I tried:

  • I have a 100 GB swap drive mounted as swap space for additional memory.
  • I mounted a 100 GB external drive at /dev/shm (I think it has to be a tmpfs).
  • I copied the database into /dev/shm expecting preload behaviour, but it failed (with and without --memory-mapping). This may be because I am using an external drive as /dev/shm, and Linux may require it to be a tmpfs; I'm not really sure.
  • When I use --memory-mapping, the threads cannot use more than 60% of the 4 CPUs available on the VM and the run takes forever to finish, but when I don't use --memory-mapping, all threads (processes) can use up to 100% of the available CPUs.

[screenshot: with memory mapping]

[screenshot: without memory mapping]

  • I tried the vmtouch utility, which I have found most helpful on some other systems for preloading database files (directories) into memory, but it failed. Kraken2 does not check whether database files are already in memory. See below:
(base) root@pato:~# find /dev/shm/k2_standard_20240112 -type f -print0 | xargs -0 vmtouch -vt
/dev/shm/k2_standard_20240112/database300mers.kmer_distrib
[OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 704/704
/dev/shm/k2_standard_20240112/database250mers.kmer_distrib
[OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 758/758
/dev/shm/k2_standard_20240112/database50mers.kmer_distrib
[OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 1127/1127
/dev/shm/k2_standard_20240112/opts.k2d
[O] 1/1
/dev/shm/k2_standard_20240112/taxo.k2d
[OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 934/934
/dev/shm/k2_standard_20240112/hash.k2d
[OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 18824189/18824189
/dev/shm/k2_standard_20240112/library_report.tsv
[OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 10275/10275
/dev/shm/k2_standard_20240112/database200mers.kmer_distrib
[OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 815/815
/dev/shm/k2_standard_20240112/ktaxonomy.tsv
[OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 605/605
/dev/shm/k2_standard_20240112/seqid2taxid.map
[OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 1588/1588
/dev/shm/k2_standard_20240112/database100mers.kmer_distrib
[OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 975/975
/dev/shm/k2_standard_20240112/inspect.txt
[OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 803/803
/dev/shm/k2_standard_20240112/database150mers.kmer_distrib
[OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 888/888
/dev/shm/k2_standard_20240112/unmapped_accessions.txt
[OO] 2/2
/dev/shm/k2_standard_20240112/database75mers.kmer_distrib
[OOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 1043/1043

           Files: 15
     Directories: 0
   Touched Pages: 18844707 (71G)
         Elapsed: 142.88 seconds
  • I tried cat <filename> > /dev/null to make the OS cache all the files. No luck.

Can we modify the code so that:

  • kraken2 respects databases that are already loaded?
  • kraken2 can use an external key/value store located somewhere else (Redis, maybe), or even better a NoSQL database?

Slamoth avatar May 02 '24 13:05 Slamoth

I'm afraid there's a world of difference between the IO of a ramdisk and a hard drive (even an SSD). You can see in your first screenshot that the kraken2 processes are in D state, waiting on IO (using memory mapping with a database mounted on an external drive at /dev/shm). In your second screenshot, it's putting the db into swap (see the increased usage in htop), which as you said is on a separate SSD that I'm assuming has a better interface than the external drive mounted on /dev/shm, and so has better IO. That said, IO is clearly the bottleneck, as you can see from the load averages. If you can get access to a VM with even a bit more RAM, you can use the *-8 and/or *-16 databases from Langmead et al., which are capped at 8 and 16 GB respectively, and put one into the ramdisk /dev/shm so you can run kraken2 in bulk with --memory-mapping.

thomcuddihy avatar May 07 '24 00:05 thomcuddihy

I am using a custom 220GB index on 256GB RAM machine with 40 cores.

[screenshot: 2024-07-03 at 22:38]

A fastq file with 110k lines is taking ~20 minutes to run.

$ time kraken2 --threads 40 --db custom_db --report r10k.txt s10k.fastq  
2.06s user 453.95s system 36% cpu 21:03.71 total

When I copy the three *.k2d files from the db to /dev/shm and run the script with the --memory-mapping option, it takes forever. Even after 2+ hours it is still running.

$ time kraken2 --threads 40 --db custom_db --report r10k.txt s10k.fastq --memory-mapping

Shouldn't it be much faster now?

ChillarAnand avatar Jul 03 '24 17:07 ChillarAnand

The size of /dev/shm defaults to half of the OS RAM, which in your case is 128 GB, smaller than the custom db. To hold the kraken2 db in it, you can resize the /dev/shm mount with root privileges, following https://stackoverflow.com/questions/58804022/how-to-resize-dev-shm

Edit /etc/fstab to increase the size to slightly larger than the db:

none /dev/shm tmpfs defaults,size=230G 0 0

and remount:

$ mount -o remount /dev/shm

or perform one time resize:

$ sudo mount -o remount,size=230G /dev/shm
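Before copying, it is worth verifying that the (possibly resized) /dev/shm actually has room for the DB; a small guard sketch, with a placeholder DB path:

```shell
#!/bin/sh
# Check that a tmpfs mount has enough free space to hold the DB
# before copying. Sizes are compared in kilobytes.
fits_in_shm() {
    db=$1
    shm=${2:-/dev/shm}
    need=$(du -sk "$db" | cut -f1)                 # KB used by the DB
    avail=$(df -Pk "$shm" | awk 'NR==2 {print $4}') # KB free on the mount
    [ "$avail" -gt "$need" ]
}

# Example (hypothetical path):
# fits_in_shm /path/to/custom_db && cp /path/to/custom_db/*.k2d /dev/shm/
```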

slw287r avatar Jul 04 '24 01:07 slw287r

@slw287r Thanks for the suggestion. I have resized it and also updated the command so that --db points to /dev/shm instead of the default db.

➜  df -h
Filesystem                         Size  Used Avail Use% Mounted on
tmpfs                               26G  3.8M   26G   1% /run
tmpfs                              225G  203G   23G  90% /dev/shm

Run kraken2 as follows. Note that --db points to /dev/shm/, not the original db.

➜ time kraken2 --threads 40 --db /dev/shm/ --report r.txt --memory-mapping foo.fastq > o.txt

ChillarAnand avatar Jul 04 '24 03:07 ChillarAnand

Without memory map

➜  time kraken2 --threads 40 --db custom_db --report r.txt sample.fastq > o.txt
Loading database information... done.
193213 sequences (155.88 Mbp) processed in 4728.601s (2.5 Kseq/m, 1.98 Mbp/m).
  177308 sequences classified (91.77%)
  15905 sequences unclassified (8.23%)
kraken2 --threads 40 --db 250k_index --report r.txt  > o.txt  
104.52s user 595.08s system 12% cpu 1:35:32.52 total

With memory map

➜  time kraken2 --threads 40 --db /dev/shm/ --report r.txt --memory-mapping sample.fastq > o.txt
Loading database information... done.
193213 sequences (155.88 Mbp) processed in 3437.083s (3.4 Kseq/m, 2.72 Mbp/m).
  177308 sequences classified (91.77%)
  15905 sequences unclassified (8.23%)
kraken2 --threads 40 --db /dev/shm/ --report r.txt --memory-mapping  > o.txt  
76.06s user 123.81s system 5% cpu 57:24.54 total

It is almost 40 minutes faster now.

CPU: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz, 40 cores. RAM: 256 GB. Index size: 202 GB.

Do you have any suggestions on how to improve the speed further?

I am targeting to complete a run in ~15 minutes.

ChillarAnand avatar Jul 04 '24 10:07 ChillarAnand

Not sure if getting more system RAM helps in your case, but you can specify a reasonable confidence score such as --confidence 0.1 to drop reads with lower scores and save the extra calculation time.

BTW, contents of /dev/shm/ may get swapped out to hard disk if there is not enough physical RAM available. Once this happens, the database loading process may hang forever; vmtouch may help.

slw287r avatar Jul 04 '24 11:07 slw287r

@slw287r I am using a 256 GB machine and the index size is ~200 GB. No other processes are running on the machine. It is highly unlikely that the RAM gets swapped.

Thanks for pointing out vmtouch. I will make a note of that.

My friend was using this same machine with the same index and the same sample file. He mentioned that the process (using the same command above) used to complete in ~20 minutes. Recently the Ubuntu server crashed, and after re-installing Ubuntu this process takes ~90 minutes.

Just wondering if I am missing anything else in optimizing the performance.

ChillarAnand avatar Jul 05 '24 06:07 ChillarAnand

@ChillarAnand Maybe you can try the following steps to check further:

  • Clean shm
rm -rf /dev/shm/*
  • Evict custom_db from OS cache
time vmtouch -e /path/to/custom_db/
  • Copy db to /dev/shm (RAM)
time cp /path/to/custom_db/* /dev/shm/
  • Run classification with --memory-mapping
time kraken2 --threads 40 --db /dev/shm/ --report r.txt --memory-mapping sample.fastq > o.txt
  • Run classification without --memory-mapping
time kraken2 --threads 40 --db /dev/shm/ --report r.txt sample.fastq > o.txt
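The checklist above can be wrapped into one script for a clean A/B comparison; the assumption is that kraken2 and vmtouch are on PATH, and all paths are placeholders:

```shell
#!/bin/sh
# A/B check: same DB staged in /dev/shm, same sample, run first with and
# then without --memory-mapping. Prefix the kraken2 calls with `time` to
# compare wall-clock durations. All paths below are placeholders.
DB_SRC=${DB_SRC:-/path/to/custom_db}
SHM=${SHM:-/dev/shm}
SAMPLE=${SAMPLE:-sample.fastq}

bench() {
    rm -rf "$SHM"/*.k2d                  # 1. clean shm
    vmtouch -e "$DB_SRC" > /dev/null     # 2. evict db from the OS cache
    cp "$DB_SRC"/*.k2d "$SHM"/           # 3. copy db into RAM
    kraken2 --threads 40 --db "$SHM"/ --report r_mmap.txt \
        --memory-mapping "$SAMPLE" > o_mmap.txt      # 4. with mapping
    kraken2 --threads 40 --db "$SHM"/ --report r_load.txt \
        "$SAMPLE" > o_load.txt                       # 5. without mapping
}
```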

slw287r avatar Jul 05 '24 09:07 slw287r

Thanks, @slw287r

I tried the same sequence you mentioned. It is still taking the same time.

However, I noticed that kraken2 somehow caches the results. For example, if I run a sample for the first time, it takes an hour or so. The next time I run the same sample, it completes in ~3 minutes. Not sure how this caching is implemented.
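The rerun speedup described above is consistent with the OS page cache keeping the DB and sample files warm rather than kraken2 caching anything itself. To measure a genuine cold start, the relevant files can be evicted first; a sketch assuming vmtouch is installed, with placeholder paths:

```shell
#!/bin/sh
# Evict a set of files from the OS page cache so the next kraken2 run
# is a true cold-cache measurement. Requires vmtouch on PATH.
evict_all() {
    for f in "$@"; do
        vmtouch -e "$f" > /dev/null
    done
}

# Example (hypothetical paths):
# evict_all custom_db/*.k2d sample.fastq
```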

ChillarAnand avatar Jul 06 '24 17:07 ChillarAnand

After re-installing Ubuntu on the HDD, kraken2's performance improved significantly. Now we are able to run 2 GB sample files in less than a minute. Not sure why there was such a significant performance improvement.

ChillarAnand avatar Jul 09 '24 14:07 ChillarAnand

Thanks @slw287r & everyone for the help.

I have improved the classification time from ~100 minutes to ~10 seconds.

I also wrote a detailed blog post on how the time improved at each stage, with benchmarks. You can take a look if you are looking into optimising the speed.

https://avilpage.com/2024/07/mastering-kraken2-performance-optimisation.html

ChillarAnand avatar Jul 28 '24 07:07 ChillarAnand