kraken2
kraken2 copied to clipboard
Classifying multiple samples
Hi,
I am using Kraken2 for classifying 16s amplicon data (I have around 100 samples). I have successfully built the SILVA database. However, I wanted to know about processing multiple samples. Can I process all the samples in a single run or will I need to run Kraken2 multiple times (one sample at a time).
Thanks, Ankit
This is also an problem for me - the database loading time is several minutes for each sample.
@DerrickWood Would it be feasible to implement this? It would be really helpful to be able to run kraken2 on multiple sample files at once, with a separate output file for each sample file, avoiding the need to load the database into memory repeatedly. I looked into the code to try to see how difficult this would be but couldn't get very far. My C++ is pretty rusty and I don't have any experience with Perl.
This involves some computer magic, but have you tried mapping/caching the database on your RAM? Related questions on Unix & Linux, serverfault and Stack Overflow.
I haven't tried this myself, but thought it might work for you.
There is another issue here asking for the same and someone has provided this feature. https://github.com/DerrickWood/kraken2/issues/233#issuecomment-629775384
Hello,
This can already be done in kraken2 today. Using k2
wrapper script together with the standard08gb database and sample files from data
directory, we can run the below command to classify multiple paired-end reads:
$ ./scripts/k2 classify --db standard_08gb --paired --output /dev/null data/HIV_1.fa data/HIV_2.fa data/FluA_H2N2.fa data/FluA_H1N1.fa
Loading database information... done.
9 sequences (0.05 Mbp) processed in 0.005s (116.4 Kseq/m, 604.61 Mbp/m).
9 sequences classified (100.00%)
0 sequences unclassified (0.00%)
The script will fail if given an odd number of files:
$ ./scripts/k2 classify --db standard_08gb --paired --output /dev/null data/HIV_1.fa data/HIV_2.fa data/FluA_H2N2.fa
--paired requires an even number of file names
Sequence counts:
$ grep '>' data/HIV_1.fa | wc -l
1
$ grep '>' data/HIV_2.fa | wc -l
1
$ grep '>' data/FluA_H1N1.fa | wc -l
8
$ grep '>' data/FluA_H2N2.fa | wc -l
8
This can already be done in kraken2 today. Using
k2
wrapper script together with the database and sample files fromdata
directory, we can run the below command to classify multiple paired-end reads:
Unfortunately, no... I tested this way, it does different thing: it performed analysis of SINGLE sample having several pairs of fastq files (e.g. from sequence of single library using several lanes). Namely, k2 produces only one output/report file for the entire batch of fastq pairs, rather than a batch of reports, one for each pair.
Ah, I misunderstood the question. This is not something that we support in the current wrappers but could be possible with post-processing. Here's an example AWK script that I think fulfills your request.
Given the following output file:
C NZ_PVMH01000002.1 5866 88223 0:34089 5866:10 0:98 5866:3 0:60 5866:2 0:5 5866:5 0:11 5866:54 0:22265 9606:4 5861:2 0:31581
C NZ_JAGFJR010000036.1 9606 143668 0:3521 94643:1 0:86079 9606:5 0:23389 9606:5 0:30634
C NZ_JAGFJR010000040.1 5866 5355 0:12 5866:54 0:99 5866:4 0:42 5866:79 0:9 5866:1 0:32 5866:120 0:38 5866:3 0:8 5866:20 0:135 5866:1 0:10 5866:1 0:58 5866:20 0:2 5866:5 0:4 5866:4 0:122 5866:1 0:41 5866:16 0:54 5866:1 0:10 5866:13 0:7 5866:1 0:54 5866:1 0:9 5866:22 0:4168 5866:5 0:35
U NZ_JAGFJR010000041.1 0 2096 0:2062
U NZ_JAGEOP010000044.1 0 61871 0:12238 94643:1 0:49598
U NZ_JAGEOP010000047.1 0 52048 0:20695 9606:3 0:31316
C NZ_JAGEOP010000049.1 5866 37697 0:1858 9606:1 0:31982 5866:1 0:10 5866:1 0:3810
C NZ_JAGEOP010000050.1 5866 6062 0:4233 5866:3 0:188 5866:5 0:21 5866:3 0:13 5866:15 0:360 5866:8 0:5 5866:1 0:19 5866:1 0:3 5866:34 0:54 5866:1 0:78 5866:28 0:955
C NZ_JAGEOP010000051.1 5866 4052 0:1904 5866:10 0:42 5866:5 0:11 5866:1 0:7 5866:3 0:78 5866:7 0:17 5866:1 0:5 5866:103 0:21 5866:3 0:13 5866:15 0:166 5866:5 0:1601
I can pipe this output to the following awk command that will output each classified sequence to a file named after the sequence name.
cat example.out | awk '/^C/ { outfile=$2".out"; print >> (outfile); close(outfile); }'
This produces the following output:
ls -l
-rw-r--r-- 1 kraken2 kraken2 88 2024-03-21 12:22 NZ_JAGFJR010000036.1.out
-rw-r--r-- 1 kraken2 kraken2 285 2024-03-21 12:22 NZ_JAGFJR010000040.1.out
-rw-r--r-- 1 kraken2 kraken2 82 2024-03-21 12:22 NZ_JAGEOP010000049.1.out
-rw-r--r-- 1 kraken2 kraken2 164 2024-03-21 12:22 NZ_JAGEOP010000050.1.out
-rw-r--r-- 1 kraken2 kraken2 165 2024-03-21 12:22 NZ_JAGEOP010000051.1.out
Let me know if this comes closer to what you're asking.
Let me know if this comes closer to what you're asking.
Dear ch4rr0, maybe this trick will do what we need, although I haven't tested it yet. I solved the problem using the simpler solution mentioned earlier, namely I used a RAM disk and placed kraken2 db on that RAM disk. Your solution may be faster and require less RAM than using a RAM disk.
Ah, I misunderstood the question. This is not something that we support in the current wrappers but could be possible with post-processing. Here's an example AWK script that I think fulfills your request.
Dear ch4rr0, yes, this trick works, I checked. There is one problem: read IDs in fastq files may not contain the sample ID. Moreover, read IDs may not be unique for the entire group of fastq files. This circumstance makes it difficult to restore the sample ID from the read ID. However, there is one workaround: add sample_id to each read_id in each fastq file.
Below is the bash + awk code to perform this task for 1 FASTQ file.
- inputs: $run_id, $file_in, $file_out
- take 1st line from each 4 line group (read id)
- use run_id as awk variable 'name'
- replace '@' char with @name:
awk -v name="$run_id" '{ if (NR%4==1) gsub("@","@"name":",$1); print }' $file_in > $file_out