
Classifying multiple samples

Open ankit4035 opened this issue 6 years ago • 9 comments

Hi,

I am using Kraken2 for classifying 16s amplicon data (I have around 100 samples). I have successfully built the SILVA database. However, I wanted to know about processing multiple samples. Can I process all the samples in a single run or will I need to run Kraken2 multiple times (one sample at a time).

Thanks, Ankit

ankit4035 avatar Feb 05 '19 10:02 ankit4035

This is also a problem for me: the database loading time is several minutes for each sample.

@DerrickWood Would it be feasible to implement this? It would be really helpful to be able to run kraken2 on multiple sample files at once, with a separate output file for each sample, avoiding the need to load the database into memory repeatedly. I looked into the code to see how difficult this would be, but couldn't get very far. My C++ is pretty rusty and I don't have any experience with Perl.
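
In the meantime, the closest thing the current CLI seems to offer is the --memory-mapping option, which uses the on-disk database directly instead of copying it into process memory, so repeat runs can benefit from the OS page cache. A rough, untested sketch, with paths and sample names as placeholders:

# --memory-mapping skips the per-run copy of the database into RAM
kraken2 --db /path/to/db --memory-mapping --threads 8 \
    --output sampleA.kraken --report sampleA.report sampleA.fastq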

zmunro avatar Jul 29 '19 16:07 zmunro

This involves some computer magic, but have you tried mapping/caching the database in RAM? There are related questions on Unix & Linux, serverfault and Stack Overflow.

I haven't tried this myself, but thought it might work for you.
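
For the record, a rough sketch of the RAM-disk variant of this idea (untested; the mount size, paths, and database name are placeholders, and you need root plus enough free RAM to hold the whole database):

# create a tmpfs (RAM-backed) mount and copy the database onto it
sudo mkdir -p /mnt/kraken2_ram
sudo mount -t tmpfs -o size=64G tmpfs /mnt/kraken2_ram
cp -r /path/to/silva_db /mnt/kraken2_ram/

# loading from tmpfs is a RAM-to-RAM copy, so the per-run load step is much faster;
# combined with --memory-mapping the copy can be skipped entirely
kraken2 --db /mnt/kraken2_ram/silva_db \
    --output sampleA.kraken --report sampleA.report sampleA.fastq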

Dries-B avatar Sep 06 '19 15:09 Dries-B

There is another issue asking for the same thing, and someone has provided this feature: https://github.com/DerrickWood/kraken2/issues/233#issuecomment-629775384

Midnighter avatar Jun 22 '20 10:06 Midnighter

Hello,

This can already be done in kraken2 today. Using the k2 wrapper script together with the standard_08gb database and sample files from the data directory, we can run the command below to classify multiple paired-end reads:

$ ./scripts/k2 classify --db standard_08gb --paired --output /dev/null data/HIV_1.fa data/HIV_2.fa data/FluA_H2N2.fa  data/FluA_H1N1.fa

Loading database information... done.
9 sequences (0.05 Mbp) processed in 0.005s (116.4 Kseq/m, 604.61 Mbp/m).
  9 sequences classified (100.00%)
  0 sequences unclassified (0.00%)

The script will fail if given an odd number of files:

$ ./scripts/k2 classify --db standard_08gb --paired --output /dev/null data/HIV_1.fa data/HIV_2.fa data/FluA_H2N2.fa

--paired requires an even number of file names

Sequence counts:

$ grep '>' data/HIV_1.fa | wc -l
        1

$ grep '>' data/HIV_2.fa | wc -l
        1

$ grep '>' data/FluA_H1N1.fa | wc -l
        8

$ grep '>' data/FluA_H2N2.fa | wc -l
        8

ch4rr0 avatar Apr 24 '23 17:04 ch4rr0

This can already be done in kraken2 today. Using k2 wrapper script together with the database and sample files from data directory, we can run the below command to classify multiple paired-end reads:

Unfortunately, no... I tested this approach and it does something different: it performs the analysis of a SINGLE sample that has several pairs of fastq files (e.g. from sequencing a single library across several lanes). Namely, k2 produces only one output/report file for the entire batch of fastq pairs, rather than a batch of reports, one for each pair.
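
For now, the only way I see to get one report per pair is one run per pair, e.g. a loop roughly like the following (untested; it assumes pairs named <sample>_1.fastq / <sample>_2.fastq and reloads the database on every iteration):

for r1 in *_1.fastq; do
    sample=${r1%_1.fastq}
    kraken2 --db /path/to/db --paired \
        --output "${sample}.kraken" --report "${sample}.report" \
        "${sample}_1.fastq" "${sample}_2.fastq"
done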

nikyk avatar Mar 14 '24 14:03 nikyk

Ah, I misunderstood the question. This is not something that we support in the current wrappers but could be possible with post-processing. Here's an example AWK script that I think fulfills your request.

Given the following output file:

C	NZ_PVMH01000002.1	5866	88223	0:34089 5866:10 0:98 5866:3 0:60 5866:2 0:5 5866:5 0:11 5866:54 0:22265 9606:4 5861:2 0:31581
C	NZ_JAGFJR010000036.1	9606	143668	0:3521 94643:1 0:86079 9606:5 0:23389 9606:5 0:30634
C	NZ_JAGFJR010000040.1	5866	5355	0:12 5866:54 0:99 5866:4 0:42 5866:79 0:9 5866:1 0:32 5866:120 0:38 5866:3 0:8 5866:20 0:135 5866:1 0:10 5866:1 0:58 5866:20 0:2 5866:5 0:4 5866:4 0:122 5866:1 0:41 5866:16 0:54 5866:1 0:10 5866:13 0:7 5866:1 0:54 5866:1 0:9 5866:22 0:4168 5866:5 0:35
U	NZ_JAGFJR010000041.1	0	2096	0:2062
U	NZ_JAGEOP010000044.1	0	61871	0:12238 94643:1 0:49598
U	NZ_JAGEOP010000047.1	0	52048	0:20695 9606:3 0:31316
C	NZ_JAGEOP010000049.1	5866	37697	0:1858 9606:1 0:31982 5866:1 0:10 5866:1 0:3810
C	NZ_JAGEOP010000050.1	5866	6062	0:4233 5866:3 0:188 5866:5 0:21 5866:3 0:13 5866:15 0:360 5866:8 0:5 5866:1 0:19 5866:1 0:3 5866:34 0:54 5866:1 0:78 5866:28 0:955
C	NZ_JAGEOP010000051.1	5866	4052	0:1904 5866:10 0:42 5866:5 0:11 5866:1 0:7 5866:3 0:78 5866:7 0:17 5866:1 0:5 5866:103 0:21 5866:3 0:13 5866:15 0:166 5866:5 0:1601

I can pipe this output to the following awk command, which writes each classified sequence to a file named after the sequence.

cat example.out | awk '/^C/ { outfile=$2".out"; print >> (outfile); close(outfile); }'

This produces the following output:

ls -l
-rw-r--r--   1 kraken2       kraken2        88 2024-03-21 12:22 NZ_JAGFJR010000036.1.out
-rw-r--r--   1 kraken2       kraken2       285 2024-03-21 12:22 NZ_JAGFJR010000040.1.out
-rw-r--r--   1 kraken2       kraken2        82 2024-03-21 12:22 NZ_JAGEOP010000049.1.out
-rw-r--r--   1 kraken2       kraken2       164 2024-03-21 12:22 NZ_JAGEOP010000050.1.out
-rw-r--r--   1 kraken2       kraken2       165 2024-03-21 12:22 NZ_JAGEOP010000051.1.out

Let me know if this comes closer to what you're asking.

ch4rr0 avatar Mar 21 '24 16:03 ch4rr0

Let me know if this comes closer to what you're asking.

Dear ch4rr0, maybe this trick will do what we need, although I haven't tested it yet. I solved the problem with the simpler approach mentioned earlier: I used a RAM disk and placed the kraken2 db on it. Your solution may be faster and require less RAM than using a RAM disk.

nikyk avatar Mar 21 '24 18:03 nikyk

Ah, I misunderstood the question. This is not something that we support in the current wrappers but could be possible with post-processing. Here's an example AWK script that I think fulfills your request.

Dear ch4rr0, yes, this trick works; I checked. There is one problem: read IDs in fastq files may not contain the sample ID. Moreover, read IDs may not be unique across the whole group of fastq files, which makes it difficult to recover the sample ID from the read ID. However, there is a workaround: add the sample_id to each read_id in each fastq file.

Below is the bash + awk code to perform this task for 1 FASTQ file.

  1. inputs: $run_id, $file_in, $file_out
  2. operate on the 1st line of every 4-line group (the read ID)
  3. pass $run_id into awk as the variable 'name'
  4. replace the '@' character with '@<name>:' so each read ID is prefixed with the sample ID:

awk -v name="$run_id" '{ if (NR%4==1) gsub("@","@"name":",$1); print }' $file_in > $file_out
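
Putting the pieces together, here is a rough, untested sketch of the whole workaround: tag every read with its sample ID, classify all pairs in a single run (so the database is loaded only once), and then split the combined output per sample using the tag that now prefixes column 2 of the kraken2 output. The file naming (<sample>_R1.fastq / <sample>_R2.fastq) and paths are assumptions:

# 1) prepend the sample ID to every read ID (the awk one-liner above, applied per file)
for r1 in *_R1.fastq; do
    run_id=${r1%_R1.fastq}
    for mate in R1 R2; do
        awk -v name="$run_id" '{ if (NR%4==1) gsub("@","@"name":",$1); print }' \
            "${run_id}_${mate}.fastq" > "tagged_${run_id}_${mate}.fastq"
    done
done

# 2) classify everything in one run; --paired expects the file names listed pairwise
files=()
for r1 in tagged_*_R1.fastq; do
    files+=("$r1" "${r1%_R1.fastq}_R2.fastq")
done
./scripts/k2 classify --db /path/to/db --paired --output all_samples.kraken "${files[@]}"

# 3) split the combined per-read output by sample: column 2 now starts with
#    "<sample>:", so everything before the first ':' names the per-sample file
awk -F'\t' '{ split($2, a, ":"); out = a[1] ".kraken"; print >> out; close(out) }' all_samples.kraken

Note that this only splits the per-read output; it does not produce per-sample report files.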

nikyk avatar Apr 30 '24 09:04 nikyk