kraken2
kraken2 copied to clipboard
kraken2 2.1.3 empty taxonomic level field in k2report
Using the standard 16gb library and some bacterial reads we seem to get empty taxonomic level fields in the kraken2 report file. For example, the root and bacteria lines have no taxonomic level.
bracken does not like that... Are we doing something wrong ?
devarea@sx253:/scratch/tmp/Pseudomonas/kraken2-pipeline$ docker run -it -v $PWD:/analysis kraken2 bash -c "cd /analysis; kraken2 --db ./databases/standard_16gb --threads 4 --report D7055-2.k2report --report-minimizer-data --minimum-hit-groups 3 --gzip-compressed D7055-2.fasta.gz > D7055-2.kraken2"
Loading database information... done.
100000 sequences (12.56 Mbp) processed in 0.424s (14137.4 Kseq/m, 1774.97 Mbp/m).
78744 sequences classified (78.74%)
21256 sequences unclassified (21.26%)
devarea@sx253:/scratch/tmp/Pseudomonas/kraken2-pipeline$ head -n 20 D7055-2.k2report
21.26 21256 21256 0 0 U 0 unclassified
78.74 78744 1580 523338 339180 1 root
77.16 77164 0 509838 331368 1 131567 cellular organisms
77.16 77160 177 509352 330978 2 Bacteria
76.96 76964 2182 497689 323733 P 1224 Pseudomonadota
73.50 73497 504 458068 294546 C 1236 Gammaproteobacteria
69.57 69566 5 427073 267411 O 72274 Pseudomonadales
69.56 69561 962 426828 267031 F 135621 Pseudomonadaceae
68.60 68597 54681 409859 254097 G 286 Pseudomonas
13.79 13794 820 41188 25693 G1 136841 Pseudomonas aeruginosa group
12.96 12960 12889 36636 22788 S 287 Pseudomonas aeruginosa
0.01 10 2 12 10 S1 208964 Pseudomonas aeruginosa PAO1
0.01 7 7 9 7 S2 1147787 Pseudomonas aeruginosa PAO1H2O
0.00 1 1 1 1 S2 1367494 Pseudomonas aeruginosa PAO1-VE13
0.01 7 7 12 10 S1 1009714 Pseudomonas aeruginosa PAK
0.01 5 5 10 8 S1 381754 Pseudomonas aeruginosa PA7
0.01 5 5 10 10 S1 1400868 Pseudomonas aeruginosa VRFPA04
0.00 4 4 7 7 S1 1457392 Pseudomonas aeruginosa PA96
0.00 4 4 6 6 S1 1448140 Pseudomonas aeruginosa YL84
0.00 4 4 4 4 S1 1415629 Pseudomonas aeruginosa MTB-1
devarea@sx253:/scratch/tmp/Pseudomonas/kraken2-pipeline$ docker run -it -v $PWD:/analysis kraken2 bash -c "cd /analysis; kraken2 --version"
Kraken version 2.1.3
Copyright 2013-2023, Derrick Wood ([email protected])
devarea@sx253:/scratch/tmp/Pseudomonas/kraken2-pipeline$
devarea@sx253:/scratch/tmp/Pseudomonas/kraken2-pipeline/databases/standard_16gb$ head ktaxonomy.tsv
1 | 1 | R | 0 | root
10239 | 1 | D | 1 | Viruses
131567 | 1 | R1 | 1 | cellular organisms
2787854 | 1 | R1 | 1 | other entries
10472 | 10239 | F | 2 | Plasmaviridae
10474 | 10239 | F | 2 | Fuselloviridae
It seems to be restricted to toplevel domains:
karl@sx253:/scratch/tmp/Pseudomonas/kraken2-pipeline$ cat D7055-2.k2report | grep -P '\t\t'
78.74 78744 1580 523338 339180 1 root
77.16 77160 177 509352 330978 2 Bacteria
0.00 4 0 49 48 2759 Eukaryota
karl@sx253:/scratch/tmp/Pseudomonas/kraken2-pipeline$
Manually adding the correct R and D flags fixes the problem, bracken is happy.
I'd still like to know why this happens.
I've also run into the issue of report files missing rank codes for root and domains. Just recently when using the nt database and also previously when using a custom database after combining PlusPFP and my own sequence via --add-to-library.
For the nt database, its inspect.txt file shows all rank codes are present. Only the reports show them as missing.
Command I used for the runs:
kraken2 \
--confidence 0.1 \
--minimum-hit-groups 3 \
--gzip-compressed \
--db ${db} \
--threads ${SLURM_CPUS_ON_NODE} \
--output ${SLURM_JOB_NAME}.out \
--report ${SLURM_JOB_NAME}.report \
--classified-out ${SLURM_JOB_NAME}_class.fastq \
--unclassified-out ${SLURM_JOB_NAME}_unclass.fastq \
${reads}
I just ran into the same issue of missing rank codes with Kraken version 2.1.2, and both k2_pluspf_20220908 and k2_pluspf_20230605 db.
In my kraken reports, "D" is missing, but subdomain ranks ("D1") show up. "R" is also missing, and the subrank partially shows up ("1" instead of "R1"). This caused issues for me when using KrakenTools/extract_kraken_reads.py with the --include-children
flag. I resolved this by manually adding the missing rank codes.
My command:
kraken2 \
--db k2_pluspf_20220908 \
--gzip-compressed \
--threads 32 \
--minimum-hit-groups 3 \
--report-minimizer-data \
--output k2db_20220908/hits.txt \
--report k2db_20220908/report.txt \
--paired \
S18_R1.fastq.gz \
S18_R2.fastq.gz
Ugg, if you have hundreds of files to fix the following script might be useful. It replaces the missing classification of root, Bacteria, and Eukaryota of all kreports in the CWD with -,D,D respectively. Modify for your given database accordingly.
for val in *.kreport
do
fix=$(echo "$val"|sed -r 's/^(.+).kreport/\1_corrected.kreport/g')
sed -r 's/^([^\t]\t)([^\t]\t)([^\t]\t)([^\t]\t)([^\t]\t)(root)$/\1\2\3-\t\5\6/g' "$val" |
sed -r 's/^([^\t]\t)([^\t]\t)([^\t]\t)([^\t]\t)([^\t]\s+)(Bacteria)$/\1\2\3D\t\5\6/g' |
sed -r 's/^([^\t]\t)([^\t]\t)([^\t]\t)([^\t]\t)([^\t]*\s+)(Eukaryota)$/\1\2\3D\t\5\6/g' > "$fix"
done
@karlkashofer I have noticed this issue recently and am currently investigating.
I'll try to update soon. We are aware of the issue and are trying to address it.