kaiju2table -- Meaning of "cannot be assigned to a (non-viral) X"

Open LeeBergstrand opened this issue 3 years ago • 9 comments

Problem Description:

I'm running the following command:

kaiju2table -o ./narmena_results.tsv \
                -t nodes.dmp \
                -n names.dmp \
                -r species \
                -l superkingdom,phylum,class,order,family,genus,species \
                -e *.tsv

I'm interested in creating a table that can be imported into other microbiome tools. I want this table to be unfiltered, apart from being restricted to the taxonomic levels listed after the -l flag. Any filtering I want to do will happen after the table is created by kaiju2table. However, around 10% of the reads in this table are assigned a taxonomy of "cannot be assigned to a (non-viral) species".

With the -r flag set to species, is kaiju2table binning all the reads with no species-level assignment into the "cannot be assigned to a (non-viral) species" category? Is it binning these reads into that category even if they are classified at higher taxonomic levels?

When looking at my Krona charts made with the kaiju2krona command, I don't see any "cannot be assigned to a (non-viral) species" category.

How would I get detailed taxonomy information for reads that have partial classification? For example, those where the read is known to be bacterial but not identified down to a species level.

Note:

https://github.com/bioinformatics-centre/kaiju/blob/d6e76d613648ab53ce02cc0af7027321439e70bc/src/kaiju2table.cpp#L341-L348

I was looking through the code for kaiju2table and noticed that it was formatted in a way that made it look like an else statement may be missing in the above code block.

Problem Solution

If the -r flag does indeed bin the reads with no species classification, is there a workaround for getting more complete data? For example, I think one workaround would be using the output from kaiju2krona.

LeeBergstrand avatar Apr 15 '21 05:04 LeeBergstrand

How would I get detailed taxonomy information for reads that have partial classification?

You could try kaiju2krona, as kaiju2table requires setting a rank, and everything classified above that rank will be lumped together.

pmenzel avatar Apr 15 '21 20:04 pmenzel
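
The lumping described above can be illustrated with a minimal Python sketch. This is a toy model of the behaviour, not kaiju2table's actual code: the taxon IDs, the `NODES` table, and the `above <rank>` label are all hypothetical and only stand in for the "cannot be assigned to a (non-viral) species" bin.

```python
from collections import Counter

# Toy taxonomy with hypothetical IDs: taxon_id -> (parent_id, rank)
NODES = {
    1: (1, "no rank"),       # root (its own parent)
    2: (1, "superkingdom"),
    10: (2, "genus"),
    100: (10, "species"),
}

def ancestor_at_rank(taxon_id, rank, nodes):
    """Walk towards the root; return the first taxon with the given rank, else None."""
    current = taxon_id
    while True:
        parent, current_rank = nodes[current]
        if current_rank == rank:
            return current
        if parent == current:  # reached the root without finding the rank
            return None
        current = parent

def summarise(read_taxa, rank, nodes):
    """Count reads per taxon at `rank`; reads classified above it share one bin."""
    counts = Counter()
    for taxon_id in read_taxa:
        hit = ancestor_at_rank(taxon_id, rank, nodes)
        counts[hit if hit is not None else f"above {rank}"] += 1
    return counts

# Two reads reach species level, one stops at genus, one at superkingdom.
summary = summarise([100, 100, 10, 2], "species", NODES)
```

In this toy model the genus-level and superkingdom-level reads end up in the same bin, which would explain why a sizeable share of reads can land in a single "cannot be assigned" row even though they carry a partial classification.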

@pmenzel Unfortunately, the kaiju2krona output will not work. Although it produces a tab-delimited file, it is not a standard TSV in which the same taxonomic level of each organism's lineage is aligned into the same column.

root | cellular organisms | Eukaryota | Opisthokonta | Fungi | Dikarya | Ascomycota | saccharomyceta | Pezizomycotina
root | cellular organisms | Archaea | Candidatus Thermoplasmatota | Candidatus Poseidoniia | Marine Group III | Marine Group III euryarchaeote CG-Bathy2 | NaN | NaN
root | cellular organisms | Eukaryota | Cryptophyceae | Pyrenomonadales | Pyrenomonadaceae | Pyrenomonas | Pyrenomonas salina | NaN
root | cellular organisms | Archaea | Euryarchaeota | Stenosarchaea group | Halobacteria | Haloferacales | Halorubraceae | Halorubrum
root | cellular organisms | Bacteria | Proteobacteria | Gammaproteobacteria | Pseudomonadales | Moraxellaceae | Acinetobacter | unclassified Acinetobacter

LeeBergstrand avatar Apr 21 '21 00:04 LeeBergstrand

@pmenzel Can you write me a brief overview of how kaiju2table works? So essentially it reads in the results file, counts the frequency of each NCBI Taxa_ID, then maps these frequencies and Taxa_IDs to taxonomy information in the nodes.dmp file? How does names.dmp come into this?

LeeBergstrand avatar Apr 21 '21 00:04 LeeBergstrand

I'm considering using the ETE3 toolkit's NCBI library to generate the data I need. I am just wondering about the information in the results file that I could use for taxonomy mapping.

http://etetoolkit.org/docs/latest/tutorial/tutorial_ncbitaxonomy.html

LeeBergstrand avatar Apr 21 '21 00:04 LeeBergstrand

@pmenzel Can you write me a brief overview of how kaiju2table works? So essentially it reads in the results file, counts the frequency of each NCBI Taxa_ID, then maps these frequencies and Taxa_IDs to taxonomy information in the nodes.dmp file? How does names.dmp come into this?

names.dmp has the associated name for each taxon ID; nodes.dmp contains just the tree itself.

In principle, one could also add the option for setting a list of desired ranks to kaiju2krona.

pmenzel avatar Apr 21 '21 18:04 pmenzel
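
As a rough illustration of how the two files fit together, here is a minimal Python sketch that reads them into dictionaries. It assumes the standard NCBI taxdump layout (fields separated by tab, pipe, tab, with each line ending in tab, pipe); the function names `split_dmp_line`, `load_nodes`, and `load_names` are made up for this example, and the sample lines use hypothetical IDs apart from the root.

```python
def split_dmp_line(line):
    """Split one taxdump line of the form 'a<TAB>|<TAB>b<TAB>|' into fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]
    return line.split("\t|\t")

def load_nodes(lines):
    """Return (parent, rank_of) dictionaries keyed by integer taxon ID."""
    parent, rank_of = {}, {}
    for line in lines:
        fields = split_dmp_line(line)
        taxon_id = int(fields[0])
        parent[taxon_id] = int(fields[1])
        rank_of[taxon_id] = fields[2]
    return parent, rank_of

def load_names(lines):
    """Return taxon ID -> scientific name, skipping synonyms and other name classes."""
    names = {}
    for line in lines:
        fields = split_dmp_line(line)
        if fields[3] == "scientific name":
            names[int(fields[0])] = fields[1]
    return names

# Small in-memory samples; real files would be opened with open("nodes.dmp") etc.
node_lines = ["1\t|\t1\t|\tno rank\t|\n", "10\t|\t1\t|\tgenus\t|\n"]
name_lines = [
    "1\t|\troot\t|\t\t|\tscientific name\t|\n",
    "1\t|\tall\t|\t\t|\tsynonym\t|\n",
    "10\t|\tExamplegenus\t|\t\t|\tscientific name\t|\n",
]
parent, rank_of = load_nodes(node_lines)
names = load_names(name_lines)
```

The tree (parent and rank per taxon) comes entirely from nodes.dmp, while names.dmp is only needed at the end to turn taxon IDs into readable labels.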

@LeeBergstrand I added option -l for specifying the ranks shown in the output to kaiju2krona in the latest commit.

pmenzel avatar Apr 23 '21 18:04 pmenzel

When using this option, there might be lines with identical taxon paths in the output, depending on the chosen ranks. So it might need some post-processing depending on the downstream analysis. Krona does not mind though.

pmenzel avatar Apr 23 '21 18:04 pmenzel
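
One possible form of that post-processing is to sum the counts of rows that share the same taxon path. The Python sketch below assumes each line holds a read count followed by the tab-separated path (the text layout that Krona imports); that layout and the function name `merge_duplicate_paths` are assumptions of this example, so check them against the actual file first.

```python
from collections import Counter

def merge_duplicate_paths(lines):
    """Sum read counts over rows whose taxon path is identical."""
    totals = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        count, *path = line.split("\t")
        totals[tuple(path)] += int(count)
    return totals

# Two rows collapse to the same path once only some ranks are printed.
example = [
    "3\tBacteria\tProteobacteria\n",
    "2\tBacteria\tProteobacteria\n",
    "1\tArchaea\n",
]
merged = merge_duplicate_paths(example)
```

Here `merged` maps each distinct path, stored as a tuple, to its combined count, which can then be written back out as one row per path.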

@pmenzel, I haven't tried your implementation yet. One of the problems that I ran into is that the Krona output is a ragged TSV. In other words, the cells of each column do not all hold the same rank. If a lineage is missing a rank, the next rank is simply joined onto the row in its place, leaving the row short. I'm not sure if your changes address this issue.

I'm in the process of building a Python microbiome analysis library that supports Kaiju inputs. I've got the library up and running and importing Kaiju output files. My implementation parses column three ('NCBI taxon identifier of the assigned taxon') of the Kaiju output file to get the NCBI taxon ID for each read. I then sum up these taxon ID counts to get taxon ID observations per sample. I then use the ete3 library to build an NCBI taxonomy database to get the taxonomy lineage for each observation. I then filter by rank and leave null values where ranks are missing. Does that sound like it would work? Though it's written in Python, it uses NumPy, so it is fairly fast.

Is column three of the output the right column to use?

LeeBergstrand avatar May 07 '21 21:05 LeeBergstrand

That's more or less what kaiju2table and kaiju2krona are doing. Yes, column 3 is the one with the taxon ID. Just be aware that taxon IDs change over time, so it's advised to rely on the nodes.dmp and names.dmp that were used to make the Kaiju database.

pmenzel avatar May 12 '21 18:05 pmenzel
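
The approach discussed in the last two comments can be sketched in plain Python as follows. This is an illustration under stated assumptions, not code from Kaiju or from the library mentioned above: it assumes tab-separated Kaiju output with the taxon ID in column three, and it takes the taxonomy as `parent`, `rank_of`, and `names` dictionaries (hypothetical names and toy IDs here) that would be built from the nodes.dmp and names.dmp used for the Kaiju database.

```python
from collections import Counter

def count_taxon_ids(kaiju_lines):
    """Count reads per taxon ID taken from column three of Kaiju output."""
    counts = Counter()
    for line in kaiju_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        counts[int(fields[2])] += 1
    return counts

def aligned_lineage(taxon_id, ranks, parent, rank_of, names):
    """Return one name per requested rank, with None where the lineage lacks it."""
    found = {}
    current = taxon_id
    while current in parent:
        found.setdefault(rank_of[current], names.get(current))
        if parent[current] == current:  # root is its own parent
            break
        current = parent[current]
    return [found.get(rank) for rank in ranks]

# Toy taxonomy with hypothetical IDs; no phylum is defined on purpose.
parent = {1: 1, 2: 1, 10: 2, 100: 10}
rank_of = {1: "no rank", 2: "superkingdom", 10: "genus", 100: "species"}
names = {1: "root", 2: "Bacteria", 10: "Acinetobacter", 100: "Acinetobacter baumannii"}

kaiju_lines = ["C\tread1\t100\n", "C\tread2\t10\n", "C\tread3\t10\n", "U\tread4\t0\n"]
ranks = ["superkingdom", "phylum", "genus", "species"]

counts = count_taxon_ids(kaiju_lines)
rows = {tid: aligned_lineage(tid, ranks, parent, rank_of, names) for tid in counts}
```

Because every row has exactly one cell per requested rank, reads classified only to genus keep their superkingdom and genus labels and simply get an empty species cell, which avoids both the ragged rows and the single lumped category.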