
Custom db question

Open valentinavan opened this issue 2 years ago • 3 comments

Hi,

CONTEXT: metagenomic analyses using whole-genome sequencing data from Oxford Nanopore. The 15 samples were collected in caves. I want to focus only on plants and bats, so I built two different custom databases by modifying kaiju-taxonlistEuk.tsv:

  1. one db only with 33090 Viridiplantae
  2. one db only with 30560 Microchiroptera

ISSUES:

  1. even though the samples came from different locations, their compositions are suspiciously similar
  2. it finds some taxa that do not occur in this geographical location.
  3. the Krona chart shows a high percentage of "other microchiroptera" but I cannot find this info in the table. It only shows the "unclassified" reads but these two percentages do not match.

QUESTION: Could it be that there is a problem with my custom databases or could it be due to the Nanopore type of data?

It would be very helpful if someone could give me some hints. Thanks

valentinavan avatar Jan 24 '22 03:01 valentinavan

1+2: You probably get a lot of false positive matches. Try lowering the E-value threshold and/or raising the minimum required score in Greedy mode.

3: I don't understand what you mean exactly.

Generally there is always the problem of false positive matches, especially when using a database with limited composition. Then it could easily happen that a read would have a good match to species X, but only species Y is in the database and it still matches the read, but with a lower score.
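As a rough sketch of what stricter Greedy settings could look like, the snippet below only assembles a Kaiju command line and prints it. The file names (`nodes.dmp`, `bats.fmi`, `sample01.fastq`) and the threshold values are placeholders chosen for illustration, not recommendations; `-a`, `-e`, `-s`, `-E` and `-m` are the mode, mismatch, minimum score, E-value and minimum match length options of `kaiju`.

```python
def build_kaiju_command(nodes, fmi, reads, out, mode="greedy",
                        evalue=None, min_score=None, mismatches=None,
                        min_match_len=None):
    """Assemble a kaiju argument list; nothing is executed here."""
    cmd = ["kaiju", "-t", nodes, "-f", fmi, "-i", reads, "-o", out, "-a", mode]
    if mode == "greedy":
        # Mismatches, score and E-value are only added for Greedy mode.
        if mismatches is not None:
            cmd += ["-e", str(mismatches)]
        if min_score is not None:
            cmd += ["-s", str(min_score)]
        if evalue is not None:
            cmd += ["-E", str(evalue)]
    if min_match_len is not None:
        cmd += ["-m", str(min_match_len)]
    return cmd


# Hypothetical file names and illustrative (stricter than default) thresholds.
strict = build_kaiju_command(
    "nodes.dmp", "bats.fmi", "sample01.fastq", "sample01.strict.out",
    mismatches=3, min_score=80, evalue=0.001,
)
print(" ".join(strict))
```

The returned list can be passed to `subprocess.run(strict, check=True)` once the paths point at real files. Comparing the classified fraction across a few threshold settings gives a feel for how many matches are borderline.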

pmenzel avatar Jan 24 '22 11:01 pmenzel

Thank you Peter! I have also read your paper and it has been extremely helpful. I found answers to many of my questions there.

NEW question: You mentioned that creating a db with limited composition can lead to false-positive matches. What would you suggest if I only want to focus on bats (Microchiroptera) and plants (Viridiplantae)? What should I add to these 2 dbs?

1+2: Yes thanks, I will try to play around with the E-value and -s value in greedy mode but also changing the -e value (maybe trying with 0 mismatches) and also doing MEM runs with different -m values. Will also create a dummy fastq file with bat species that I know are found in this geographical location and see if Kaiju is able to see them.
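A minimal sketch of the dummy fastq idea, assuming a reference sequence for a locally occurring bat species is already available as a plain string. The function name `make_dummy_fastq` and the toy reference are made up for illustration; the reads are error-free substrings with a constant quality string, so they are much cleaner than real Nanopore reads.

```python
import random


def make_dummy_fastq(name, sequence, n_reads, read_len, seed=0):
    """Cut random error-free fragments from `sequence` and format them as FASTQ."""
    if read_len > len(sequence):
        raise ValueError("read_len is longer than the reference sequence")
    rng = random.Random(seed)  # fixed seed so the file is reproducible
    records = []
    for i in range(n_reads):
        start = rng.randint(0, len(sequence) - read_len)
        fragment = sequence[start:start + read_len]
        quality = "I" * read_len  # constant placeholder quality values
        records.append(f"@{name}_read{i}_pos{start}\n{fragment}\n+\n{quality}\n")
    return "".join(records)


# Placeholder reference; in practice this would be read from a real FASTA file.
toy_reference = "ACGTTGCA" * 50
fastq_text = make_dummy_fastq("known_bat", toy_reference, n_reads=5, read_len=100)

with open("dummy_known_bat.fastq", "w") as handle:
    handle.write(fastq_text)
```

If Kaiju assigns these reads to the expected species, the database contains what it should; if it assigns them elsewhere, the database composition is the first thing to check.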

3:
Screen Shot 2022-01-25 at 11 10 33 AM

What is "other Microchiroptera"? I cannot find the "other Microchiroptera" row in the .tsv file. It only shows the % of unclassified.

Screen Shot 2022-01-25 at 11 12 41 AM
  1. This also makes me wonder why rows 27 and 28 are there, given that I deleted the viruses from kaiju-taxonlistEuk.tsv?

Thanks!

valentinavan avatar Jan 25 '22 02:01 valentinavan

Hi,

I've figured out that Krona calculates the total % excluding the unclassified reads (in the previous example, the rows up to row 28 sum to 100%), and that what Krona calls "other Microchiroptera" corresponds to "cannot be assigned to a (non-viral) species" in the kaiju report.
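To make the arithmetic concrete, here is a small sketch with made-up read counts (the taxon names "species A" and "species B" and all numbers are invented for illustration). It assumes the report percentages are computed over all reads, while the Krona-style percentages are computed over classified reads only.

```python
def percentages(rows, include_unclassified):
    """Return {name: percent}, computed with or without the 'unclassified' row."""
    kept = [(name, count) for name, count in rows
            if include_unclassified or name != "unclassified"]
    total = sum(count for _, count in kept)
    if total == 0:
        return {}
    return {name: 100.0 * count / total for name, count in kept}


# Invented counts, not taken from the screenshots in this thread.
example_rows = [
    ("species A", 300),
    ("species B", 100),
    ("cannot be assigned to a (non-viral) species", 600),
    ("unclassified", 1000),
]

report_style = percentages(example_rows, include_unclassified=True)
krona_style = percentages(example_rows, include_unclassified=False)

for name, _ in example_rows:
    print(name, report_style.get(name), krona_style.get(name))
```

With these counts the "cannot be assigned" row is 30% of all reads but 60% of the classified reads, which is why the two views cannot be compared directly.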

  1. However, I have noticed that in greedy mode the total % is off by around 1-2%. This does not seem to happen in MEM mode. Below is an example from greedy mode, where the total sums to 98%. What could be the cause of this?

     Screen Shot 2022-02-03 at 10 26 14 AM

  2. Previously, you mentioned that creating a db with limited composition can lead to false-positive matches. What would you suggest if I only want to focus on bats (Microchiroptera) and plants (Viridiplantae)? What should I add to these 2 dbs?

Thanks in advance

valentinavan avatar Feb 03 '22 01:02 valentinavan