kaiju
Custom db question
Hi,
CONTEXT: metagenomic analysis of whole-genome sequencing data from Oxford Nanopore. The 15 samples were collected in caves. I only want to focus on plants and bats, so I built two custom databases by modifying kaiju-taxonlistEuk.tsv:
- one db only with 33090 Viridiplantae
- one db only with 30560 Microchiroptera
ISSUES:
- even though the samples came from different locations, their compositions are suspiciously similar
- it finds some taxa that do not occur in this geographical location.
- the Krona chart shows a high percentage of "other Microchiroptera", but I cannot find this entry in the table. The table only shows the "unclassified" reads, and the two percentages do not match.
QUESTION: Could it be that there is a problem with my custom databases or could it be due to the Nanopore type of data?
It would be very helpful if someone could give me some hints. Thanks
1+2: You probably get a lot of false positive matches. Try reducing the E-value threshold and/or raising the minimum required score in Greedy mode. 3: I don't understand what you mean exactly.
Generally there is always the problem of false positive matches, especially when using a database with limited composition. Then it could easily happen that a read would have a good match to species X, but only species Y is in the database and it still matches the read, but with a lower score.
Thank you Peter! I have also read your paper and it has been extremely helpful. It answered many of my questions.
NEW question: You mentioned that creating a db with limited composition can lead to false-positive matches. What would you suggest if I only want to focus on bats (Microchiroptera) and plants (Viridiplantae)? What should I add to these 2 db?
1+2: Yes, thanks. I will try different E-value and -s values in greedy mode, and also change the -e value (maybe trying 0 mismatches) and do MEM runs with different -m values. I will also create a dummy fastq file with bat species that I know occur in this geographical location and see whether Kaiju detects them.
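A minimal sketch of how such a parameter sweep could be organised, in case it helps anyone reading along. It only builds the command lines and does not run anything. The flags (-a, -E, -s, -e, -m, -t, -f, -i, -o) are as I understand them from the Kaiju README, so please check them against `kaiju -h`. The file names and the parameter grids are hypothetical placeholders, not recommended values.

```python
from itertools import product


def kaiju_sweep_commands(nodes, fmi, reads,
                         e_values=(0.01, 0.001),
                         scores=(65, 85),
                         mismatches=(0, 3),
                         mem_lengths=(11, 15)):
    """Build (but do not run) kaiju command lines for a small parameter sweep.

    All grids are illustrative assumptions; adjust them to your data.
    """
    base = ["kaiju", "-t", nodes, "-f", fmi, "-i", reads]
    commands = []

    # Greedy mode: vary E-value (-E), minimum score (-s), allowed mismatches (-e).
    for e_value, score, mism in product(e_values, scores, mismatches):
        out = f"greedy_E{e_value}_s{score}_e{mism}.out"
        commands.append(base + ["-a", "greedy",
                                "-E", str(e_value),
                                "-s", str(score),
                                "-e", str(mism),
                                "-o", out])

    # MEM mode: vary only the minimum match length (-m).
    for length in mem_lengths:
        out = f"mem_m{length}.out"
        commands.append(base + ["-a", "mem", "-m", str(length), "-o", out])

    return commands


if __name__ == "__main__":
    # Hypothetical file names for illustration only.
    for cmd in kaiju_sweep_commands("nodes.dmp", "bats.fmi", "sample01.fastq"):
        print(" ".join(cmd))
```

Each printed line can be pasted into a shell or passed to `subprocess.run`. Keeping one output file per setting makes it easy to compare the reports afterwards.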
3:
What is "other Microchiroptera"? I cannot find the "other Microchiroptera" row in the .tsv file. It only shows the % of unclassified.
[Screenshot: Screen Shot 2022-01-25 at 11 12 41 AM]
- This also makes me wonder why I have rows 27 and 28 if I deleted the viruses from kaiju-taxonlistEuk.tsv?
Thanks!
Hi,
I've figured out that Krona calculates the total % excluding the unclassified reads (in the previous example it sums the rows up to row 28), and that what Krona calls "other Microchiroptera" corresponds to "cannot be assigned to a (non-viral) species" in the kaiju report.
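To make the difference between the two percentages concrete, here is a small sketch of the arithmetic as I understand it: the table divides by all reads, while Krona divides by classified reads only. The read counts and the species names are invented for illustration and do not come from my data.

```python
def report_and_krona_percentages(counts, unclassified_key="unclassified"):
    """Return (table-style, Krona-style) percentages from raw read counts.

    table-style: each row divided by ALL reads, unclassified included.
    Krona-style: each classified row divided by classified reads only.
    """
    total_all = sum(counts.values())
    total_classified = total_all - counts.get(unclassified_key, 0)
    if total_all == 0 or total_classified == 0:
        raise ValueError("need at least one classified read")

    table = {name: 100 * n / total_all for name, n in counts.items()}
    krona = {name: 100 * n / total_classified
             for name, n in counts.items() if name != unclassified_key}
    return table, krona


if __name__ == "__main__":
    # Invented counts, for illustration only.
    counts = {
        "species A": 300,
        "species B": 100,
        "cannot be assigned to a (non-viral) species": 100,
        "unclassified": 500,
    }
    table, krona = report_and_krona_percentages(counts)
    for name in counts:
        print(f"{name}: table {table[name]:.1f}%, "
              f"Krona {krona.get(name, float('nan')):.1f}%")
```

With these made-up counts, the "cannot be assigned" row is 10% in the table but 20% in the Krona-style view, which is the kind of mismatch I was seeing.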
However, I have noticed that in greedy mode the total % is off by around 1-2%. This does not seem to happen in MEM mode. Below is an example from greedy mode, where the percentages sum to 98%. What could be the cause of this?
Previously, you mentioned that creating a db with limited composition can lead to false-positive matches. What would you suggest if I only want to focus on bats (Microchiroptera) and plants (Viridiplantae)? What should I add to these 2 db?
Thanks in advance