Bracken
Question: Would it be possible to concatenate the Bracken read assignments from 2 different databases (1 bacterial database & 1 viral database)?
Hi all,
I searched this GitHub repo but couldn't find any relevant discussion or information on this topic, so I am starting the discussion here.
I am trying to classify my metagenomic reads using 2 standard databases (GTDB & PlusPF) separately. I ran Kraken-Bracken (with the same database for both steps) on all my samples, so I now have 2 feature tables (1 with the GTDB read counts and 1 with the PlusPF read counts).
Is it valid to merge unique features (for each database) from different databases for classification purposes? This is for expanding taxonomic coverage. I want to study the ecology of the taxa within these samples.
Why merge features from 2 different databases? GTDB has standardized and reassigned bacterial and archaeal genomes/taxa using phylogeny information in addition to the genomic input. Therefore, I get more accurate classifications within Bacteria and Archaea.
The PlusPF database is created from genomes of almost all domains of life (excluding plant genomes). My samples most likely contain a lot of viral genomes as well. I am only interested in the fungal and viral taxa from this database.
I read the Kraken and Bracken papers, and if I understood them correctly, each DNA read is uniquely assigned to a taxon (within a database). If GTDB has only archaeal and bacterial genomes, then the unclassified reads should belong to the missing taxa/genomes. So, I am just adding the classification for those unclassified reads.
Unfortunately, merging the 2 databases is a huge task on its own, and I don't want to go there. Is it fine if I take only the bacterial read counts from the GTDB Bracken output (excluding unclassified, of course) and just add the list of viral/fungal read counts (if any) from the PlusPF output for a given sample?
Does my logic make sense? Or will this violate any assumptions? Any pointers or discussion is welcome! Thank you for your time.
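To make the proposal above concrete, here is a minimal Python sketch of the concatenation for one sample. It is only an illustration, not anything Bracken provides: the function name `concatenate_counts`, the taxon names, and the counts are all made up, and it assumes you have already reduced each Bracken table to a taxon-to-read-count mapping and have your own list of which PlusPF taxa are viral or fungal.

```python
def concatenate_counts(primary, secondary, keep_secondary):
    """Combine per-taxon read counts from two runs on the same sample.

    primary        -- dict: taxon name -> read count (e.g. the GTDB table)
    secondary      -- dict: taxon name -> read count (e.g. the PlusPF table)
    keep_secondary -- set of taxon names to take from `secondary`
                      (e.g. only the viral and fungal taxa)

    All names here are hypothetical; this is a sketch, not a Bracken API.
    """
    merged = dict(primary)  # copy so the input table is not modified
    for taxon, reads in secondary.items():
        if taxon not in keep_secondary:
            continue  # ignore taxa we do not want from the second database
        if taxon in merged:
            # The same taxon in both tables would be counted twice.
            raise ValueError(f"taxon present in both tables: {taxon}")
        merged[taxon] = reads
    return merged


# Toy counts for one sample (invented for illustration).
gtdb_counts = {"bacterium_A": 1200, "bacterium_B": 300}
pluspf_counts = {"bacterium_A": 1100, "fungus_A": 40, "virus_A": 15}

merged = concatenate_counts(gtdb_counts, pluspf_counts, {"fungus_A", "virus_A"})
print(merged)
```

Note that this only guards against the same taxon name appearing twice. It cannot tell whether an individual read was classified by both databases, which has to be checked on the per-read output.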
It sounds fine to me if you concatenate, but only if the unclassified reads from DB 1 are the ones being run against DB 2. I think if there's any overlap in which reads are classified, it doesn't work. I have not tested whether this would give wildly different results, but assuming the genomes are complete and not contaminated, it should work(?)
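One way to check the overlap concern, if both databases were run on the full read set, is to compare the per-read Kraken2 output of the two runs. The sketch below assumes the standard tab-separated Kraken2 output, where the first column is `C` or `U` and the second is the read ID; the function names and the toy lines (including the taxon IDs) are invented for illustration. Alternatively, Kraken2's `--unclassified-out` option can write the reads left unclassified by the first database to a file, which could then be the input for the second run.

```python
def classified_read_ids(kraken_lines):
    """Return the IDs of reads marked classified ('C') in Kraken2 per-read output."""
    ids = set()
    for line in kraken_lines:
        fields = line.rstrip("\n").split("\t")
        # Column 1 is 'C' (classified) or 'U' (unclassified); column 2 is the read ID.
        if len(fields) >= 2 and fields[0] == "C":
            ids.add(fields[1])
    return ids


def doubly_classified(lines_db1, lines_db2):
    """Read IDs classified by both runs; these would be counted twice after concatenation."""
    return classified_read_ids(lines_db1) & classified_read_ids(lines_db2)


# Toy per-read output for the same three reads (taxon IDs are made up).
run_db1 = [
    "C\tread1\t1001\t150\t1001:116",
    "U\tread2\t0\t150\t0:116",
    "C\tread3\t1002\t150\t1002:116",
]
run_db2 = [
    "U\tread1\t0\t150\t0:116",
    "C\tread2\t2001\t150\t2001:116",
    "C\tread3\t2002\t150\t2002:116",
]

overlap = doubly_classified(run_db1, run_db2)
print(sorted(overlap))  # ['read3']
```

With real data you would pass open file handles for the two Kraken2 output files instead of the lists. An empty result suggests the two tables can be concatenated without double counting reads.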
I managed to create a database with genomes from both of these databases, and I could successfully run Kraken2 on the files. Thank you for the discussion; I'm closing this issue.