tools-iuc
Bracken data manager requires labels and help
I can't figure out what needs to be set: what the implication of "Use Pre-built DB" is, why I need to select a Kraken DB, what happens if I set the optional Database Name, and what the implications of K-mer length and Read length are.
These should all be described.
I spent a lot of time trying to understand the parameters used to build the "Pre-built" DB indexes on https://benlangmead.github.io/aws-indexes/k2 . What I found is that the pre-built Kraken2 DBs ship with the Bracken DB index as well. Therefore, the existing Bracken data manager can reuse the same pre-built Kraken2 DB link. For instance, this pre-built DB (https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20221209.tar.gz) contains both the standard Kraken2 DB and the Bracken DB.
I wonder if Galaxy could use the same Kraken2 data manager for the Bracken DB. I imagine it would be very annoying to write a separate Bracken data manager wrapper. To use the Bracken DB, I manually edited bracken_database.loc to point to the same Kraken2 folder, which contains the Bracken DB index from the tarball above.
That is a very interesting and useful observation, thank you @mthang!
The prebuilt DB does indeed contain both the Kraken2 and Bracken files. From the README of 16S_Greengenes_k2db:
## Files Included
Kraken 2 Database files
* `hash.k2d`
* `opts.k2d`
* `taxo.k2d`
Bracken Files
* `database50mers.kmer_distrib`
* `database75mers.kmer_distrib`
* `database100mers.kmer_distrib`
* `database150mers.kmer_distrib`
* `database200mers.kmer_distrib`
* `database250mers.kmer_distrib`
* `seqid2taxid.map`
So for the prebuilt indexes both can be downloaded from the same place, but it still makes sense to me to have a separate data manager for Bracken, since Bracken can also build custom DBs, for which it needs the Kraken2 DB ...
Also, the Kraken DM does not output the k-mer distributions for a selected read length, which is information Bracken needs. The Bracken DM does that from a pre-built Kraken2 DB: it populates the data table for the given read length.
@bebatut, but we could update the Kraken2 DM to populate the Bracken data table with the prebuilt index, right?
Yes probably
But if we want to have custom DBs, we need to be careful with having several data tables for Bracken
I think all bracken DBs should go to the same data table. And @mthang there is already a separate DM for bracken: https://github.com/galaxyproject/tools-iuc/tree/8ff9ada22d22cb94ddfff51bcdd3ab7d30104f1a/data_managers/data_manager_build_bracken_database
When digging into this I found that there is one issue with this wrapper:
The prebuilt option assumes there is already a Bracken DB alongside the Kraken DB. But this is only the case for the DBs downloaded from https://benlangmead.github.io/aws-indexes/k2. For custom DBs built with Kraken2 this is not the case. Here, the DM should fail if the prebuilt option is used, or limit the options.
Also, the admin can choose any value for `read_len`, however only a few are available in the prebuilt DBs. So there should also be a check.
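A rough sketch of what such a check could look like, assuming the Bracken files follow the `database<N>mers.kmer_distrib` naming shown in the README listing above. The function names `available_read_lengths` and `check_prebuilt` are made up for illustration and are not taken from the existing data manager:

```python
import re
from pathlib import Path

# Naming pattern assumed from the README listing, e.g. database100mers.kmer_distrib
KMER_DISTRIB_RE = re.compile(r"^database(\d+)mers\.kmer_distrib$")


def available_read_lengths(db_dir):
    """Return the sorted read lengths that have a kmer_distrib file in db_dir."""
    lengths = []
    for path in Path(db_dir).iterdir():
        match = KMER_DISTRIB_RE.match(path.name)
        if match and path.is_file():
            lengths.append(int(match.group(1)))
    return sorted(lengths)


def check_prebuilt(db_dir, read_len):
    """Return the kmer_distrib path for read_len, or raise ValueError.

    Fails if the directory holds no Bracken files at all (e.g. a custom
    Kraken2 DB) or if the requested read length was never built.
    """
    lengths = available_read_lengths(db_dir)
    if not lengths:
        raise ValueError(f"No Bracken kmer_distrib files found in {db_dir}")
    if read_len not in lengths:
        raise ValueError(
            f"Read length {read_len} is not available; choose one of {lengths}"
        )
    return Path(db_dir) / f"database{read_len}mers.kmer_distrib"
```

With something like this, the prebuilt option could either fail early with a readable message or use `available_read_lengths` to limit what the admin can pick.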
This is indeed the case :)
Will try to fix.
I can't really understand how the read length is used in the Bracken database building process. I spent quite a bit of time reading this manual https://github.com/galaxyproject/tools-iuc/issues/5745 and trying to understand this sentence: "All packages contain a Kraken 2 database along with Bracken databases built for 50, 75, 100, 150, 200, 250 and 300-mers". I was assuming these were the k-mer sizes for different Bracken DB indexes. Then I tried to look for a description of the read length when I was making a loc file. I did not realize that the sentence was referring to the read length until I dug into the Bracken DB wrapper.
I think you are referring to the sentence from here: https://benlangmead.github.io/aws-indexes/k2 I think that sentence is just wrong; in the README of each DB you can find the actual commands they used to build it:
Bracken files (`*.kmer_distrib`) were generated using
```
bracken-build -k 35 -l 50 -d 16S_Greengenes_k2db -t 35
bracken-build -k 35 -l 75 -d 16S_Greengenes_k2db -t 35
bracken-build -k 35 -l 100 -d 16S_Greengenes_k2db -t 35
bracken-build -k 35 -l 150 -d 16S_Greengenes_k2db -t 35
bracken-build -k 35 -l 200 -d 16S_Greengenes_k2db -t 35
bracken-build -k 35 -l 250 -d 16S_Greengenes_k2db -t 35
```
In the bracken docs you can find that:
`bracken-build -d ${KRAKEN_DB} -t ${THREADS} -k ${KMER_LEN} -l ${READ_LEN} -x ${KRAKEN_INSTALLATION} -y ${KRAKEN_TYPE}`
* `${KRAKEN_DB}` = location of the built Kraken 1/Kraken 2/KrakenUniq database
* `${THREADS}` = number of threads to use with Kraken and the Bracken scripts
* `${KMER_LEN}` = length of kmer used to build the Kraken database. Kraken 1/KrakenUniq default kmer length = 31; Kraken 2 default kmer length = 35. Default set in the script is 35.
* `${READ_LEN}` = the read length of your data
`-k` is the k-mer length, and it is 35 for all of those DBs (which should be the case, since it has to match the k-mer length used to build the Kraken DB), whereas the `-l` flag is the read length, which depends on the reads you want to use the DB on; that is why they supply multiple options.
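To make the difference between the two flags concrete, here is a small sketch that generates one `bracken-build` call per read length while keeping `-k` fixed. It only uses the `-k`, `-l`, `-d` and `-t` flags quoted from the README and the Bracken docs; the helper name `bracken_build_commands` is hypothetical, and the sketch only assembles the command lines without running them:

```python
import shlex


def bracken_build_commands(kraken_db, read_lengths, kmer_len=35, threads=1):
    """Build one bracken-build argument list per read length.

    kmer_len stays constant because it has to match the Kraken DB;
    only the read length (-l) varies between calls.
    """
    commands = []
    for read_len in read_lengths:
        commands.append([
            "bracken-build",
            "-k", str(kmer_len),
            "-l", str(read_len),
            "-d", str(kraken_db),
            "-t", str(threads),
        ])
    return commands


if __name__ == "__main__":
    # Prints the same command shape as listed in the 16S_Greengenes_k2db README.
    for cmd in bracken_build_commands("16S_Greengenes_k2db", [50, 75, 100], threads=35):
        print(shlex.join(cmd))
```

Each argument list could be passed to `subprocess.run` on a machine where Bracken is installed; that step is left out here on purpose.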
So finally one can conclude that the files in this DB are mislabelled: `database50mers.kmer_distrib` should be `database50readlength.kmer_distrib`, IMHO; but I will double-check with @jenniferlu717 to be sure. (See https://github.com/jenniferlu717/Bracken/issues/249)
However, the arguments in Galaxy do use the correct wording, i.e. Read length in the DM does indeed select the matching file from the prebuilt DB.
See: https://github.com/galaxyproject/tools-iuc/pull/5804
Thank you for the explanation! It will definitely be easier for researchers/users to select which Bracken DB to use.
Can we close this one?
I think so.