[Question] Is it possible to use the VEBA Microeukaryotic database with MMSEQS2 taxonomy?
Hello, it's me again.
I wondered if the Microeukaryotic DB can be used with the MMSEQS2 taxonomy module. I am interested in using this database to assign taxonomy at the contig level rather than at the whole-genome level. However, MMSEQS2 requires taxdump-like information from the database. Since the Microeukaryotic DB has a taxonomy format similar to GTDB's, do you think it can be done? Perhaps you already have the nodes.dmp and names.dmp files?
Haha wow, I'm literally trying to solve this problem at this EXACT moment.
Ok, so maybe you can help me out with this, because what you mentioned is the exact use case I'm trying to build it for right now. First, do we need a taxdump file to build a MMSEQS2 taxonomy database?
If so, I think I can use this for help: https://github.com/shenwei356/taxonkit/issues/56#issuecomment-1852830965
If not, what do we need to build it?
For the taxonomy database of MMSEQS2 you need the taxdump files: names.dmp, nodes.dmp, taxid.map, delnodes.dmp & merged.dmp.
A few weeks ago I tried creating the taxdump files using taxonkit's create-taxdump function. See the manual here.
I modified the humann_uniref50_annotations.tsv.gz file to match the input format the program requires, something like this:
e66f926af8418c8873d264785662420e c__Acantharea o__Arthracanthida f__Acanthometridae g__Amphilonche s__Amphilonche elongata
38c6e6957eabc901ba3b9c708a5e2eee c__Acantharea o__Arthracanthida f__Acanthometridae g__Amphilonche s__Amphilonche elongata
083441fc54d5eb9ecc0cd83490139b14 c__Acantharea o__Arthracanthida f__Phyllostauridae g__Phyllostaurus s__Phyllostaurus siculus
5df8df7ddea8c8316cfdfc02e9ceb988 c__Acantharea o__Arthracanthida f__Phyllostauridae g__Phyllostaurus s__Phyllostaurus siculus
It worked fine, but the manual says that the ordering has to be fixed for it to work with MMSEQS2. When I tested some sequences against the taxonomy Microeukaryotic DB, every contig came back with unassigned taxonomy, which shouldn't be the case because I am 100% sure I have sequences that are in the database. I think it has to do with "ordering" the taxonomy files during that fixing step, but I haven't tested further since I had other projects to work on. I might come back to this if you are making progress and need help!
The humann_uniref50_annotations.tsv.gz file only has a subset of the data. You would need to use the source_taxonomy.tsv.gz file. What is the MMSEQS2 command used once you have the taxdump files? I can look into this today and test on a few sets.
First, you create the database with the reference fasta:
mmseqs createdb sequence.fasta sequenceDB
Then:
mmseqs createtaxdb sequenceDB tmp --ncbi-tax-dump ncbi-taxdump --tax-mapping-file taxidmapping
ncbi-taxdump and taxidmapping are the files that taxonkit outputs. It did not raise any error for me, but since you say the file I used was a subset, maybe that is the problem!
@franlat Excellent. This will be next on my to-do list on Monday.
e66f926af8418c8873d264785662420e c__Acantharea o__Arthracanthida f__Acanthometridae g__Amphilonche s__Amphilonche elongata
38c6e6957eabc901ba3b9c708a5e2eee c__Acantharea o__Arthracanthida f__Acanthometridae g__Amphilonche s__Amphilonche elongata
083441fc54d5eb9ecc0cd83490139b14 c__Acantharea o__Arthracanthida f__Phyllostauridae g__Phyllostaurus s__Phyllostaurus siculus
5df8df7ddea8c8316cfdfc02e9ceb988 c__Acantharea o__Arthracanthida f__Phyllostauridae g__Phyllostaurus s__Phyllostaurus siculus
When you used this file, was your taxonkit command the following?
cat humann_uniref50_annotations.tsv.gz \
| gzip -d \
| csvtk fix -t \
| csvtk cut -t -f id_source,class,order,family,genus,species \
| taxonkit create-taxdump -A 1 -O taxdump --rank-names class,order,family,genus,species
My understanding is that the above command would generate the ncbi-taxdump files. What is the format for the taxidmapping file?
Yes, exactly like that! The taxidmapping file is the taxid.map that taxonkit creates. It looks like this:
e66f926af8418c8873d264785662420e 236192150
38c6e6957eabc901ba3b9c708a5e2eee 236192150
083441fc54d5eb9ecc0cd83490139b14 1443417319
5df8df7ddea8c8316cfdfc02e9ceb988 1443417319
84c8d9b87b512f3fabe68675d6b12bec 1443417319
413a5978d14b93c1b640172214bee751 1443417319
15e1cf336625fec63e2a77d06b107e11 1443417319
839345a945933f604310b98bd9b3aa35 1443417319
84f696d32ebc6504e7f34801e7b2174c 1296643208
For these:
e66f926af8418c8873d264785662420e 236192150
38c6e6957eabc901ba3b9c708a5e2eee 236192150
083441fc54d5eb9ecc0cd83490139b14 1443417319
5df8df7ddea8c8316cfdfc02e9ceb988 1443417319
84c8d9b87b512f3fabe68675d6b12bec 1443417319
413a5978d14b93c1b640172214bee751 1443417319
15e1cf336625fec63e2a77d06b107e11 1443417319
839345a945933f604310b98bd9b3aa35 1443417319
84f696d32ebc6504e7f34801e7b2174c 1296643208
Can the values in the second column be source identifiers instead of NCBI taxids? Unfortunately, not all of the organisms have an ncbi_taxon identifier, but they all have unique source identifiers.
These are not NCBI taxids, they are IDs that taxonkit creates automatically. So you don't have to worry about it, I think.
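Since unassigned results can also come from sequences that never made it into taxid.map, it may be worth sanity-checking the mapping coverage before building the database. A minimal sketch, with made-up file names and IDs standing in for the real files (the real taxid.map is tab-separated):

```shell
# Sanity check: every sequence ID in the fasta should have a row in
# taxid.map, otherwise MMseqs2 leaves that sequence unclassified.
# File names, IDs, and sequences below are made up for illustration.
printf '%s\t%s\n' \
    e66f926af8418c8873d264785662420e 236192150 \
    38c6e6957eabc901ba3b9c708a5e2eee 236192150 > taxid.map

printf '>%s\nMSTAVLEN\n' \
    e66f926af8418c8873d264785662420e \
    38c6e6957eabc901ba3b9c708a5e2eee \
    deadbeefdeadbeefdeadbeefdeadbeef > sequences.fasta

# List fasta IDs that are missing from the mapping:
grep '^>' sequences.fasta | sed 's/^>//' | sort > fasta_ids.txt
cut -f1 taxid.map | sort > mapped_ids.txt
comm -23 fasta_ids.txt mapped_ids.txt
```

Any IDs this prints would come back unclassified no matter what, so it separates "missing from the mapping" from real ordering problems.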
Ok, I built it. Our servers are undergoing maintenance today but I'll try testing it out today or tomorrow when they are available.
Just a heads up: I've been able to build it, but I'm trying to figure out exactly how many resources I need to run mmseqs easy-taxonomy, because it keeps failing even when I set --split-memory-limit to 75% of the memory I'm allocating. Circling back to this here and there.
I usually run just 'mmseqs taxonomy'. Were you able to make it work?
No, not yet. I even tried it with a much smaller subset but still got the same error. Here's the MMseqs2 issue: https://github.com/soedinglab/MMseqs2/issues/779
There might be another option:
(VEBA-binning-eukaryotic_env) [jespinoz@login02 output]$ metaeuk taxtocontig -h | head
usage: metaeuk taxtocontig <i:contigsDB> <i:predictionsFasta> <i:predictionsFasta.headersMap.tsv> <i:taxAnnotTargetDb> <o:taxResult> <tmpDir> [options]
By Eli Levy Karin <[email protected]>
options: prefilter:
--comp-bias-corr INT Correct for locally biased amino acid composition (range 0-1) [1]
--add-self-matches BOOL Artificially add entries of queries with themselves (for clustering) [0]
--seed-sub-mat TWIN Substitution matrix file for k-mer generation [nucl:nucleotide.out,aa:VTML80.out]
-s FLOAT Sensitivity: 1.0 faster; 4.0 fast; 7.5 sensitive [4.000]
-k INT k-mer length (0: automatically set to optimum) [0]
--k-score INT k-mer threshold for generating similar k-mer lists [2147483647]
--alph-size TWIN Alphabet size (range 2-21) [nucl:5,aa:21]
Hi! I'm glad to see you are still working on it. I am still very interested! I just checked metaeuk taxtocontig, but I thought it used mmseqs2-like databases. Isn't it therefore the same problem as building the mmseqs2 database?
I recently built a MMSEQS2 database for the new marine DB called MarFERReT. It is very similar to EukProt, but designed to be more specific with populations and species. The advantage over EukProt is that it includes NCBI taxids, so building the taxonomy for mmseqs2 was much easier, as you can use the NCBI taxonomy files directly. Perhaps one solution is to give each entry of the VEBA microeukaryotic database a taxid. I'm not sure it is feasible, since it might be a lot of work.
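For reference, the NCBI-taxid route is sketched below; file names are placeholders, and as far as I know createtaxdb downloads the official NCBI taxdump itself when --ncbi-tax-dump is not given:

```shell
# Build a taxonomy-aware MMseqs2 DB when entries already carry NCBI taxids
# (as in MarFERReT). acc2taxid.map is a hypothetical two-column
# <accession><tab><taxid> file; no custom taxdump is needed.
mmseqs createdb marferret.fasta marferretDB
mmseqs createtaxdb marferretDB tmp --tax-mapping-file acc2taxid.map
```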
I've thought about that, but the problem is that some of the databases I'm pulling from unfortunately do not have NCBI taxids for their entries. So it was the classic strict vs. relaxed trade-off. From my understanding, the way MetaEuk and MMseqs2 classify contigs differs a little in terms of the database that is used, but I'll definitely check it out.
In the meantime, here's the current source taxonomy table for MicroEuk100_v3: source_taxonomy.tsv.gz
I think I made it work? I used the source taxonomy file that you provided in the last comment and ran taxonkit to generate the taxdump files. The taxonkit manual says that you have to order it, but I didn't this time. I ran some tests, and it turned out okay. Still, many contigs were unclassified, but some were assigned correctly to what I was expecting. The biggest issue was memory usage: for a 2.6 GB fasta file, mmseqs2 needed 180G of memory.
The files are too big for me to upload here, but I can try a wetransfer if you need them. Here's what the file I gave taxonkit as input looks like:
df72f5b70f98518d320317439c41c753 Dothideomycetes Pleosporales Pleosporaceae Alternaria
ec750542edad93ce3f0770b338d8b761 Dothideomycetes Pleosporales Pleosporaceae Alternaria
336c7b7529ee5ddc5f30f79c7f2e79b9 Dothideomycetes Pleosporales Pleosporaceae Alternaria
ef3d0ea5c2258d0b5e467b53ea1078aa Dothideomycetes Pleosporales Pleosporaceae Alternaria
This is a great development. One parameter that you can use to get around this memory issue is --split-memory-limit:
--split-memory-limit BYTE Set max memory per split. E.g. 800B, 5K, 10M, 1G. Default (0) to all available system memory [0]
Some potential issues around this (that I think have been resolved): https://github.com/soedinglab/MMseqs2/issues/338
I typically allocate 36G to the split memory limit, and when I request resources from the servers I ask for 48G, just in case it goes over a little bit.
The taxonkit manual says that you have to order it, but I didn't this time.
How did you order it?
The taxonkit manual says that mmseqs2 databases require the taxdump to be ordered, and it provides a script to do that, but the script doesn't work as intended. This time, I built the database directly with the taxid mapping file that taxonkit creates. Indeed, I didn't use the whole source taxonomy last time, so that might have been the issue. In any case, I will continue using this version and see whether anything strange comes up.
Hello! I've been testing contig taxonomic assignment with MMSEQS easy-taxonomy and the VEBA microeuk v3 database (built as I described in my last comments). I've been comparing it against the MarFERReT db using the same approach, and the results are very similar. So, it works great! I guess I'll have to build it for the new database version!
I tested running the eukaryotic classification environment (it's still running the first version of the database, not v3) versus using mmseqs2 easy-taxonomy and summarizing the output manually. Using mmseqs2 turned out more accurate, but perhaps that is because the database versions differ. I guess this can also be used to assess "contamination", too.
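For what it's worth, "summarizing the output manually" can be as simple as counting contigs per assigned taxon from the *_lca.tsv file that easy-taxonomy writes (its first columns are query, taxid, rank, and name). A sketch on made-up rows:

```shell
# Fake a tiny *_lca.tsv (tab-separated: query, taxid, rank, name);
# the contig names and taxids here are invented for illustration.
printf '%s\t%s\t%s\t%s\n' \
    contig_1 236192150  species   'Amphilonche elongata' \
    contig_2 236192150  species   'Amphilonche elongata' \
    contig_3 0          'no rank' unclassified \
    contig_4 1443417319 species   'Phyllostaurus siculus' > result_lca.tsv

# Count contigs per assigned taxon, most frequent first:
cut -f4 result_lca.tsv | sort | uniq -c | sort -rn
```

The unclassified count from the same one-liner doubles as a rough contamination/coverage gauge per sample.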
I would like to know your insights on this matter!