MMseqs2
MMseqs2 copied to clipboard
gtdb taxonomy build
Expected Behavior
map queries to GTDB then get taxonomy lca of best hits
Current Behavior
-all taxonomy results are "unclassified" when mapping to GTDB -the gtdb_mapping file has only a single column (0..256163114) which may be the problem -I think the gtdb.lookup table may be wrong.
Steps to Reproduce (for bugs)
-download gtdb.tar.gz
mmseqs createdb tardb gtdb
-execute shell commands in databases.sh to create nodes.dmp, names.dmp, mapping_genomes (these files all seem to be fine)
mmseqs createtaxdb gtdb tmp/taxdb --ncbi-tax-dump tmp/taxonomy --tax-mapping-file "tmp/taxonomy/mapping_genomes" --tax-mapping-mode 1
MMseqs Output (for bugs)
-no errors or warnings
createtaxdb gtdb tmp/taxdb --ncbi-tax-dump tmp/taxonomy --tax-mapping-file tmp/taxonomy/mapping_genomes --tax-mapping-mode 1
MMseqs Version: 14.7e284
NCBI tax dump directory tmp/taxonomy
Taxonomy mapping file tmp/taxonomy/mapping_genomes
Taxonomy mapping mode 1
Taxonomy db mode 1
Threads 28
Verbosity 3
if I understand correctly, the gtdb.lookup file should have 3 columns: proteinNumber (0.. 256163114), proteinID, sourceGenomeNumber :
0 NZ_CP009512.1_1 1
1 NZ_CP009512.1_33 1
2 NZ_CP009512.1_65 1
3 NZ_CP009512.1_97 1
4 NZ_CP009512.1_129 1
5 NZ_CP009512.1_161 1
6 NZ_CP009512.1_193 1
7 NZ_CP009512.1_225 1
8 NZ_CP009512.1_257 1
9 NZ_CP009512.1_289 1
10 NZ_CP009512.1_321 1
however, my gtdb.lookup file has entries from different genomes pointing to the same source genome number. e.g.
0 NZ_CP009512.1_1 1
100 NZ_CP009512.1_3201 1
101 NZ_CP009512.1_3233 1
102180795 NZ_JAAOIZ010000001.1_5 1 #should point to source genome 65537
102180796 NZ_JAAOIZ010000001.1_37 1 #should point to source genome 65537
102181002 JAJXZV010000318.1_2 2 #should point to source genome 65538
102181003 JAJXZV010000021.1_4 2 #should point to source genome 65538
104066383 NC_007181.1_8 2
104066384 NC_007181.1_40 2
the highest source genome number found in the gtdb.lookup file is 65535, and then it seems to start over at 1 duplicating ids...?
Your Environment
MMseqs2 Version: 14.7e284 tried with both latest binary download and conda install Linux my.local 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
hmm, I just noticed that the --max-seq-len search parameter is 65535...
fyi in case anyone else is having this problem... I was able to successfully run the GTDB taxonomy after manually re-making the gtdb.lookup file and a gtdb.taxid.mapping file, and running createtaxdb with this taxid mapping file.
@mlcoleman
I have the same problem after executing the shell commands in download.sh
manually.
For the gtdb.lookup
file, did you just fix the third column to not start at 1 again after 65535? How did you create and use the gtdb.taxid.mapping
file (do you mean adding the tax-ids to gtdb_mapping
which has only 1 column now?):
I would make a protein-id vs genome table and then (fasta headers from protein_faa_reps
) and merge with the proteins IDs in the new gtdb.lookup
and the tax-ids in tmp/taxonomy/mapping_genomes
edit: I renumbered the genome ids in gtdb.lookup
to go from 1 to 85205 and then replaced the gtdb_mapping
file with one that links each protein-id (column1) to the tax-id of the corresponding genome (column2), tax-ids are from taxonomy/mapping_genomes
. I hope that is what it is supposed to be
I encountered the same issue while working with mmseqs version 14-7e284 and GTDB r214, using a Linux server (Ubuntu 22.04 LTS; CPU with SSE/AVX2 support and 512 GB RAM). Although there is a lot of overlap with the post by @mlcoleman previously and the discussion by @mherold1 , I include some details below about the problems I ran into and how I addressed them.
Problem description
After making a seqdb for the GTDB, there seems to be a collision of genome IDs between different genomes in the seqdb's gdtb.lookup
file. As mentioned by @mlcoleman , these ID collisions seem to disrupt mmseqs taxonomy workflows downstream.
Steps to reproduce
- Download
gtdb.tar.gz
as done in thedata/workflow/databases.sh
script (or you can just run:wget -o gdtb.tar.gz https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/genomic_files_reps/gtdb_proteins_aa_reps.tar.gz
) - Import
gtdb.tar.gz
as a tardb viammseqs tar2db gtdb.tar.gz tardb --tar-include 'faa.gz$'
- Run
mmseqs createdb tardb gtdb
mmseqs createdb
finishes without error, but the resulting gtdb.lookup
file seems to assign the wrong genome IDs (resulting in ID collisions) for any genomes with an ID over 65535. Meanwhile, the correct IDs seem to be assigned in gtdb.source
.
For example:
Last few lines of gtdb.source
(tail gtdb.source
):
85195 RS_GCF_014201515.1
85196 RS_GCF_003977665.1
85197 RS_GCF_007922615.2
85198 GB_GCA_002389765.1
85199 RS_GCF_001440475.1
85200 GB_GCA_015059865.1
85201 GB_GCA_900546895.1
85202 GB_GCA_017621015.1
85203 GB_GCA_015486425.1
85204 GB_GCA_934565415.1
It seems like the left column is the genome ID, and the right column is the genome name in the GTDB. Everything seems OK with this file.
Last few lines of gtdb.lookup
(tail gtdb.lookup
)
256163105 CAKTHQ010000047.1_16 19668
256163106 CAKTHQ010000050.1_5 19668
256163107 CAKTHQ010000052.1_8 19668
256163108 CAKTHQ010000056.1_1 19668
256163109 CAKTHQ010000058.1_10 19668
256163110 CAKTHQ010000061.1_12 19668
256163111 CAKTHQ010000065.1_4 19668
256163112 CAKTHQ010000070.1_6 19668
256163113 CAKTHQ010000075.1_3 19668
256163114 CAKTHQ010000082.1_2 19668
It seems like the left column is the ORF ID, the middle column is the contig name, and the right column is the genome ID.
This file seems problematic. I confirmed (via the NCBI website) that contig CAKTHQ010000082.1
(from which the ORFs were derived in gtdb.lookup
above) comes from assembly GCA_934565415.1
. This assembly is shown on the last line of gtdb.source
above and has an ID of 85204 in gtdb.source
. However, in gtdb.lookup
, ORFs on the contig seem to have the erroneous genome ID of 19668, which matches another genome in gtdb.source
. I assume that the numbers on the right side of the gtdb.lookup
shown above should be 85204.
Although not shown above, the genome ID counter seems to re-start in gtdb.lookup
(at 0) after 65535. Note, for example, that 19668 + 1 + 65535 = 85204
.
Potential causes
The issue seems to have something to do with the mmseqs createdb
step, but I have not looked into the code for this command. As noted by @mlcoleman , it's interesting that 65535 is the same as the --max-seq-len
param.
Consequences
If the default GTDB seqdb is used to make a taxdb (without correcting the gtdb.lookup
file), then taxonomy results become ambiguous or incorrect. For example, here are the taxonomies of some contigs run through mmseqs easy-taxonomy
with the default GTDB seqtaxdb:
contig_1 2 superkingdom Bacteria 77 69 59 0.850 d_Bacteria
contig_2 2 superkingdom Bacteria 66 61 54 0.890 d_Bacteria
contig_3 2 superkingdom Bacteria 54 47 45 0.960 d_Bacteria
contig_4 2 superkingdom Bacteria 53 51 47 0.920 d_Bacteria
contig_5 2 superkingdom Bacteria 52 48 44 0.910 d_Bacteria
contig_6 2 superkingdom Bacteria 57 50 46 0.930 d_Bacteria
contig_7 2 superkingdom Bacteria 59 54 52 0.980 d_Bacteria
If the gtdb.lookup
file is corrected (as described under "Possible workaround" below), the classifications match what I would expect for the input contigs. (Sorry, I don't show the corrected classification data for these contigs here because I am working with unpublished data at the moment.)
Possible workaround
As discussed by @mlcoleman , after running mmseqs createdb tardb gtdb
, I was able to fix the gtdb.lookup
file so that the genome IDs in this file corresponded to the genome IDs used in gtdb.source
. (Unlike @mherold1 , I set the genome IDs to start from 0, because this was the format used in gtdb.source
.)
To map the genome IDs in gtdb.source
to those in gtdb.lookup
, I first made a table mapping each ORF ID to each genome ID via the gtdb.tar.gz
tarball, as follows:
Create the following bash script, get_orf_names.sh
:
#!/usr/bin/env bash
# get_orf_names.sh
# Helper script to get ORF names from a gtdb_proteins_aa_reps.tar.gz file
# This script expects STDIN to be a gzipped FAA file for one genome; $1 should be the filename or path of the FAA file
# Output is a tab-separated table (no header) with two columns: genome name and ORF name
set -euo pipefail
tar_filename=$1
genome_name="${tar_filename##*/}"
genome_name="${genome_name%_protein.faa.gz}"
#zcat - | \
pigz -cd -p 8 - | \
awk -v gname="${genome_name}" \
'{ if ($0 ~ /^>/) { gsub("^>", ""); print gname "\t" $1 }}'
Give this exec permissions: chmod 755 get_orf_names.sh
.
Then run:
tar -xzf gtdb.tar.gz \
--wildcards "*_protein.faa.gz" \
--to-command './get_orf_names.sh $TAR_FILENAME' \
> genome_orf_mapping.tsv
Format of the resulting file (head genome_orf_mapping.tsv
):
RS_GCF_000970205.1 NZ_CP009512.1_1
RS_GCF_000970205.1 NZ_CP009512.1_2
RS_GCF_000970205.1 NZ_CP009512.1_3
RS_GCF_000970205.1 NZ_CP009512.1_4
RS_GCF_000970205.1 NZ_CP009512.1_5
RS_GCF_000970205.1 NZ_CP009512.1_6
RS_GCF_000970205.1 NZ_CP009512.1_7
RS_GCF_000970205.1 NZ_CP009512.1_8
RS_GCF_000970205.1 NZ_CP009512.1_9
RS_GCF_000970205.1 NZ_CP009512.1_10
You can then do a JOIN operation between the gtdb.lookup
table, the genome_orf_mapping.tsv
table, and the gtdb.source
table to link the genome IDs in gtdb.source
to gtdb.lookup
. Save an output file with the same format as gdtb.lookup
but with the genome IDs from gtdb.source
. Use this new file to replace the old gtdb.lookup
. Here is some rough python/pandas code to do this (note this is slow and uses ~100 GB of RAM):
import pandas as pd
lookup_data = pd.read_csv('gtdb.lookup', sep='\t', header=None)
lookup_data.columns = ['gene-id','gene-name','genome-id']
source_data = pd.read_csv('gtdb.source', sep='\t', header=None)
source_data.columns = ['genome-id','genome-name']
genome_orf_mapping = pd.read_csv('genome_orf_mapping.tsv', sep='\t', header=None)
genome_orf_mapping.columns = ['genome-name','gene-name']
# Perform the JOIN
lookup_data_corrected = lookup_data.rename(columns={'genome-id':'genome-id-old'})\
.merge(genome_orf_mapping, on='gene-name', how='left', validate='m:1')\
.merge(source_data, on='genome-name', how='left', validate='m:1')
lookup_data_gene_corrected.drop(columns=['genome-id-old','genome-name'])\
.to_csv('gtdb.lookup.corrected', sep='\t', header=None, index=False)
# Replace gtdb.lookup with this newly saved file
After making this fix, I could run mmseqs createtaxdb
and get a taxdb that seemed to work fine with the mmseqs easy-taxonomy
pipeline, matching my expected results (as mentioned under "Consequences" above). Maybe because I corrected gdtb.lookup
before making the taxdb, I didn't need to edit any other mapping files as mentioned by @mlcoleman .
I can't guarantee this workaround is 100% correct because I do not understand all the database files generated by mmseqs. However, the workaround seems to do the trick for my data for the time being. Hope this helps before a longer-term solution is ready.
I fixed the taxonomy issue, we were using an unsigned short for the source file names. This changes was introduced a while ago and I had forgotten that we use this mechanism in GTDB and that this number can become large.
Taxonomy should work now and thanks to @apcamargo downloads should also work (#742).
If you have an existing gtdb.lookup file, the following awk
script should fix it in theory:
awk 'BEGIN { threshold = 65535 } $3 == threshold && increment == 0 { increment = 1 } $3 == 0 && increment == 1 { increment = 0; overflow++
} { print $1"\t"$2"\t"$3 + (overflow * (threshold + 1)); }' gtdb.lookup > gtdb_fixed.lookup
mv gtdb_fixed.lookup gtdb.lookup
However, I might have gotten it also wrong. I'd recommend to redownload with the latest MMseqs2 with the source fix included.
@milot-mirdita Thanks for the quick response and for the data type fix! I tested the latest master branch, and the gtdb.lookup
file seems to be correct now after running mmseqs databases gtdb results tmp
. Taxonomy results look as expected when running mmseqs easy-taxonomy
using this updated GTDB database. (I did not test the awk code you included, but thanks for writing this.)
I think it's OK to close this issue now. Thanks again.