MMseqs2 icon indicating copy to clipboard operation
MMseqs2 copied to clipboard

gtdb taxonomy build

Open mlcoleman opened this issue 1 year ago • 6 comments

Expected Behavior

map queries to GTDB then get taxonomy lca of best hits

Current Behavior

-all taxonomy results are "unclassified" when mapping to GTDB -the gtdb_mapping file has only a single column (0..256163114) which may be the problem -I think the gtdb.lookup table may be wrong.

Steps to Reproduce (for bugs)

-download gtdb.tar.gz mmseqs createdb tardb gtdb -execute shell commands in databases.sh to create nodes.dmp, names.dmp, mapping_genomes (these files all seem to be fine) mmseqs createtaxdb gtdb tmp/taxdb --ncbi-tax-dump tmp/taxonomy --tax-mapping-file "tmp/taxonomy/mapping_genomes" --tax-mapping-mode 1

MMseqs Output (for bugs)

-no errors or warnings

createtaxdb gtdb tmp/taxdb --ncbi-tax-dump tmp/taxonomy --tax-mapping-file tmp/taxonomy/mapping_genomes --tax-mapping-mode 1 

MMseqs Version:        	14.7e284
NCBI tax dump directory	tmp/taxonomy
Taxonomy mapping file  	tmp/taxonomy/mapping_genomes
Taxonomy mapping mode  	1
Taxonomy db mode       	1
Threads                	28
Verbosity              	3

if I understand correctly, the gtdb.lookup file should have 3 columns: proteinNumber (0.. 256163114), proteinID, sourceGenomeNumber :

0       NZ_CP009512.1_1 1
1       NZ_CP009512.1_33        1
2       NZ_CP009512.1_65        1
3       NZ_CP009512.1_97        1
4       NZ_CP009512.1_129       1
5       NZ_CP009512.1_161       1
6       NZ_CP009512.1_193       1
7       NZ_CP009512.1_225       1
8       NZ_CP009512.1_257       1
9       NZ_CP009512.1_289       1
10      NZ_CP009512.1_321       1

however, my gtdb.lookup file has entries from different genomes pointing to the same source genome number. e.g.

0       NZ_CP009512.1_1 1
100     NZ_CP009512.1_3201      1
101     NZ_CP009512.1_3233      1
102180795       NZ_JAAOIZ010000001.1_5  1  #should point to source genome 65537
102180796       NZ_JAAOIZ010000001.1_37 1  #should point to source genome 65537
102181002       JAJXZV010000318.1_2     2  #should point to source genome 65538
102181003       JAJXZV010000021.1_4     2  #should point to source genome 65538
104066383       NC_007181.1_8   2
104066384       NC_007181.1_40  2

the highest source genome number found in the gtdb.lookup file is 65535, and then it seems to start over at 1 duplicating ids...?

Your Environment

MMseqs2 Version: 14.7e284 tried with both latest binary download and conda install Linux my.local 3.10.0-1160.76.1.el7.x86_64 #1 SMP Wed Aug 10 16:21:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

mlcoleman avatar Jul 12 '23 17:07 mlcoleman

hmm, I just noticed that the --max-seq-len search parameter is 65535...

mlcoleman avatar Jul 16 '23 01:07 mlcoleman

fyi in case anyone else is having this problem... I was able to successfully run the GTDB taxonomy after manually re-making the gtdb.lookup file and a gtdb.taxid.mapping file, and running createtaxdb with this taxid mapping file.

mlcoleman avatar Jul 16 '23 01:07 mlcoleman

@mlcoleman I have the same problem after executing the shell commands in download.sh manually. For the gtdb.lookup file, did you just fix the third column to not start at 1 again after 65535? How did you create and use the gtdb.taxid.mapping file (do you mean adding the tax-ids to gtdb_mapping which has only 1 column now?): I would make a protein-id vs genome table and then (fasta headers from protein_faa_reps) and merge with the proteins IDs in the new gtdb.lookup and the tax-ids in tmp/taxonomy/mapping_genomes edit: I renumbered the genome ids in gtdb.lookup to go from 1 to 85205 and then replaced the gtdb_mapping file with one that links each protein-id (column1) to the tax-id of the corresponding genome (column2), tax-ids are from taxonomy/mapping_genomes. I hope that is what it is supposed to be

mherold1 avatar Jul 24 '23 14:07 mherold1

I encountered the same issue while working with mmseqs version 14-7e284 and GTDB r214, using a Linux server (Ubuntu 22.04 LTS; CPU with SSE/AVX2 support and 512 GB RAM). Although there is a lot of overlap with the post by @mlcoleman previously and the discussion by @mherold1 , I include some details below about the problems I ran into and how I addressed them.

Problem description

After making a seqdb for the GTDB, there seems to be a collision of genome IDs between different genomes in the seqdb's gdtb.lookup file. As mentioned by @mlcoleman , these ID collisions seem to disrupt mmseqs taxonomy workflows downstream.

Steps to reproduce

  1. Download gtdb.tar.gz as done in the data/workflow/databases.sh script (or you can just run: wget -o gdtb.tar.gz https://data.ace.uq.edu.au/public/gtdb/data/releases/latest/genomic_files_reps/gtdb_proteins_aa_reps.tar.gz)
  2. Import gtdb.tar.gz as a tardb via mmseqs tar2db gtdb.tar.gz tardb --tar-include 'faa.gz$'
  3. Run mmseqs createdb tardb gtdb

mmseqs createdb finishes without error, but the resulting gtdb.lookup file seems to assign the wrong genome IDs (resulting in ID collisions) for any genomes with an ID over 65535. Meanwhile, the correct IDs seem to be assigned in gtdb.source.

For example:

Last few lines of gtdb.source (tail gtdb.source):

85195   RS_GCF_014201515.1
85196   RS_GCF_003977665.1
85197   RS_GCF_007922615.2
85198   GB_GCA_002389765.1
85199   RS_GCF_001440475.1
85200   GB_GCA_015059865.1
85201   GB_GCA_900546895.1
85202   GB_GCA_017621015.1
85203   GB_GCA_015486425.1
85204   GB_GCA_934565415.1

It seems like the left column is the genome ID, and the right column is the genome name in the GTDB. Everything seems OK with this file.

Last few lines of gtdb.lookup (tail gtdb.lookup)

256163105       CAKTHQ010000047.1_16    19668
256163106       CAKTHQ010000050.1_5     19668
256163107       CAKTHQ010000052.1_8     19668
256163108       CAKTHQ010000056.1_1     19668
256163109       CAKTHQ010000058.1_10    19668
256163110       CAKTHQ010000061.1_12    19668
256163111       CAKTHQ010000065.1_4     19668
256163112       CAKTHQ010000070.1_6     19668
256163113       CAKTHQ010000075.1_3     19668
256163114       CAKTHQ010000082.1_2     19668

It seems like the left column is the ORF ID, the middle column is the contig name, and the right column is the genome ID.

This file seems problematic. I confirmed (via the NCBI website) that contig CAKTHQ010000082.1 (from which the ORFs were derived in gtdb.lookup above) comes from assembly GCA_934565415.1. This assembly is shown on the last line of gtdb.source above and has an ID of 85204 in gtdb.source. However, in gtdb.lookup, ORFs on the contig seem to have the erroneous genome ID of 19668, which matches another genome in gtdb.source. I assume that the numbers on the right side of the gtdb.lookup shown above should be 85204.

Although not shown above, the genome ID counter seems to re-start in gtdb.lookup (at 0) after 65535. Note, for example, that 19668 + 1 + 65535 = 85204.

Potential causes

The issue seems to have something to do with the mmseqs createdb step, but I have not looked into the code for this command. As noted by @mlcoleman , it's interesting that 65535 is the same as the --max-seq-len param.

Consequences

If the default GTDB seqdb is used to make a taxdb (without correcting the gtdb.lookup file), then taxonomy results become ambiguous or incorrect. For example, here are the taxonomies of some contigs run through mmseqs easy-taxonomy with the default GTDB seqtaxdb:

contig_1        2       superkingdom    Bacteria        77      69      59      0.850   d_Bacteria
contig_2        2       superkingdom    Bacteria        66      61      54      0.890   d_Bacteria
contig_3        2       superkingdom    Bacteria        54      47      45      0.960   d_Bacteria
contig_4        2       superkingdom    Bacteria        53      51      47      0.920   d_Bacteria
contig_5        2       superkingdom    Bacteria        52      48      44      0.910   d_Bacteria
contig_6        2       superkingdom    Bacteria        57      50      46      0.930   d_Bacteria
contig_7        2       superkingdom    Bacteria        59      54      52      0.980   d_Bacteria

If the gtdb.lookup file is corrected (as described under "Possible workaround" below), the classifications match what I would expect for the input contigs. (Sorry, I don't show the corrected classification data for these contigs here because I am working with unpublished data at the moment.)

Possible workaround

As discussed by @mlcoleman , after running mmseqs createdb tardb gtdb, I was able to fix the gtdb.lookup file so that the genome IDs in this file corresponded to the genome IDs used in gtdb.source. (Unlike @mherold1 , I set the genome IDs to start from 0, because this was the format used in gtdb.source.)

To map the genome IDs in gtdb.source to those in gtdb.lookup, I first made a table mapping each ORF ID to each genome ID via the gtdb.tar.gz tarball, as follows:

Create the following bash script, get_orf_names.sh:

#!/usr/bin/env bash
# get_orf_names.sh
# Helper script to get ORF names from a gtdb_proteins_aa_reps.tar.gz file
# This script expects STDIN to be a gzipped FAA file for one genome; $1 should be the filename or path of the FAA file
# Output is a tab-separated table (no header) with two columns: genome name and ORF name
set -euo pipefail

tar_filename=$1

genome_name="${tar_filename##*/}"
genome_name="${genome_name%_protein.faa.gz}"

#zcat - | \
pigz -cd -p 8 - | \
  awk -v gname="${genome_name}" \
    '{ if ($0 ~ /^>/) { gsub("^>", ""); print gname "\t" $1 }}'

Give this exec permissions: chmod 755 get_orf_names.sh.

Then run:

tar -xzf gtdb.tar.gz \
    --wildcards "*_protein.faa.gz" \
    --to-command './get_orf_names.sh $TAR_FILENAME' \
  > genome_orf_mapping.tsv

Format of the resulting file (head genome_orf_mapping.tsv):

RS_GCF_000970205.1      NZ_CP009512.1_1
RS_GCF_000970205.1      NZ_CP009512.1_2
RS_GCF_000970205.1      NZ_CP009512.1_3
RS_GCF_000970205.1      NZ_CP009512.1_4
RS_GCF_000970205.1      NZ_CP009512.1_5
RS_GCF_000970205.1      NZ_CP009512.1_6
RS_GCF_000970205.1      NZ_CP009512.1_7
RS_GCF_000970205.1      NZ_CP009512.1_8
RS_GCF_000970205.1      NZ_CP009512.1_9
RS_GCF_000970205.1      NZ_CP009512.1_10

You can then do a JOIN operation between the gtdb.lookup table, the genome_orf_mapping.tsv table, and the gtdb.source table to link the genome IDs in gtdb.source to gtdb.lookup. Save an output file with the same format as gdtb.lookup but with the genome IDs from gtdb.source. Use this new file to replace the old gtdb.lookup. Here is some rough python/pandas code to do this (note this is slow and uses ~100 GB of RAM):

import pandas as pd

lookup_data = pd.read_csv('gtdb.lookup', sep='\t', header=None)
lookup_data.columns = ['gene-id','gene-name','genome-id']

source_data = pd.read_csv('gtdb.source', sep='\t', header=None)
source_data.columns = ['genome-id','genome-name']

genome_orf_mapping = pd.read_csv('genome_orf_mapping.tsv', sep='\t', header=None)
genome_orf_mapping.columns = ['genome-name','gene-name']

# Perform the JOIN
lookup_data_corrected = lookup_data.rename(columns={'genome-id':'genome-id-old'})\
  .merge(genome_orf_mapping, on='gene-name', how='left', validate='m:1')\
  .merge(source_data, on='genome-name', how='left', validate='m:1')

lookup_data_gene_corrected.drop(columns=['genome-id-old','genome-name'])\
  .to_csv('gtdb.lookup.corrected', sep='\t', header=None, index=False)
# Replace gtdb.lookup with this newly saved file

After making this fix, I could run mmseqs createtaxdb and get a taxdb that seemed to work fine with the mmseqs easy-taxonomy pipeline, matching my expected results (as mentioned under "Consequences" above). Maybe because I corrected gdtb.lookup before making the taxdb, I didn't need to edit any other mapping files as mentioned by @mlcoleman .

I can't guarantee this workaround is 100% correct because I do not understand all the database files generated by mmseqs. However, the workaround seems to do the trick for my data for the time being. Hope this helps before a longer-term solution is ready.

jmtsuji avatar Aug 24 '23 07:08 jmtsuji

I fixed the taxonomy issue, we were using an unsigned short for the source file names. This changes was introduced a while ago and I had forgotten that we use this mechanism in GTDB and that this number can become large.

Taxonomy should work now and thanks to @apcamargo downloads should also work (#742).

If you have an existing gtdb.lookup file, the following awk script should fix it in theory:

awk 'BEGIN { threshold = 65535 } $3 == threshold && increment == 0 { increment = 1 } $3 == 0 && increment == 1 { increment = 0; overflow++
 } { print $1"\t"$2"\t"$3 + (overflow * (threshold + 1)); }' gtdb.lookup > gtdb_fixed.lookup
mv gtdb_fixed.lookup gtdb.lookup

However, I might have gotten it also wrong. I'd recommend to redownload with the latest MMseqs2 with the source fix included.

milot-mirdita avatar Aug 25 '23 12:08 milot-mirdita

@milot-mirdita Thanks for the quick response and for the data type fix! I tested the latest master branch, and the gtdb.lookup file seems to be correct now after running mmseqs databases gtdb results tmp. Taxonomy results look as expected when running mmseqs easy-taxonomy using this updated GTDB database. (I did not test the awk code you included, but thanks for writing this.)

I think it's OK to close this issue now. Thanks again.

jmtsuji avatar Aug 28 '23 06:08 jmtsuji