Incorrect taxonomy assignment
First of all, diamond is a great tool, thank you for building and maintaining it! I run into a weird issue while working with a custom, Genbank-derived sequence and taxonomy database, namely that the taxonomic assignment of the first non-Genbank sequence in that database is wrong. Here's the breakdown of what I'm doing:
- Download FASTA seqs of a few thousand Genbank proteomes (subsampled for broad but not deep coverage) and the complete NCBI taxonomy data.
- Append some 10K FASTA seqs from a single proteome (sequenced in-house) to the FASTA dataset from step1
- Create a custom taxon mapping file for the sequences from the custom proteome, all pointing to a particular fungal lineage from NCBI Taxonomy (nearest neighbor based on some BLAST searches). Append that information to the NCBI's prot.accession2taxid mapping downloaded in step1, and add the "accession.version taxid" header.
- Create a custom diamond DB by running
diamond makedbusing the merged sequences from step2, merged taxon mapping from step3 and complete taxon nodes and taxon names data from NCBI from step1 - Search the newly-created custom DB from step4 by running
diamond blastpusing the first sequence of the custom proteome as query.
The expected result would be the query sequence perfectly matching itself in the database, right? Well, it does, but only in terms of the hit name, evalue, bitscore and alignment stats (as compared to reference results from running a separate 2seq self-BLAST). When I try to look at the taxonomy information, either by running --outfmt 6 with staxids/sscinames or outright go for --outfmt 102 the top result points to NCBI TaxID 228908 (Nanoarchaeum equitans Kin4-M) for some reason. Of course, I double checked, and my taxon mapping for all the custom seqs points to the proper fungal lineage. Interestingly (and somewhat reassuringly) querying sequences other than the first one does report the correct fungal lineage.
Here's the fun stuff - if I remove that offending first sequence from the source FASTA and taxon mapping file and then rebuild the database from scratch, using the formerly-second-and-now-first sequence as query results in the wrong tax lineage being reported (again, hit and alignment stats are correct, but the tax information is 228908 again). This happens even though using that same query sequence (totally different from the old first seq in terms of composition and length) earlier returned the correct fungal lineage when it was the second seq in the custom proteome! Finally, to make things even weirder, if I swap the position of the first and second sequence in the source FASTA and taxon mapping file (so first is second and second is first) and then rebuild the DB, it's the first sequence again that returns the wrong tax information, even though it actually is the second seq in the proteome data. I presume this might be caused by FASTA header parsing/processing somehow taking into account internal IDs?
I am sorry this might be quite convoluted, I am happy to clarify whatever is needed. I am puzzled as to why this happens and would appreciate some insight as to a potential fix, because I am using the taxonomy information heavily in my work. (I am using diamond v2.0.15.153, if that matters).
I don't see why this should happen. Would it be possible to send me the necessary files to reproduce the problem?
Unfortunately, I cannot send you the actual files I am using because of certain restrictions, but I'll see if I can distill this into a smaller and more narrow and replicable problem using just public Genbank data and get back to you.
EDIT: What I find interesting though, is that protein sequences mapped to 228908 are the very first ones listed in the taxon mapping file, so maybe there's something about it.
@bbuchfink Ok, I think I was able to distill this issue. Here's the process:
- Download and unpack the proteome data for Saccharomyces cerevisiae - (https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/146/045/GCF_000146045.2_R64/GCF_000146045.2_R64_protein.faa.gz
- Download and unpack proteome data for Aspergillus caelatus - https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/193/585/GCF_009193585.1_Aspcae1/GCF_009193585.1_Aspcae1_protein.faa.gz
- Create taxon mapping files for each, and give them two different TaxIDs (I used 5555 for Saccharomyces and 9999 for Aspergillus, but it doesn't really matter)
- Merge the taxon mapping files, Saccharomyces goes first and Aspergillus goes second.
- Merge the proteome FASTA files, again Saccharomyces goes first and Aspergillus goes second.
- Create a diamond database using the merged FASTA file, merged taxon map and taxnames and taxnodes from the official NCBI taxonomy taxdump (https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/)
- Copy the first sequence of the Aspergillus proteome and paste it under a different title to a separate FASTA file.
- Query the database from step6 using the sequence from step7 with taxonomic output (
-outfmt 6withstaxidsor-outfmt 102)
When I just run this, the TaxID of the top hit was 5555 rather than the expected 9999. If I replace the TaxID of the first Saccharomyces sequence in the mapping file to, say, 7777, then rebuild the database and rerun the same query, it will output 7777 as the TaxID of the top hit.
I can attach the actual files (FASTAs, taxmaps, and the DB itself) since they are relatively small, if you still want them, but I thought it'd be better to delineate the process explicitly for more reproducibility.
Also, not sure if it matters, but I am running this analysis on a Mac, with Diamond installed through Conda.
When I just run this, the TaxID of the top hit was 5555 rather than the expected 9999.
I tried to follow your procedure, but I got 9999 at this step. It would be helpful if you send me your files.
When I just run this, the TaxID of the top hit was 5555 rather than the expected 9999.
I tried to follow your procedure, but I got 9999 at this step. It would be helpful if you send me your files.
Ok, should I just upload there here or email directly to you?
Either way is fine.
Sure, here are my files, should be enough I hope, let me know if you want anything else. db_test.dmnd.zip (just realized the DB file was actually build from the taxmap with the TaxID of the first entry edited to 7777) merged_map.txt.zip merged_seqs.fas.zip
and this is the outcome I get from querying the above DB with the first Aspergillus sequence (hgt is the conda env for my analyses):
(hgt) $ diamond blastp -q ./seq1.fas --db db_test.dmnd --outfmt 6 qseqid sseqid evalue bitscore staxids
test1 GCF_009193585_1_Aspcae1_protein_faa1.1 1.25e-64 192 7777
test1 GCF_009193585_1_Aspcae1_protein_faa1.1 1.25e-64 192 9999
test1 GCF_009193585_1_Aspcae1_protein_faa1.1 1.25e-64 192 9999
test1 GCF_009193585_1_Aspcae1_protein_faa1.1 1.25e-64 192 9999
I can reproduce the issue using your files. The reason is still not apparent to me, I'll need some time to track this down.
I hear ya, would it help if I attach the list of packages/versions (and/or a pip freeze) present in my conda env? For some reason I am thinking it is a MacOS thing, but could be a rogue package somewhere..
No, like I said I can reproduce the issue already (on my Linux system).
I mentioned that in case there was something on my end that resulted in a malformed DB (MacOS using different newline terminators or whatnot), but if you can recreate it and get the same result with this data then yeah, the issue seems to be host-agnostic.
Hey @bbuchfink , just wanted to check in to see if you were able to figure this out? If I can help in any way, just let me know!
Sorry, I didn't have time yet to look into it again. It could still take some time until I provide a fix.
The issue should be fixed in v2.1.1, at least it is working correctly now for the example you sent me. I'm not sure what the problem was, but the affected code was changed since then and that seems to have resolved it.
Thanks, appreciate it!