sourmash icon indicating copy to clipboard operation
sourmash copied to clipboard

sketch names in GTDB database use NCBI taxonomy

Open ccbaumler opened this issue 1 year ago • 20 comments

I am working on this pangenome database idea at https://github.com/ctb/2022-database-covers/. I may have found an error in the taxonomic classification while trying to figure some stuff out with the scripts.

For example, from the gtdb-rs214.lineages.csv:

GCA_000398885.1,f,d__Bacteria,p__Pseudomonadota,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia ruysiae

When I extract that signature from the GTDB db:

== This is sourmash version 4.8.5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

---
signature filename: /home/baumlerc/2022-database-covers/ruysiae-or-coli.zip
signature: GCA_000398885.1 Escherichia coli KTE33 strain=KTE33, Esch_coli_KTE33_V1
source file: /dev/fd/63
md5: 5fa7a957bf01b4a8bfbc2b3265e79d34
k=21 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=0
size: 4489
sum hashes: 4489
signature license: CC0

loaded 1 signatures total, from 1 files

The GenBank link implies that this should be considered E. coli https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000398885.1/

ccbaumler avatar Feb 16 '24 20:02 ccbaumler

huh. Thanks for finding, @ccbaumler! We used make-gtdb-taxonomy.py file, which you can see here: https://github.com/sourmash-bio/database-releases/pull/3/files#diff-f2b14d2992941e5da6adff7c90f2d2ed91b1cca96a5a84ee8ca0e6e92feb03d7

https://github.com/sourmash-bio/database-releases/pull/3

bluegenes avatar Feb 16 '24 21:02 bluegenes

they probably changed it. they do that.

ctb avatar Feb 16 '24 22:02 ctb

I can rerun the script to create the lineage spreadsheet, but I don't see how a GenBank update could affect the GTDB and lineage file. We created those at the same time and they should be accurate to each other. The only way GenBank update could affect this is if we recreated one file and not the other.

ccbaumler avatar Feb 16 '24 23:02 ccbaumler

wait I don't understand. apologies if I'm missing something obvious 😓

GTDB is distinct from GenBank taxonomy. They use the same genome accessions, but the taxonomies are intentionally not concordant.

when you say "the genbank link implies..." that is something that genbank changes from time to time.

ctb avatar Feb 16 '24 23:02 ctb

here, the genbank link matches what's in the GTDB zipfile, which is an assembly record named based on the Genbank classification.

the GTDB folk have decided it's a different Escherichia, and the taxonomy file for GTDB assigns that GenBank access to s__Escherichia ruysiae.

Probably the most confusing thing I see here is that we did not actually rename the sketch in the zip file to match the GTDB classification, but kept it as the GenBank name... computers are hard 😭

ctb avatar Feb 16 '24 23:02 ctb

Thank you for walking through everything! I see my own misunderstanding clearly now.

Would it be best to update the signature name with the species rank name from the lineage spreadsheet and a unique identifier?

"Computers are hard." - Titus

ccbaumler avatar Feb 17 '24 00:02 ccbaumler

Would it be best to update the signature name with the species rank name from the lineage spreadsheet and a unique identifier?

sounds like work

ctb avatar Feb 17 '24 00:02 ctb

It's this kind of wisdom that keeps me showing up day after day.

ccbaumler avatar Feb 17 '24 00:02 ccbaumler

so, some slightly more sober thoughts ;).

I agree that it's confusing that when you find matches with search/gather in a database named 'gtdb', you get the NCBI genome names, which basically use taxonomic names from NCBI.

The options to fix this would appear to be, roughly -

  • rename the sketches in the gtdb databases to match the GTDB taxonomy. This is certainly do-able, but breaks the principle that we're sketching the genomes/using the genome names as they are given to us by the source - which is NCBI. It might make tracking down the source of genome sketches just one step harder, too. Here it helps that would leave the accessions the same, since GTDB uses the same accession information as GenBank.
  • remove the names altogether and just go with accession. This would be annoying to me (and presumably other users) because I like seeing slightly more informative output in sourmash gather and so on.
  • leave it as it is - no work needs to be done. Certainly the easiest thing to do :).

I'm tempted to leave this issue open (maybe changing the issue title), or maybe create a new one, and then see what we feel like doing the next time GTDB does a release and we need to update.

I agree it is mildly confusing as it is, but we've been doing this for years and you're the first person who has noticed - kudos on paying attention 😆 ! - so maybe it's a minor confusion in a tapestry of many larger confusions? :). As it is, GTDB mostly agrees with NCBI taxonomy at the family/genus level, too, so it's not that jarring a disconnect in practice.

ctb avatar Feb 17 '24 15:02 ctb

I think if we add more info into the signature name and discuss it in the documentation, it would alleviate any confusion. It could also show in the output that your return tax has some contradictions to investigate as well.

Last night I was looking through the script @bluegenes linked and the metadata files that were used to create the lineage files. I think we could update the sig names throughout the database to include the gtdb_genome_representative accession with the gtdb_taxonomy species name. Something like:

== This is sourmash version 4.8.5. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

---
signature filename: /home/baumlerc/2022-database-covers/ruysiae-or-coli.zip

signature: GCA_000398885.1 Escherichia coli KTE33 strain=KTE33, Esch_coli_KTE33_V1 (GTDB_GCA_021307345.1 GTDB_Escherichia ruysiae)

source file: /dev/fd/63
md5: 5fa7a957bf01b4a8bfbc2b3265e79d34
k=21 molecule=DNA num=0 scaled=1000 seed=42 track_abundance=0
size: 4489
sum hashes: 4489
signature license: CC0

loaded 1 signatures total, from 1 files

Appendix

First few columns of bac120_metadata_r214.tsv:

accession ambiguous_bases checkm_completeness checkm_contamination checkm_marker_count checkm_marker_lineage checkm_marker_set_count checkm_strain_heterogeneity coding_bases coding_density contig_count gc_count gc_percentage genome_size gtdb_genome_representative gtdb_representative gtdb_taxonomy
GB_GCA_000398885.1 0 98.61 0.04 1173 f__Enterobacteriaceae (UID5124) 336 0 3884601 84.7628338874941 220 2275196 50.6641441436627 4582906 GB_GCA_021307345.1 f d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia ruysiae

The two columns in bac120_taxonomy_r214.tsv:

GB_GCA_021307345.1 d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia ruysiae

ccbaumler avatar Feb 17 '24 16:02 ccbaumler

Back when I added NCBI genome calculation to wort I started building the name for the sig using these fields from assembly_summary_genbank.txt. At that point in time the only place we had to save this metadata was inside the signature, and it was this name that was pulled when reporting on gather (and was the solution for this item from Titus list:

remove the names altogether and just go with accession. This would be annoying to me (and presumably other users) because I like seeing slightly more informative output in sourmash gather and so on.

There is a whole discussion to be had on splitting "fast-changing" metadata (taxonomy, names) from "slow-changing" metadata/data (accession ID, the actual hashes in the minhash), and "name" is one of these fields that fall more in "fast" than "slow". But it is sort of the only one we have at the signature level.

But, turns out we have a better solution for collections nowadays: manifests!

We can add an extra column to the manifest for "preferred"/"overwrite" name, and use that (if present) when outputting in gather/search/any other place reporting names. Manifest are easy to change/update, don't mess with the signatures (so we can reuse NCBI sigs for GTDB, for example), and fit nicely into the "fast"/"slow" divide above.

luizirber avatar Feb 17 '24 17:02 luizirber

We can add an extra column to the manifest for "preferred"/"overwrite" name, and use that (if present) when outputting in gather/search/any other place reporting names. Manifest are easy to change/update, don't mess with the signatures (so we can reuse NCBI sigs for GTDB, for example), and fit nicely into the "fast"/"slow" divide above.

👍

ctb avatar Feb 17 '24 17:02 ctb

Thanks for all the great explanations, ideas and details!

One point I'm having trouble rectifying in my mind is the making signatures and taxonomies from GTDB data but naming the signature as NCBI taxonomic name. Intuitively, I would expect the sig "name" to be the GTDB taxonomic name.

ccbaumler avatar Feb 17 '24 18:02 ccbaumler

Separation of name and taxonomy! A sourmash superpower (in many ways) is that we can key multiple different taxonomies on the same accession. But the flip side is that the name (which contains the accession plus some human-readable text) is distinct from taxonomy.

(IMO the content is what matters mostly, anyway. Names are ephemeral. Taxonomies change. Content is (closer to) eternal!

ctb avatar Feb 17 '24 18:02 ctb

Truth, the representative genomic signature is the most important part of the database. While names are ephemeral, we are still using them. My understanding is we are using them in contradictory ways by making the database name and the signature names from two different sources.

And, for taxonomic profiling, would it be more intuitive for users reading their gather output for the database to be called NCBI instead of GTDB because the signature names derive from the NCBI and not the GTDB?

Then when using alternate taxonomic naming databases, they would expect to see some opposing taxonomies... ?

ccbaumler avatar Feb 17 '24 19:02 ccbaumler

And, for taxonomic profiling, would it be more intuitive for users reading their gather output for the database to be called NCBI instead of GTDB because the signature names derive from the NCBI and not the GTDB?

but 'gather' doesn't produce taxonomic output... tax metagenome does.

What we'd really want to call the database is something like gtdb-subset-of-ncbi. Which feels likely to be even more confusing.

ctb avatar Feb 17 '24 23:02 ctb

Please correct me if I misunderstood your comment about liking to see the output. I was referring to the fifth column of gather which returns the signature names of the database against the query.

If I use the GTDB sourmash database but it is returning NCBI taxonomies in that output... I don't know. Doesn't seem to be presenting the output intuitively in my mind. Especially when the output could differ from GTDB like the initial comment on this thread points out.

ccbaumler avatar Feb 17 '24 23:02 ccbaumler

yes, I understand.

someone needs to propose a specific solution, and then someone needs to do the work. The person doing the work generally gets a large say in which solution is chosen :).

ctb avatar Feb 18 '24 00:02 ctb

We can add an extra column to the manifest for "preferred"/"overwrite" name, and use that (if present) when outputting in gather/search/any other place reporting names. Manifest are easy to change/update, don't mess with the signatures (so we can reuse NCBI sigs for GTDB, for example), and fit nicely into the "fast"/"slow" divide above.

After talking through this with @ccbaumler a bit, I am endorsing this solution a bit more strongly -

  • renaming things is something we want to be able to do! adding an override name column to manifests strikes me as a good minimal change that enables that easily, and is also debuggable (as opposed to simply updating the name in a manifest and using that as a default override, which ...is not very explicit and hence not very debuggable)
  • if we do that, we should also consider adding a 'deprecated' boolean column per https://github.com/sourmash-bio/sourmash/issues/985#issuecomment-1065907568

ctb avatar Feb 23 '24 15:02 ctb

if we're going to upgrade manifest contents to include deprecated/removed and preferred_name, we could also:

  • think about deprecating filename in manifests, 'cause it's dumb
  • add missing fields like seed and license per https://github.com/sourmash-bio/sourmash/issues/1849
  • maybe add a notes field

this would (in my opinionated opinion) be distinct from adding other metadata fields, tags, and so on, as explored in my comments on https://github.com/sourmash-bio/branchwater/issues/18

ctb avatar Feb 23 '24 16:02 ctb