anvio
anvio copied to clipboard
[FEATURE] anvi-estimate-scg-taxonomy should accept external/internal genomes file
The need
If you want to estimate taxonomy on a lot of different genomes, right now the only way is to run anvi-estimate-scg-taxonomy
one at a time on each individual contigs database. We need a more high-throughput way of doing things.
The solution
anvi-estimate-scg-taxonomy
should accept an external genomes file (or an internal genomes one, why not) as input.
Beneficiaries
Everyone running anvi-estimate-scg-taxonomy
on lots and lots of genomes.
Hello,
There is a group of 20 people that agree with you right now. Please help.
Thanks
Hello again,
There is a group of 20 (more) people that agree (again) with you right now. Plz
Danke schon
Hello. Sorry for the multi-year delay on this. I implemented something in PR #2249. @FlorianTrigodet , could you test it please with your dataset and let me know how it goes?
OMG. I was working on this literally just now (with embarrassment associated with not implementing this all this time), and SAW YOUR PR WHICH IS CRAY 😂 Looking at your code I can tell that you chose a better path. I spent way too much effort on exceptions:
diff --git a/anvio/taxonomyops/scg.py b/anvio/taxonomyops/scg.py
index a8a47090b..b1a40c633 100644
--- a/anvio/taxonomyops/scg.py
+++ b/anvio/taxonomyops/scg.py
@@ -2,7 +2,7 @@
# -*- coding: utf-8
"""
Classes to setup remote SCG databases in local, use local databases to affiliate SCGs in anvi'o
-contigs databases with taxon names, and estimate taxonomy for genomes and metagneomes.
+contigs databases with taxon names, and estimate taxonomy for genomes and metagenomes.
"""
import os
@@ -21,7 +21,7 @@ import anvio.filesnpaths as filesnpaths
from anvio.errors import ConfigError
from anvio.dbops import ContigsDatabase, ContigsSuperclass
from anvio.drivers.diamond import Diamond
-from anvio.genomedescriptions import MetagenomeDescriptions
+from anvio.genomedescriptions import GenomeDescriptions, MetagenomeDescriptions
from anvio.taxonomyops import AccessionIdToTaxonomy
from anvio.taxonomyops import TaxonomyEstimatorSingle
@@ -30,6 +30,7 @@ from anvio.taxonomyops import PopulateContigsDatabaseWithTaxonomy
run_quiet = terminal.Run(log_file_path=None, verbose=False)
progress_quiet = terminal.Progress(verbose=False)
pp = terminal.pretty_print
+P = terminal.pluralize
__author__ = "Developers of anvi'o (see AUTHORS.txt)"
@@ -200,26 +201,82 @@ class SanityCheck(object):
"the parameter space of this program is like a mine field and we are very upset about it "
"as well).")
- if not self.contigs_db_path:
+ if not self.contigs_db_path and not self.external_genomes_txt:
raise ConfigError("For these things to work, you need to provide a contigs database for the anvi'o SCG "
- "taxonomy workflow :(")
-
- utils.is_contigs_db(self.contigs_db_path)
-
- scg_taxonomy_was_run = ContigsDatabase(self.contigs_db_path, run=run_quiet, progress=progress_quiet).meta['scg_taxonomy_was_run']
- scg_taxonomy_database_version = ContigsDatabase(self.contigs_db_path, run=run_quiet, progress=progress_quiet).meta['scg_taxonomy_database_version']
- if not scg_taxonomy_was_run:
- raise ConfigError("It seems the SCG taxonomy tables were not populated in this contigs database :/ Luckily it "
- "is easy to fix that. Please see the program `anvi-run-scg-taxonomy`.")
-
- if scg_taxonomy_database_version != self.ctx.scg_taxonomy_database_version:
- self.progress.reset()
- raise ConfigError("The SCG taxonomy database version on your computer (%s) is different than the SCG taxonomy database "
- "version to populate your contigs database (%s). Please re-run the program `anvi-run-scg-taxonomy` "
- "on your contigs-db." % (self.ctx.scg_taxonomy_database_version, scg_taxonomy_database_version))
+ "taxonomy workflow (or an external-genomes-txt file when applicable) :(")
+
+ if self.contigs_db_path and self.external_genomes_txt:
+ raise ConfigError("Please use either a single contigs-db, or an external-genomes file as input :(")
+
+ if self.external_genomes_txt and self.profile_db_path:
+ raise ConfigError("When you provide an external genomes file, anvi'o assumes that you have a bunch of single "
+ "genmes in that file for which you wish to estimate SCG taxonomy. In that case there is no "
+ "need for a profile database, but you have initiated this process with one. Anvi'o is "
+ "confuse. Please remove the profile-db from your request.")
+
+ # since the user can either initiate this class with a contigs-db or an external genomes file,
+ # we need to do the following sanity checks a bit more tactfully.
+ if self.contigs_db_path:
+ # working with a single contigs-db file
+ contigs_db = ContigsDatabase(self.contigs_db_path, run=run_quiet, progress=progress_quiet)
+ scg_taxonomy_was_run = contigs_db.meta['scg_taxonomy_was_run']
+ scg_taxonomy_database_version = contigs_db.meta['scg_taxonomy_database_version']
+ contigs_db.disconnect()
+
+ if not scg_taxonomy_was_run:
+ raise ConfigError("It seems the SCG taxonomy tables were not populated in this contigs database :/ Luckily it "
+ "is easy to fix that. Please see the program `anvi-run-scg-taxonomy`.")
+
+ if scg_taxonomy_database_version != self.ctx.scg_taxonomy_database_version:
+ self.progress.reset()
+ raise ConfigError("The SCG taxonomy database version on your computer (%s) is different than the SCG taxonomy database "
+ "version to populate your contigs database (%s). Please re-run the program `anvi-run-scg-taxonomy` "
+ "on your contigs-db." % (self.ctx.scg_taxonomy_database_version, scg_taxonomy_database_version))
+
+ if self.profile_db_path:
+ utils.is_profile_db_and_contigs_db_compatible(self.profile_db_path, self.contigs_db_path)
+ elif self.external_genomes_txt:
+ # working with an external-genomes file
+ ext = GenomeDescriptions(self.args)
+ ext.read_genome_paths_from_input_files()
+
+ for project_name in ext.external_genomes_dict:
+ contigs_db = ContigsDatabase(ext.external_genomes_dict[project_name]['contigs_db_path'], run=run_quiet, progress=progress_quiet)
+ ext.external_genomes_dict[project_name]['scg_taxonomy_was_run'] = contigs_db.meta['scg_taxonomy_was_run']
+ ext.external_genomes_dict[project_name]['scg_taxonomy_database_version'] = contigs_db.meta['scg_taxonomy_database_version']
+ contigs_db.disconnect()
+
+ genomes_with_no_scg_taxonomy_data = [g for g in ext.external_genomes_dict if not ext.external_genomes_dict[g]['scg_taxonomy_was_run']]
+ if len(genomes_with_no_scg_taxonomy_data):
+ if len(genomes_with_no_scg_taxonomy_data) == len(ext.external_genomes_dict):
+ raise ConfigError(f"Bad news. It seems none of the {len(ext.external_genomes_dict)} genomes in your external genomes file seem to have "
+ f"have been populated with SCG taxonomy data. Please run the `anvi-run-scg-taxonomy` program on all your genomes.")
+ else:
+ raise ConfigError(f"Bad news. {len(genomes_with_no_scg_taxonomy_data)} of your {len(ext.external_genomes_dict)} genomes in your "
+ f"external genomes file {P('does', len(genomes_with_no_scg_taxonomy_data), alt='do')} not seem to have been "
+ f"populated with SCG taxonomy data. Luckily it is easy to fix that: just run the `anvi-run-scg-taxonomy` program "
+ f"on these contigs-db files. Here is a list of those that cause the problem here: "
+ f"{', '.join(genomes_with_no_scg_taxonomy_data)}")
+
+ genomes_with_bad_scg_taxonomy_db_version = [g for g in ext.external_genomes_dict if ext.external_genomes_dict[g]['scg_taxonomy_database_version'] != self.ctx.scg_taxonomy_database_version]
+ if len(genomes_with_bad_scg_taxonomy_db_version):
+ if len(genomes_with_bad_scg_taxonomy_db_version) == len(ext.external_genomes_dict):
+ raise ConfigError(f"Bad news. It seems none of the {len(ext.external_genomes_dict)} genomes in your external genomes file seem to have "
+ f"the right database version compared to what you ahve on your computer. Please rerun the program `anvi-run-scg-taxonomy` "
+ f"on your genomes.")
+ else:
+ raise ConfigError(f"Bad news. {len(genomes_with_bad_scg_taxonomy_db_version)} of your {len(ext.external_genomes_dict)} genomes in your "
+ f"external genomes file {P('does', len(genomes_with_bad_scg_taxonomy_db_version), alt='do')} not seem to have the "
+ f"right SCG taxonomy database version compared to what you ahve on your computer. To fix that you can simply re-run the "
+ f"program `anvi-run-scg-taxonomy` on them. Here is a list of genomes that cause the problem here: "
+ f"{', '.join(genomes_with_bad_scg_taxonomy_db_version)}")
+ else:
+ # working with poop
+ raise ConfigError("Anvi'o needs an adult :/")
- if self.profile_db_path:
- utils.is_profile_db_and_contigs_db_compatible(self.profile_db_path, self.contigs_db_path)
+ # the rest of the sanity checks continue with no attention to whether we are working with a single contigs-db or
+ # an external genomes file. most of these sanity checks are not applicable for the external-genoems-txt case
+ # anyway since the user will get an error if they provilde an external-genomes-txt as well as a profile-db and so on
if self.collection_name and not self.profile_db_path:
raise ConfigError("If you are asking anvi'o to estimate taxonomy using a collection, you must also provide "
@@ -986,6 +1043,7 @@ class SCGTaxonomyEstimatorSingle(SCGTaxonomyArgs, SanityCheck, TaxonomyEstimator
self.contigs_db_path = A('contigs_db')
self.profile_db_path = A('profile_db')
self.collection_name = A('collection_name')
+ self.external_genomes_txt = A('external_genomes')
self.update_profile_db_with_taxonomy = A('update_profile_db_with_taxonomy')
self.bin_id = A('bin_id')
self.name = A('name')
diff --git a/bin/anvi-estimate-scg-taxonomy b/bin/anvi-estimate-scg-taxonomy
index cf9dbda48..be905e8d1 100755
--- a/bin/anvi-estimate-scg-taxonomy
+++ b/bin/anvi-estimate-scg-taxonomy
@@ -49,6 +49,7 @@ if __name__ == '__main__':
the contigs database represents a single genome. If the contigs database is actually\
a metagenome, you should use the `--metagenome` flag to explicitly declare that.")
groupA.add_argument(*anvio.A('contigs-db'), **anvio.K('contigs-db', {'required': False}))
+ groupA.add_argument(*anvio.A('external-genomes'), **anvio.K('external-genomes', {'required': False}))
groupA.add_argument(*anvio.A('metagenome-mode'), **anvio.K('metagenome-mode'))
groupB = parser.add_argument_group('INPUT #2', "In addition, you can also point out a profile database. In which case you also must\
I'll stash mine and check your PR now :)
Your PR looks lovely, @ivagljiva, and I left a minor comment, but I think it is good to go if you don't want to deal with that. Probably we can solve it if @FlorianTrigodet casts yet another peerus pressureus spell from a workshop somewhere 😂 This made me drop everything today and work on this: