anvio icon indicating copy to clipboard operation
anvio copied to clipboard

[FEATURE] anvi-estimate-scg-taxonomy should accept external/internal genomes file

Open ivagljiva opened this issue 3 years ago • 1 comments

The need

If you want to estimate taxonomy on a lot of different genomes, right now the only way is to run anvi-estimate-scg-taxonomy one at a time on each individual contigs database. We need a more high-throughput way of doing things.

The solution

anvi-estimate-scg-taxonomy should accept an external genomes file (or an internal genomes one, why not) as input.

Beneficiaries

Everyone running anvi-estimate-scg-taxonomy on lots and lots of genomes.

ivagljiva avatar Mar 09 '21 17:03 ivagljiva

Hello,

There is a group of 20 people that agree with you right now. Please help.

Thanks

FlorianTrigodet avatar Apr 07 '22 16:04 FlorianTrigodet

Hello again,

There is a group of 20 (more) people that agree (again) with you right now. Plz

Danke schon

FlorianTrigodet avatar Apr 06 '24 15:04 FlorianTrigodet

Hello. Sorry for the multi-year delay on this. I implemented something in PR #2249. @FlorianTrigodet , could you test it please with your dataset and let me know how it goes?

ivagljiva avatar Apr 07 '24 12:04 ivagljiva

OMG. I was working on this literally just now (with embarrassment associated with not implementing this all this time), and SAW YOUR PR WHICH IS CRAY 😂 Looking at your code I can tell that you chose a better path. I spent way too much effort on exceptions:

diff --git a/anvio/taxonomyops/scg.py b/anvio/taxonomyops/scg.py
index a8a47090b..b1a40c633 100644
--- a/anvio/taxonomyops/scg.py
+++ b/anvio/taxonomyops/scg.py
@@ -2,7 +2,7 @@
 # -*- coding: utf-8
 """
 Classes to setup remote SCG databases in local, use local databases to affiliate SCGs in anvi'o
-contigs databases with taxon names, and estimate taxonomy for genomes and metagneomes.
+contigs databases with taxon names, and estimate taxonomy for genomes and metagenomes.
 """

 import os
@@ -21,7 +21,7 @@ import anvio.filesnpaths as filesnpaths
 from anvio.errors import ConfigError
 from anvio.dbops import ContigsDatabase, ContigsSuperclass
 from anvio.drivers.diamond import Diamond
-from anvio.genomedescriptions import MetagenomeDescriptions
+from anvio.genomedescriptions import GenomeDescriptions, MetagenomeDescriptions

 from anvio.taxonomyops import AccessionIdToTaxonomy
 from anvio.taxonomyops import TaxonomyEstimatorSingle
@@ -30,6 +30,7 @@ from anvio.taxonomyops import PopulateContigsDatabaseWithTaxonomy
 run_quiet = terminal.Run(log_file_path=None, verbose=False)
 progress_quiet = terminal.Progress(verbose=False)
 pp = terminal.pretty_print
+P = terminal.pluralize


 __author__ = "Developers of anvi'o (see AUTHORS.txt)"
@@ -200,26 +201,82 @@ class SanityCheck(object):
                                       "the parameter space of this program is like a mine field and we are very upset about it "
                                       "as well).")

-                if not self.contigs_db_path:
+                if not self.contigs_db_path and not self.external_genomes_txt:
                     raise ConfigError("For these things to work, you need to provide a contigs database for the anvi'o SCG "
-                                      "taxonomy workflow :(")
-
-                utils.is_contigs_db(self.contigs_db_path)
-
-                scg_taxonomy_was_run = ContigsDatabase(self.contigs_db_path, run=run_quiet, progress=progress_quiet).meta['scg_taxonomy_was_run']
-                scg_taxonomy_database_version = ContigsDatabase(self.contigs_db_path, run=run_quiet, progress=progress_quiet).meta['scg_taxonomy_database_version']
-                if not scg_taxonomy_was_run:
-                    raise ConfigError("It seems the SCG taxonomy tables were not populated in this contigs database :/ Luckily it "
-                                      "is easy to fix that. Please see the program `anvi-run-scg-taxonomy`.")
-
-                if scg_taxonomy_database_version != self.ctx.scg_taxonomy_database_version:
-                    self.progress.reset()
-                    raise ConfigError("The SCG taxonomy database version on your computer (%s) is different than the SCG taxonomy database "
-                                      "version to populate your contigs database (%s). Please re-run the program `anvi-run-scg-taxonomy` "
-                                      "on your contigs-db." % (self.ctx.scg_taxonomy_database_version, scg_taxonomy_database_version))
+                                      "taxonomy workflow (or an external-genomes-txt file when applicable) :(")
+
+                if self.contigs_db_path and self.external_genomes_txt:
+                    raise ConfigError("Please use either a single contigs-db, or an external-genomes file as input :(")
+
+                if self.external_genomes_txt and self.profile_db_path:
+                    raise ConfigError("When you provide an external genomes file, anvi'o assumes that you have a bunch of single "
+                                      "genmes in that file for which you wish to estimate SCG taxonomy. In that case there is no "
+                                      "need for a profile database, but you have initiated this process with one. Anvi'o is "
+                                      "confuse. Please remove the profile-db from your request.")
+
+                # since the user can either initiate this class with a contigs-db or an external genomes file,
+                # we need to do the following sanity checks a bit more tactfully.
+                if self.contigs_db_path:
+                    # working with a single contigs-db file
+                    contigs_db = ContigsDatabase(self.contigs_db_path, run=run_quiet, progress=progress_quiet)
+                    scg_taxonomy_was_run = contigs_db.meta['scg_taxonomy_was_run']
+                    scg_taxonomy_database_version = contigs_db.meta['scg_taxonomy_database_version']
+                    contigs_db.disconnect()
+
+                    if not scg_taxonomy_was_run:
+                        raise ConfigError("It seems the SCG taxonomy tables were not populated in this contigs database :/ Luckily it "
+                                          "is easy to fix that. Please see the program `anvi-run-scg-taxonomy`.")
+
+                    if scg_taxonomy_database_version != self.ctx.scg_taxonomy_database_version:
+                        self.progress.reset()
+                        raise ConfigError("The SCG taxonomy database version on your computer (%s) is different than the SCG taxonomy database "
+                                          "version to populate your contigs database (%s). Please re-run the program `anvi-run-scg-taxonomy` "
+                                          "on your contigs-db." % (self.ctx.scg_taxonomy_database_version, scg_taxonomy_database_version))
+
+                    if self.profile_db_path:
+                        utils.is_profile_db_and_contigs_db_compatible(self.profile_db_path, self.contigs_db_path)
+                elif self.external_genomes_txt:
+                    # working with an external-genomes file
+                    ext = GenomeDescriptions(self.args)
+                    ext.read_genome_paths_from_input_files()
+
+                    for project_name in ext.external_genomes_dict:
+                        contigs_db = ContigsDatabase(ext.external_genomes_dict[project_name]['contigs_db_path'], run=run_quiet, progress=progress_quiet)
+                        ext.external_genomes_dict[project_name]['scg_taxonomy_was_run'] = contigs_db.meta['scg_taxonomy_was_run']
+                        ext.external_genomes_dict[project_name]['scg_taxonomy_database_version'] = contigs_db.meta['scg_taxonomy_database_version']
+                        contigs_db.disconnect()
+
+                    genomes_with_no_scg_taxonomy_data = [g for g in ext.external_genomes_dict if not ext.external_genomes_dict[g]['scg_taxonomy_was_run']]
+                    if len(genomes_with_no_scg_taxonomy_data):
+                        if len(genomes_with_no_scg_taxonomy_data) == len(ext.external_genomes_dict):
+                            raise ConfigError(f"Bad news. It seems none of the {len(ext.external_genomes_dict)} genomes in your external genomes file seem to have "
+                                              f"have been populated with SCG taxonomy data. Please run the `anvi-run-scg-taxonomy` program on all your genomes.")
+                        else:
+                            raise ConfigError(f"Bad news. {len(genomes_with_no_scg_taxonomy_data)} of your {len(ext.external_genomes_dict)} genomes in your "
+                                              f"external genomes file {P('does', len(genomes_with_no_scg_taxonomy_data), alt='do')} not seem to have been "
+                                              f"populated with SCG taxonomy data. Luckily it is easy to fix that: just run the `anvi-run-scg-taxonomy` program "
+                                              f"on these contigs-db files. Here is a list of those that cause the problem here: "
+                                              f"{', '.join(genomes_with_no_scg_taxonomy_data)}")
+
+                    genomes_with_bad_scg_taxonomy_db_version = [g for g in ext.external_genomes_dict if ext.external_genomes_dict[g]['scg_taxonomy_database_version'] != self.ctx.scg_taxonomy_database_version]
+                    if len(genomes_with_bad_scg_taxonomy_db_version):
+                        if len(genomes_with_bad_scg_taxonomy_db_version) == len(ext.external_genomes_dict):
+                            raise ConfigError(f"Bad news. It seems none of the {len(ext.external_genomes_dict)} genomes in your external genomes file seem to have "
+                                              f"the right database version compared to what you ahve on your computer. Please rerun the program `anvi-run-scg-taxonomy` "
+                                              f"on your genomes.")
+                        else:
+                            raise ConfigError(f"Bad news. {len(genomes_with_bad_scg_taxonomy_db_version)} of your {len(ext.external_genomes_dict)} genomes in your "
+                                              f"external genomes file {P('does', len(genomes_with_bad_scg_taxonomy_db_version), alt='do')} not seem to have the "
+                                              f"right SCG taxonomy database version compared to what you ahve on your computer. To fix that you can simply re-run the "
+                                              f"program `anvi-run-scg-taxonomy` on them. Here is a list of genomes that cause the problem here: "
+                                              f"{', '.join(genomes_with_bad_scg_taxonomy_db_version)}")
+                else:
+                    # working with poop
+                    raise ConfigError("Anvi'o needs an adult :/")

-                if self.profile_db_path:
-                    utils.is_profile_db_and_contigs_db_compatible(self.profile_db_path, self.contigs_db_path)
+                # the rest of the sanity checks continue with no attention to whether we are working with a single contigs-db or
+                # an external genomes file. most of these sanity checks are not applicable for the external-genoems-txt case
+                # anyway since the user will get an error if they provilde an external-genomes-txt as well as a profile-db and so on

                 if self.collection_name and not self.profile_db_path:
                     raise ConfigError("If you are asking anvi'o to estimate taxonomy using a collection, you must also provide "
@@ -986,6 +1043,7 @@ class SCGTaxonomyEstimatorSingle(SCGTaxonomyArgs, SanityCheck, TaxonomyEstimator
         self.contigs_db_path = A('contigs_db')
         self.profile_db_path = A('profile_db')
         self.collection_name = A('collection_name')
+        self.external_genomes_txt = A('external_genomes')
         self.update_profile_db_with_taxonomy = A('update_profile_db_with_taxonomy')
         self.bin_id = A('bin_id')
         self.name = A('name')
diff --git a/bin/anvi-estimate-scg-taxonomy b/bin/anvi-estimate-scg-taxonomy
index cf9dbda48..be905e8d1 100755
--- a/bin/anvi-estimate-scg-taxonomy
+++ b/bin/anvi-estimate-scg-taxonomy
@@ -49,6 +49,7 @@ if __name__ == '__main__':
                                                     the contigs database represents a single genome. If the contigs database is actually\
                                                     a metagenome, you should use the `--metagenome` flag to explicitly declare that.")
     groupA.add_argument(*anvio.A('contigs-db'), **anvio.K('contigs-db', {'required': False}))
+    groupA.add_argument(*anvio.A('external-genomes'), **anvio.K('external-genomes', {'required': False}))
     groupA.add_argument(*anvio.A('metagenome-mode'), **anvio.K('metagenome-mode'))

     groupB = parser.add_argument_group('INPUT #2', "In addition, you can also point out a profile database. In which case you also must\

I'll stash mine and check your PR now :)

meren avatar Apr 07 '24 12:04 meren

Your PR looks lovely, @ivagljiva, and I left a minor comment, but I think it is good to go if you don't want to deal with that. Probably we can solve it if @FlorianTrigodet casts yet another peerus pressureus spell from a workshop somewhere 😂 This made me drop everything today and work on this:

image

meren avatar Apr 07 '24 13:04 meren