
create utility (`tax extract`?) to work with taxonomic annotation/output => picklists

Open ctb opened this issue 1 year ago • 0 comments

From https://github.com/sourmash-bio/sourmash/pull/2178,

@bluegenes:

More complicated use case that would be really neat to enable: run prefetch against, e.g. genus-level representative database. Then run gather and use the prefetch output csv as a picklist, but select all signatures in the same genera (or family, etc) as any match.

Actually, even if you were to keep tax grep as just a picklist utility, being able to scale up from matches to all members of the taxonomic group could be pretty neat.

I responded:

Right... is this kind of an inverse operation?

You would want to take a list of lineages (perhaps from a prefetch or gather file - note that sourmash tax only deals with gather files for now) and then build a taxonomy? or a picklist? that expands those matches to another level.

For example, you might:

* run gather

* annotate gather results with taxonomy using `sourmash tax annotate` => strain level

* 🪄 somehow 🪄 go from the lineages in the annotated gather file to a more general set of lineages at (say) the genus level

This strikes me as a pretty useful taxonomic utility, and points at functionality that is lacking -

* we don't really have anything that parses the annotated gather file, other than the metacoder example in #2041
* we don't have any way to manipulate a taxonomy file in bulk, e.g. "give me all of the lineages from taxfile1 that match at the genus level to the genomes/lineages in taxfile2."
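The bulk operation described in the last bullet could be sketched roughly as follows. This is only an illustration, not existing sourmash functionality: it assumes both taxonomy files are CSVs with an `ident` column plus one column per rank (`superkingdom` through `species`), and the function names `load_taxonomy`, `truncate_lineage` and `matching_lineages` are hypothetical.

```python
import csv

# Assumption: the taxonomy CSVs have one column per rank, most general first.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def load_taxonomy(path):
    """Read a taxonomy CSV into a list of dicts (hypothetical helper)."""
    with open(path, newline="") as fp:
        return list(csv.DictReader(fp))


def truncate_lineage(row, rank):
    """Return the lineage of `row` cut off at `rank`, as a tuple."""
    depth = RANKS.index(rank) + 1
    return tuple(row[r] for r in RANKS[:depth])


def matching_lineages(taxfile1_rows, taxfile2_rows, rank="genus"):
    """Return the rows of taxfile1 whose lineage, down to `rank`,
    matches the lineage of at least one row in taxfile2."""
    targets = {truncate_lineage(r, rank) for r in taxfile2_rows}
    return [r for r in taxfile1_rows if truncate_lineage(r, rank) in targets]
```

Comparing the whole lineage prefix, rather than just the genus name, avoids merging two groups that happen to share a name under different parents.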

... elided ...

Maybe in addition to `tax grep`, which works on a single match, we want a bulk matching function. It would take in some format that links identifiers and taxonomies (annotated gather file? and/or taxonomy file?) as well as a taxonomy database, and then output picklists. "Promote these matches from strain to genus level" is one specific example here.

Just roughing it out,

`sourmash tax extract -g gather.csv -t gtdb.csv -r genus -o picklist.csv`

would take the matches in gather.csv, use the taxonomy in gtdb.csv, pull them back to genus level, and output a picklist
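To make the idea concrete, here is a minimal Python sketch of what such a command might do internally. It is not an existing sourmash API. It assumes the gather CSV has a `name` column whose first space-separated token is the identifier, that the taxonomy CSV has an `ident` column plus one column per rank, and that identifiers match exactly between the two files. The names `promote_matches` and `write_picklist` and the toy data are made up for illustration.

```python
import csv
import io

# Assumption: the taxonomy CSV has one column per rank, most general first.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def lineage_prefix(tax_row, rank):
    """Return the lineage of `tax_row` truncated at `rank`, as a tuple."""
    depth = RANKS.index(rank) + 1
    return tuple(tax_row[r] for r in RANKS[:depth])


def promote_matches(gather_rows, tax_rows, rank="genus"):
    """Return sorted idents of every taxonomy entry that shares a lineage,
    down to `rank`, with at least one gather match."""
    tax_rows = list(tax_rows)
    by_ident = {row["ident"]: row for row in tax_rows}

    wanted = set()
    for row in gather_rows:
        # Assumption: the identifier is the first token of the 'name' column.
        ident = row["name"].split(" ")[0]
        # Skip matches with no taxonomy entry or no assignment at this rank.
        if ident in by_ident and by_ident[ident][rank]:
            wanted.add(lineage_prefix(by_ident[ident], rank))

    return sorted(r["ident"] for r in tax_rows if lineage_prefix(r, rank) in wanted)


def write_picklist(idents, fp):
    """Write a one-column CSV with an 'ident' header."""
    writer = csv.writer(fp)
    writer.writerow(["ident"])
    for ident in idents:
        writer.writerow([ident])


# Toy, made-up data standing in for gtdb.csv and gather.csv.
TAX_CSV = """ident,superkingdom,phylum,class,order,family,genus,species
GCF_1,d__Bacteria,p__P1,c__C1,o__O1,f__F1,g__G1,s__G1 sp1
GCF_2,d__Bacteria,p__P1,c__C1,o__O1,f__F1,g__G1,s__G1 sp2
GCF_3,d__Bacteria,p__P1,c__C1,o__O1,f__F1,g__G2,s__G2 sp1
"""
GATHER_CSV = """name,f_unique_weighted
GCF_1 made-up genome one,0.5
"""

tax = list(csv.DictReader(io.StringIO(TAX_CSV)))
gather = list(csv.DictReader(io.StringIO(GATHER_CSV)))

picked = promote_matches(gather, tax, rank="genus")
out = io.StringIO()
write_picklist(picked, out)
```

With the toy data, the single gather match to `GCF_1` is promoted to every entry in the same genus. A file written this way should be usable with something like `--picklist picklist.csv:ident:ident`, assuming the identifiers line up with the signature names in the database being searched.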

@bluegenes

I think tax extract would be very useful and get us to the second use case!!!

1. to select all members of specific family: `tax grep family_name` --> picklist

2. to promote prefetch matches to genus level: `tax annotate` --> `tax extract` --> picklist

Note -- If we're providing the taxonomy file to tax extract, we could even just do the tax annotate step internally to avoid needing to run an extra step.

Additional use case: using these picklists with exclude would let us easily exclude entire taxonomic groups from a search, e.g. for testing taxonomic classification.

ctb, Aug 07 '22 13:08