sourmash
create utility (`tax extract`?) to work with taxonomic annotation/output => picklists
From https://github.com/sourmash-bio/sourmash/pull/2178,
@bluegenes:
More complicated use case that would be really neat to enable: run prefetch against, e.g. genus-level representative database. Then run gather and use the prefetch output csv as a picklist, but select all signatures in the same genera (or family, etc) as any match.
Actually, even if you were to keep `tax grep` as just a picklist utility, being able to scale up from matches to all members of the taxonomic group could be pretty neat.
I responded:
Right... is this kind of an inverse operation?
You would want to take a list of lineages (perhaps from a prefetch or gather file - note that `sourmash tax` only deals with gather files for now) and then build a taxonomy? or a picklist? that expands those matches to another level.

For example, you might:
* run gather
* annotate gather results with taxonomy using `sourmash tax annotate` => strain level
* 🪄 somehow 🪄 go from the lineages in the annotated gather file to a more general set of lineages at (say) the genus level
This strikes me as a pretty useful taxonomic utility, and points at functionality that is lacking -
* we don't really have anything that parses the annotated gather file, other than the metacoder example in #2041
* we don't have any way to manipulate a "bulk" taxonomy file in bulk ways, e.g. "give me all of the lineages from `taxfile1` that match at the genus level to the genomes/lineages in `taxfile2`"
.... elided ...
maybe in addition to `tax grep`, which works on a single match, we want a bulk matching function that takes in some format that links identifiers and taxonomies (annotated gather file? and/or taxonomy file?) as well as a taxonomy database, and then outputs picklists. "Promote these matches from strain to genus level" is one specific example here.

Just roughing it out,
`sourmash tax extract -g gather.csv -t gtdb.csv -r genus -o picklist.csv`
would take the matches in gather.csv, use the taxonomy in gtdb.csv, pull them back to genus level, and output a picklist.
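
To make the "promote matches to genus level" idea concrete, here is a rough Python sketch of what the proposed `tax extract` might do. It is not an existing sourmash command or API. The function names (`lineage_to_rank`, `extract_picklist`) are hypothetical, and the sketch assumes that the taxonomy CSV has an `ident` column plus one column per rank, and that the first whitespace-separated token of the gather `name` column is the identifier.

```python
import csv

# Assumed rank columns in the taxonomy CSV (GTDB-style); adjust as needed.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def lineage_to_rank(row, rank):
    """Return the lineage of a taxonomy row, truncated at `rank` (inclusive)."""
    idx = RANKS.index(rank)
    return tuple(row[r] for r in RANKS[: idx + 1])


def extract_picklist(gather_rows, tax_rows, rank="genus"):
    """Hypothetical core of `tax extract`: expand matches to a whole group.

    Returns every identifier in the taxonomy whose lineage, truncated at
    `rank`, equals the truncated lineage of at least one gather match.
    """
    lineage_by_ident = {
        row["ident"]: lineage_to_rank(row, rank) for row in tax_rows
    }

    matched_lineages = set()
    for row in gather_rows:
        parts = row["name"].split()
        if not parts:
            continue
        # Assumption: the identifier is the first token of the match name.
        ident = parts[0]
        if ident in lineage_by_ident:
            matched_lineages.add(lineage_by_ident[ident])

    return [
        ident
        for ident, lineage in lineage_by_ident.items()
        if lineage in matched_lineages
    ]


def run_extract(gather_csv, tax_csv, rank, out_csv):
    """Read the two CSVs, expand matches, and write a one-column picklist."""
    with open(gather_csv, newline="") as fp:
        gather_rows = list(csv.DictReader(fp))
    with open(tax_csv, newline="") as fp:
        tax_rows = list(csv.DictReader(fp))

    idents = extract_picklist(gather_rows, tax_rows, rank)

    with open(out_csv, "w", newline="") as fp:
        writer = csv.writer(fp)
        writer.writerow(["ident"])
        writer.writerows([ident] for ident in idents)
```

Comparing truncated lineage tuples rather than bare genus names avoids conflating two groups that happen to share a name under different parents. Matches whose identifier is missing from the taxonomy are silently skipped here; a real implementation would probably want to report them.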
@bluegenes
I think `tax extract` would be very useful and get us to the second use case!!!
1. to select all members of specific family: `tax grep family_name` --> picklist
2. to promote prefetch matches to genus level: `tax annotate` --> `tax extract` --> picklist

Note -- If we're providing the taxonomy file to `tax extract`, we could even just do the `tax annotate` step internally to avoid needing to run an extra step.

Additional use case: using these picklists with `exclude` allows us to easily exclude entire taxonomic groups from search, e.g. for testing taxonomic classification.
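
As a rough illustration of the family-selection and exclusion use cases, the sketch below picks every identifier in a named taxonomic group and builds a picklist argument for it. The helper names are hypothetical, the taxonomy rows are assumed to have an `ident` column plus one column per rank, and the `pickfile:colname:coltype[:exclude]` argument form is written from memory of the `--picklist` option, so it should be checked against the sourmash documentation.

```python
import csv


def idents_in_group(tax_rows, rank, name):
    """Return identifiers whose lineage has `name` at the given `rank`."""
    return [row["ident"] for row in tax_rows if row.get(rank) == name]


def write_picklist(idents, path):
    """Write a one-column CSV with an `ident` header."""
    with open(path, "w", newline="") as fp:
        writer = csv.writer(fp)
        writer.writerow(["ident"])
        writer.writerows([ident] for ident in idents)


def picklist_arg(path, exclude=False):
    """Build the value for a `--picklist` option (assumed syntax)."""
    arg = f"{path}:ident:ident"
    return arg + ":exclude" if exclude else arg
```

With `exclude=True`, the same picklist that selects a family would instead remove that family from the search, which is the setup described for testing taxonomic classification.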