sourmash
provide taxonomy operations that work on semicolon-separated lineages
when we use sourmash tax annotate on gather results, we produce a column with semicolon-separated lineages in it. we don't have many (any?) sourmash subcommands that natively ingest that format, although we do have some parsing code here https://github.com/sourmash-bio/sourmash/issues/2041 for metacoder.
might be nice to think about tooling that easily interconverts between semicolon separated lineages and comma separated lineages, or something.
(this came up during the discussion of tax grep over in https://github.com/sourmash-bio/sourmash/pull/2178#issuecomment-1206647255, and also seems relevant to some of the bigger select-on-metadata ideas out there, e.g. https://github.com/sourmash-bio/sourmash/issues/2180)
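for illustration, here is a rough sketch of what interconverting between the two formats could look like in plain Python. this is not sourmash code and doesn't use any sourmash API; the function names, the `ident` column name, and the assumption that names appear in fixed rank order (superkingdom through species) are all hypothetical.

```python
import csv
import io

# Assumed rank order; a real taxonomy may use different or additional ranks.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def split_lineage(lineage, ranks=RANKS):
    """Turn 'a;b;c' into {rank: name}, padding missing lower ranks with ''."""
    names = [n.strip() for n in lineage.split(";")] if lineage else []
    if len(names) > len(ranks):
        raise ValueError(f"lineage has {len(names)} names, expected at most {len(ranks)}")
    names += [""] * (len(ranks) - len(names))
    return dict(zip(ranks, names))


def join_lineage(row, ranks=RANKS):
    """Turn a {rank: name} mapping back into 'a;b;c', dropping empty trailing ranks."""
    names = [row.get(r, "") for r in ranks]
    while names and not names[-1]:
        names.pop()
    return ";".join(names)


def lineages_to_csv(pairs, ranks=RANKS):
    """Write (ident, lineage) pairs as CSV text with one column per rank."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["ident", *ranks], lineterminator="\n")
    writer.writeheader()
    for ident, lineage in pairs:
        writer.writerow({"ident": ident, **split_lineage(lineage, ranks)})
    return out.getvalue()
```

a lineage that stops at a higher rank (say, genus) just gets empty columns for the lower ranks, and `join_lineage` trims those again on the way back, so the round trip is lossless for well-formed input.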
Implemented in #2333 - so, for example, the new summarize command would print out:
% sourmash tax summarize SRR606249-k31.x.gtdb.gather.with-lineages.csv
== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loading taxonomies...
...loaded 84 entries.
num idents: 84
rank superkingdom: 2 distinct identifiers
rank phylum: 25 distinct identifiers
rank class: 32 distinct identifiers
rank order: 42 distinct identifiers
rank family: 52 distinct identifiers
rank genus: 60 distinct identifiers
rank species: 84 distinct identifiers
and
% sourmash tax prepare -t SRR606249-k31.x.gtdb.gather.with-lineages.csv -o zzz.csv -F csv
works as well.
semicolon-separated lineages and gather with-lineages output are now natively supported as a taxonomy spreadsheet and can be used with all tax commands per https://github.com/sourmash-bio/sourmash/pull/2333 🎉