
provide taxonomy operations that work on semicolon-separated lineages

Open ctb opened this issue 1 year ago • 2 comments

when we use sourmash tax annotate on gather results, we produce a column with semicolon-separated lineages in it. we don't have many (any?) sourmash subcommands that natively ingest that format, although we do have some parsing code here https://github.com/sourmash-bio/sourmash/issues/2041 for metacoder.

might be nice to think about tooling that easily interconverts between semicolon-separated lineages and comma-separated lineages, or something.
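A rough sketch of what such an interconversion could look like, in plain Python rather than sourmash's own code. The rank list, the function names (`split_lineage`, `join_lineage`, `lineages_to_csv`) and the `ident` column name are assumptions made for illustration only:

```python
import csv
import io

# Assumed rank order (hypothetical default; adjust to the taxonomy in use).
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def split_lineage(lineage, ranks=RANKS):
    """Turn 'a;b;c' into {rank: name}, padding missing ranks with ''."""
    names = [name.strip() for name in lineage.split(";")]
    names = names[: len(ranks)]
    names += [""] * (len(ranks) - len(names))
    return dict(zip(ranks, names))


def join_lineage(row, ranks=RANKS):
    """Turn {rank: name} back into 'a;b;c', dropping trailing empty ranks."""
    names = [row.get(rank, "") for rank in ranks]
    while names and not names[-1]:
        names.pop()
    return ";".join(names)


def lineages_to_csv(rows, ranks=RANKS):
    """Write (ident, lineage string) pairs as CSV text, one column per rank."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["ident"] + ranks, lineterminator="\n")
    writer.writeheader()
    for ident, lineage in rows:
        writer.writerow({"ident": ident, **split_lineage(lineage, ranks)})
    return out.getvalue()
```

Going the other way is just `csv.DictReader` plus `join_lineage` on each row. Lineages that stop above species round-trip cleanly because trailing empty ranks are dropped again on join.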

ctb, Aug 07 '22 13:08

(this came up during the discussion of tax grep over in https://github.com/sourmash-bio/sourmash/pull/2178#issuecomment-1206647255, and also seems relevant to some of the bigger select-on-metadata ideas out there e.g. https://github.com/sourmash-bio/sourmash/issues/2180)

ctb, Aug 07 '22 13:08

Implemented in #2333 - so, for example, the new summarize command would print out:

% sourmash tax summarize SRR606249-k31.x.gtdb.gather.with-lineages.csv

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 84 entries.
num idents: 84
rank superkingdom:        2 distinct identifiers
rank phylum:              25 distinct identifiers
rank class:               32 distinct identifiers
rank order:               42 distinct identifiers
rank family:              52 distinct identifiers
rank genus:               60 distinct identifiers
rank species:             84 distinct identifiers

and

% sourmash tax prepare -t SRR606249-k31.x.gtdb.gather.with-lineages.csv -o zzz.csv -F csv

works as well.
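For anyone wondering how per-rank counts like the ones in the summarize output could be derived from a column of semicolon-separated lineages, here is a small standalone sketch. It is only one plausible reading (count the distinct lineage prefixes at each rank) and is not taken from the sourmash source; `count_distinct_per_rank` and the rank list are hypothetical names for illustration:

```python
# Assumed rank order (hypothetical default; adjust to the taxonomy in use).
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def count_distinct_per_rank(lineages, ranks=RANKS):
    """Count distinct lineage prefixes at each rank.

    A lineage contributes to a rank only if it has a non-empty name at
    that rank and at every rank above it.
    """
    seen = {rank: set() for rank in ranks}
    for lineage in lineages:
        names = [name.strip() for name in lineage.split(";")]
        for depth, rank in enumerate(ranks, start=1):
            if len(names) < depth or not names[depth - 1]:
                break  # lineage ends (or has a gap) above this rank
            # Use the whole prefix so same-named taxa under different
            # parents are counted separately.
            seen[rank].add(tuple(names[:depth]))
    return {rank: len(seen[rank]) for rank in ranks}
```

Using the full prefix rather than just the name at each rank is a design choice: it avoids merging taxa that share a name but sit under different parents.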

ctb, Oct 15 '22 17:10

semicolon-separated lineages and gather with-lineages output are now natively supported as taxonomy spreadsheets and can be used with all tax commands, per https://github.com/sourmash-bio/sourmash/pull/2333 🎉

ctb, Nov 14 '22 14:11