
[WIP] add `tax summarize`

Open ctb opened this issue 1 year ago • 1 comments

This PR adds a tax summarize command per #2212.

It also:

  • tackles native loading of with-lineages files produced by tax annotate as taxonomy spreadsheets (not yet implemented)
  • improves error reporting for tax CSV files with unexpected Unicode formatting (such as a leading byte-order mark in the header), for https://github.com/sourmash-bio/sourmash/issues/2326

  • Tackles https://github.com/sourmash-bio/sourmash/issues/2212
  • Tackles https://github.com/sourmash-bio/sourmash/issues/2185
  • Tackles parts of https://github.com/sourmash-bio/sourmash/issues/2326

TODO

  • [ ] tests!
  • [ ] docs!
  • [ ] provide "linting" style output?
  • [ ] maybe we want to use this command, or a separate command, to compare a set of signatures (or a manifest...) against a set of taxonomies? e.g. tax crosscheck --db db --taxonomy <taxonomy> would tell us which identifiers don't have taxonomy, and which taxonomy entries don't have sketches.
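The crosscheck idea in the last TODO item boils down to two set differences. Here is a minimal sketch, assuming both sides have already been reduced to plain identifier strings; `crosscheck` and the toy identifiers are hypothetical and not part of sourmash:

```python
def crosscheck(sketch_idents, taxonomy_idents):
    """Compare identifiers found in a database/manifest with those in a taxonomy.

    Hypothetical helper: returns (identifiers without taxonomy,
    taxonomy entries without sketches), each as a sorted list.
    """
    sketches = set(sketch_idents)
    taxonomy = set(taxonomy_idents)
    missing_taxonomy = sorted(sketches - taxonomy)   # sketched, but no lineage
    missing_sketches = sorted(taxonomy - sketches)   # lineage, but no sketch
    return missing_taxonomy, missing_sketches


# Toy identifiers, made up for illustration only.
no_tax, no_sketch = crosscheck(
    ["GCF_1", "GCF_2", "GCF_3"],
    ["GCF_2", "GCF_3", "GCF_4"],
)
print("identifiers without taxonomy:", no_tax)
print("taxonomy entries without sketches:", no_sketch)
```

A real command would also have to decide how to normalize identifiers (for example, version suffixes) before comparing; that is left out here.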

Example output

Running on a traditional taxonomy file:

% sourmash tax summarize gtdb-rs202.taxonomy.v2.db                      

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 258406 entries.
num idents: 258406
rank superkingdom:        2 distinct identifiers
rank phylum:              169 distinct identifiers
rank class:               419 distinct identifiers
rank order:               1312 distinct identifiers
rank family:              3264 distinct identifiers
rank genus:               12888 distinct identifiers
rank species:             47894 distinct identifiers

On a gather-with-lineages file:

% sourmash tax summarize SRR606249-k31.x.gtdb.gather.with-lineages.csv                   

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 84 entries.
num idents: 84
rank superkingdom:        2 distinct identifiers
rank phylum:              25 distinct identifiers
rank class:               32 distinct identifiers
rank order:               42 distinct identifiers
rank family:              52 distinct identifiers
rank genus:               60 distinct identifiers
rank species:             84 distinct identifiers
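One way to read the per-rank lines above is as the number of distinct lineages seen at each rank. The following is a sketch under that assumption, not the code in this PR; `count_distinct_per_rank` and the toy lineages are hypothetical:

```python
from collections import defaultdict

RANKS = ["superkingdom", "phylum", "class", "order",
         "family", "genus", "species"]


def count_distinct_per_rank(lineages):
    """lineages: dict mapping identifier -> list of names, ordered as RANKS.

    Counts distinct lineage prefixes at each rank (an assumed definition
    of "distinct identifiers" in the output above).
    """
    seen = defaultdict(set)
    for names in lineages.values():
        for depth, rank in enumerate(RANKS[:len(names)], start=1):
            seen[rank].add(tuple(names[:depth]))
    return {rank: len(seen[rank]) for rank in RANKS}


# Toy data, made up for illustration only.
toy = {
    "GCF_1": ["d__Bacteria", "p__Proteobacteria", "c__Gammaproteobacteria"],
    "GCF_2": ["d__Bacteria", "p__Firmicutes", "c__Bacilli"],
    "GCF_3": ["d__Archaea", "p__Halobacteriota"],
}
counts = count_distinct_per_rank(toy)
print(f"num idents: {len(toy)}")
for rank in RANKS:
    print(f"rank {rank + ':':<20} {counts[rank]} distinct identifiers")
```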

On the bad CSV file from #2326:

% sourmash tax summarize /Users/t/Downloads/cheesegenomes.lineages.csv

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
ERROR while loading taxonomies!
cannot read taxonomy assignments from '/Users/t/Downloads/cheesegenomes.lineages.csv': No taxonomic identifiers found; headers are '\ufeffident','taxid','superkingdom','phylum','class','order','family','genus','species','strain'
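The '\ufeffident' header in the error above is a UTF-8 byte-order mark (BOM) glued to the first column name, which is why no identifier column is recognized. A small standard-library sketch of the effect, using made-up CSV content; it is not the fix applied in this PR:

```python
import csv
import io

# Bytes as a BOM-prefixed CSV would appear on disk (toy content).
raw = "\ufeffident,taxid,superkingdom\nGCF_1,562,d__Bacteria\n".encode("utf-8")

# Decoding as plain UTF-8 keeps the BOM as part of the first header.
plain_headers = next(csv.reader(io.StringIO(raw.decode("utf-8"))))

# The 'utf-8-sig' codec strips a leading BOM if one is present.
bom_safe_headers = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"))))

print(plain_headers)     # first header still carries the BOM
print(bom_safe_headers)  # first header is plain 'ident'
```

Opening the file with `open(path, newline="", encoding="utf-8-sig")` has the same effect, and also reads files without a BOM unchanged.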

CSV output of per-rank information

With CSV output,

% sourmash tax summarize gtdb-rs207.taxonomy.sqldb -o aaa.csv

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 317542 entries.
num idents: 317542
rank superkingdom:        2 distinct identifiers
rank phylum:              189 distinct identifiers
rank class:               481 distinct identifiers
rank order:               1593 distinct identifiers
rank family:              4107 distinct identifiers
rank genus:               16686 distinct identifiers
rank species:             65703 distinct identifiers
now calculating detailed lineage counts...
...done!
saved 88761 lineage counts to 'aaa.csv'

and aaa.csv looks like:

    rank          count   lineage
0   superkingdom  311480  d__Bacteria
1   phylum        141114  d__Bacteria;p__Proteobacteria
2   class         121804  d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria
3   order         74108   d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales
4   family        63971   d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae
5   phylum        61795   d__Bacteria;p__Firmicutes
6   class         61794   d__Bacteria;p__Firmicutes;c__Bacilli
7   order         32177   d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales
8   phylum        28532   d__Bacteria;p__Actinobacteriota
9   genus         27205   d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia
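The rows above look like counts of identifiers under each lineage prefix, sorted from most to least common. A sketch of how such a table could be built, assuming that reading; `lineage_count_rows`, the column order, and the toy data are assumptions rather than the PR's implementation:

```python
import csv
import io
from collections import Counter

RANKS = ["superkingdom", "phylum", "class", "order",
         "family", "genus", "species"]


def lineage_count_rows(lineages):
    """Return (rank, count, lineage) rows, most common lineage first."""
    counts = Counter()
    for names in lineages.values():
        # Every prefix of a lineage contributes one count at its rank.
        for depth in range(1, len(names) + 1):
            counts[(RANKS[depth - 1], ";".join(names[:depth]))] += 1
    rows = [(rank, n, lineage) for (rank, lineage), n in counts.items()]
    rows.sort(key=lambda row: (-row[1], row[2]))  # ties broken by lineage
    return rows


# Toy data, made up for illustration only.
toy = {
    "GCF_1": ["d__Bacteria", "p__Proteobacteria", "c__Gammaproteobacteria"],
    "GCF_2": ["d__Bacteria", "p__Proteobacteria", "c__Gammaproteobacteria"],
    "GCF_3": ["d__Bacteria", "p__Firmicutes", "c__Bacilli"],
}
rows = lineage_count_rows(toy)

out = io.StringIO()  # stand-in for an output file such as aaa.csv
writer = csv.writer(out)
writer.writerow(["rank", "count", "lineage"])
writer.writerows(rows)
print(out.getvalue())
```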

ctb, Oct 15 '22 16:10

Codecov Report

Merging #2333 (b5ac05c) into latest (2198b32) will increase coverage by 8.01%. The diff coverage is 79.20%.

@@            Coverage Diff             @@
##           latest    #2333      +/-   ##
==========================================
+ Coverage   83.97%   91.98%   +8.01%     
==========================================
  Files         129      101      -28     
  Lines       14967    11518    -3449     
  Branches     2191     2215      +24     
==========================================
- Hits        12568    10595    -1973     
+ Misses       2104      621    -1483     
- Partials      295      302       +7     
Flag     Coverage Δ
python   91.98% <79.20%> (-0.12%) :arrow_down:
rust     ?

Flags with carried forward coverage won't be shown.

Impacted Files                       Coverage Δ
src/sourmash/tax/tax_utils.py        95.98% <57.50%> (-2.35%) :arrow_down:
src/sourmash/tax/__main__.py         92.35% <91.30%> (-0.28%) :arrow_down:
src/sourmash/cli/tax/__init__.py     100.00% <100.00%> (ø)
src/sourmash/cli/tax/summarize.py    100.00% <100.00%> (ø)
src/core/src/index/mod.rs
src/core/src/ffi/minhash.rs
src/core/src/ffi/nodegraph.rs
src/core/src/encodings.rs
src/core/src/index/search.rs
src/core/src/index/revindex.rs
... and 23 more


codecov[bot], Oct 15 '22 16:10