sourmash
[WIP] add `tax summarize`
This PR adds a `tax summarize` command per #2212.
It also:
- tackles native loading of with-lineages files produced by `tax annotate` as taxonomy spreadsheets (not yet implemented)
- improves error reporting output for wonky unicode formatted tax CSV files for https://github.com/sourmash-bio/sourmash/issues/2326
Tackles https://github.com/sourmash-bio/sourmash/issues/2212
Tackles https://github.com/sourmash-bio/sourmash/issues/2185
Tackles parts of https://github.com/sourmash-bio/sourmash/issues/2326
TODO
- [ ] tests!
- [ ] docs!
- [ ] provide "linting" style output?
- [ ] maybe we want to use this command, or a separate command, to compare between a set of signatures (or a manifest...) and a set of taxonomies? e.g. `tax crosscheck --db db --taxonomy <taxonomy>`, which would tell us which identifiers don't have taxonomy, and which taxonomy entries don't have sketches.
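The cross-check idea in the last TODO item boils down to two set differences. A minimal sketch, assuming the signature identifiers and the taxonomy identifiers have already been loaded into plain Python collections; the function name `crosscheck` and the example identifiers are hypothetical and not part of sourmash:

```python
def crosscheck(sketch_idents, tax_idents):
    """Compare identifiers from a signature collection against a taxonomy.

    Hypothetical helper, not a sourmash API. Both arguments are iterables
    of identifier strings.
    """
    sketch_idents = set(sketch_idents)
    tax_idents = set(tax_idents)
    return {
        # identifiers that have a sketch but no taxonomy entry
        "missing_taxonomy": sorted(sketch_idents - tax_idents),
        # taxonomy entries that have no corresponding sketch
        "missing_sketch": sorted(tax_idents - sketch_idents),
    }


result = crosscheck(
    ["GCF_000000001", "GCF_000000002"],  # made-up sketch identifiers
    ["GCF_000000002", "GCF_000000003"],  # made-up taxonomy identifiers
)
print(result)
```

Returning both directions from one call keeps the report symmetric, which matches the two questions the TODO item asks.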
Example output
Running on a traditional taxonomy file:
% sourmash tax summarize gtdb-rs202.taxonomy.v2.db
== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loading taxonomies...
...loaded 258406 entries.
num idents: 258406
rank superkingdom: 2 distinct identifiers
rank phylum: 169 distinct identifiers
rank class: 419 distinct identifiers
rank order: 1312 distinct identifiers
rank family: 3264 distinct identifiers
rank genus: 12888 distinct identifiers
rank species: 47894 distinct identifiers
On a gather-with-lineages file:
% sourmash tax summarize SRR606249-k31.x.gtdb.gather.with-lineages.csv
== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loading taxonomies...
...loaded 84 entries.
num idents: 84
rank superkingdom: 2 distinct identifiers
rank phylum: 25 distinct identifiers
rank class: 32 distinct identifiers
rank order: 42 distinct identifiers
rank family: 52 distinct identifiers
rank genus: 60 distinct identifiers
rank species: 84 distinct identifiers
On the bad CSV file from #2326:
% sourmash tax summarize /Users/t/Downloads/cheesegenomes.lineages.csv
== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loading taxonomies...
ERROR while loading taxonomies!
cannot read taxonomy assignments from '/Users/t/Downloads/cheesegenomes.lineages.csv': No taxonomic identifiers found; headers are '\ufeffident','taxid','superkingdom','phylum','class','order','family','genus','species','strain'
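The `\ufeff` at the start of the first header is a Unicode byte order mark (BOM), which some tools prepend when exporting UTF-8 CSV. A small sketch using only the Python standard library shows how the BOM ends up glued to the first column name when the bytes are decoded as plain `utf-8`, and how the `utf-8-sig` codec strips it. The row below is made up, and this only illustrates the failure mode; it is not necessarily how sourmash reads, or should read, these files.

```python
import csv
import io

# Bytes of a tiny CSV as a BOM-writing exporter might produce it (made-up row).
raw = "ident,taxid,superkingdom\nGCF_000000001,1,d__Bacteria\n".encode("utf-8-sig")

# Decoding as plain utf-8 keeps the BOM as part of the first header name.
plain = csv.DictReader(io.StringIO(raw.decode("utf-8")))
print(plain.fieldnames)

# Decoding as utf-8-sig drops a leading BOM if one is present.
clean = csv.DictReader(io.StringIO(raw.decode("utf-8-sig")))
print(clean.fieldnames)
```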
CSV output of per-rank information
With CSV output,
% sourmash tax summarize gtdb-rs207.taxonomy.sqldb -o aaa.csv
== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loading taxonomies...
...loaded 317542 entries.
num idents: 317542
rank superkingdom: 2 distinct identifiers
rank phylum: 189 distinct identifiers
rank class: 481 distinct identifiers
rank order: 1593 distinct identifiers
rank family: 4107 distinct identifiers
rank genus: 16686 distinct identifiers
rank species: 65703 distinct identifiers
now calculating detailed lineage counts...
...done!
saved 88761 lineage counts to 'aaa.csv'
and `aaa.csv` looks like:
| | rank | count | lineage |
|---|---|---|---|
| 0 | superkingdom | 311480 | d__Bacteria |
| 1 | phylum | 141114 | d__Bacteria;p__Proteobacteria |
| 2 | class | 121804 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria |
| 3 | order | 74108 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales |
| 4 | family | 63971 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae |
| 5 | phylum | 61795 | d__Bacteria;p__Firmicutes |
| 6 | class | 61794 | d__Bacteria;p__Firmicutes;c__Bacilli |
| 7 | order | 32177 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales |
| 8 | phylum | 28532 | d__Bacteria;p__Actinobacteriota |
| 9 | genus | 27205 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia |
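The counts in this table are consistent with tallying, for every identifier, each prefix of its lineage (superkingdom, then superkingdom;phylum, and so on). A minimal sketch of that tally follows. It assumes a plain dict mapping identifiers to semicolon-separated lineage strings; the function name `count_lineage_prefixes` and the toy data are hypothetical and not taken from the sourmash implementation.

```python
from collections import Counter

RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def count_lineage_prefixes(lineages):
    """Count how many identifiers fall under each lineage prefix.

    `lineages` maps identifier -> 'd__X;p__Y;...' (assumed input shape).
    Returns (rank, count, lineage) rows sorted by descending count.
    """
    counts = Counter()
    for lineage in lineages.values():
        names = lineage.split(";")
        for depth, rank in enumerate(RANKS, start=1):
            if depth > len(names):
                break  # this lineage stops above the current rank
            counts[(rank, ";".join(names[:depth]))] += 1
    rows = [(rank, n, lin) for (rank, lin), n in counts.items()]
    rows.sort(key=lambda row: -row[1])
    return rows


# Toy input: three made-up identifiers with two-rank lineages.
toy = {
    "id1": "d__Bacteria;p__Proteobacteria",
    "id2": "d__Bacteria;p__Proteobacteria",
    "id3": "d__Bacteria;p__Firmicutes",
}
rows = count_lineage_prefixes(toy)
for row in rows:
    print(row)
```

Counting the number of distinct lineages per rank in `rows` would give per-rank totals of the kind printed in the summary output.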
Codecov Report
Merging #2333 (b5ac05c) into latest (2198b32) will increase coverage by 8.01%. The diff coverage is 79.20%.
@@ Coverage Diff @@
## latest #2333 +/- ##
==========================================
+ Coverage 83.97% 91.98% +8.01%
==========================================
Files 129 101 -28
Lines 14967 11518 -3449
Branches 2191 2215 +24
==========================================
- Hits 12568 10595 -1973
+ Misses 2104 621 -1483
- Partials 295 302 +7
| Flag | Coverage Δ | |
|---|---|---|
| python | 91.98% <79.20%> (-0.12%) | :arrow_down: |
| rust | ? | |
Flags with carried forward coverage won't be shown.
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/sourmash/tax/tax_utils.py | 95.98% <57.50%> (-2.35%) | :arrow_down: |
| src/sourmash/tax/__main__.py | 92.35% <91.30%> (-0.28%) | :arrow_down: |
| src/sourmash/cli/tax/__init__.py | 100.00% <100.00%> (ø) | |
| src/sourmash/cli/tax/summarize.py | 100.00% <100.00%> (ø) | |
| src/core/src/index/mod.rs | | |
| src/core/src/ffi/minhash.rs | | |
| src/core/src/ffi/nodegraph.rs | | |
| src/core/src/encodings.rs | | |
| src/core/src/index/search.rs | | |
| src/core/src/index/revindex.rs | | |
| ... and 23 more | | |