sourmash icon indicating copy to clipboard operation
sourmash copied to clipboard

WIP: adjust warnings around tax abund and provide v5 upgrades to `tax metagenome`

Open ctb opened this issue 5 months ago • 2 comments

Add new command line arguments to tax metagenome around abundances. --use-abundances turns on the use of sketch abundances for krona and XXX output formats, while --ignore-abundances/--no-abundances turns them off.

Also provides a warning if no abundances are available for tax metagenome to use, or if --use-abundances/--ignore-abundances is not explicitly set (for sourmash v4).

For sourmash v5, --use-abundances becomes default and it is an error to not have abundances. This can be overridden with --ignore-abundances.

For sourmash v5, this PR also changes the default output format for tax genome and tax metagenome to -F human from -F csv_summary (https://github.com/sourmash-bio/sourmash/issues/2162).

Uses --v4/--v5 command line switches and test infrastructure, ref https://github.com/sourmash-bio/sourmash/issues/3076.

TODO:

  • [x] finish writing tests
  • [x] test: warning if no abundances provided/used (v4)
  • [x] test: no warning if --ignore-abund, and no abundances used (v4/v5)
  • [x] test: error if no abundances provided & no ignore (v5)
  • [x] test: error if no abundances provided and --use-abund
  • [x] test: abundances used if --use-abund (v4/v5)
  • [x] test: abundances used if v5
  • [x] test: default output format for metagenome is human (v5)
  • [x] test: default output format for genome is csv_summary (v4)
  • [x] test: abundances not used with tax genome
  • [ ] fix @CTB and XXX notes stuff
  • [ ] actually test & examine the krona and kreport output formats ;)
  • [ ] test: default output format for metagenome is human (v5)
  • [ ] test: default output format for genome is csv_summary (v4)
  • [ ] add comments to #3577 and #3598
  • [ ] add documentation

Fixes https://github.com/sourmash-bio/sourmash/issues/2162 Fixes https://github.com/sourmash-bio/sourmash/issues/3577 Fixes https://github.com/sourmash-bio/sourmash/issues/3598

ctb avatar Jun 27 '25 15:06 ctb

Codecov Report

:x: Patch coverage is 96.10390% with 3 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 88.19%. Comparing base (c94f143) to head (7ec27ae). :warning: Report is 54 commits behind head on latest.

Files with missing lines Patch % Lines
src/sourmash/tax/tax_utils.py 92.85% 1 Missing and 1 partial :warning:
src/sourmash/tax/__main__.py 97.56% 0 Missing and 1 partial :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##           latest    #3711      +/-   ##
==========================================
+ Coverage   88.17%   88.19%   +0.01%     
==========================================
  Files         137      137              
  Lines       22491    22552      +61     
  Branches     2281     2298      +17     
==========================================
+ Hits        19831    19889      +58     
- Misses       2347     2348       +1     
- Partials      313      315       +2     
Flag Coverage Δ
hypothesis-py 25.27% <9.09%> (-0.12%) :arrow_down:
python 92.66% <96.10%> (+0.01%) :arrow_up:
rust 81.62% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Jun 27 '25 15:06 codecov[bot]

  • tax metagenome in v5 (and with --v5) should REQUIRE abundances unless an option (--no-abund?) is provided;
  • tax metagenome in v4 should WARN or ERROR if (a) no abundances are provided (b) abundances are available but not being used

I like it!

bluegenes avatar Jun 27 '25 15:06 bluegenes

See new validation workflow here: https://github.com/sourmash-bio/2025-validate-sourmash-tax-formats.

ctb avatar Jul 29 '25 11:07 ctb

Ready for review @bluegenes!

This got really complicated from a code and testing perspective. My suggestion for detailed review efforts are to confirm that the various test_metagenome_abund_reporting_* tests match the PR descr and docs... Those tests should be comprehensive for the abundance-related changes in this PR.

ctb avatar Jul 31 '25 14:07 ctb