sourmash icon indicating copy to clipboard operation
sourmash copied to clipboard

[MRG] Add `FrozenSourmashSignature`

Open ctb opened this issue 3 years ago • 3 comments

Adds a FrozenSourmashSignature class, and provides sensible to_mutable() and to_frozen() methods on SourmashSignature and FrozenSourmashSignature. Provides an update() context manager that wraps changes so that a FrozenSourmashSignature is left at the end.

Complement to #1508, which did this with MinHash.

As with #1508, very few changes are required to make all the tests pass. The ease of doing this suggests to me that this is a Good Change.

Also as with #1508, I really like this updated code; it's so nice and clear!

I am hoping that these changes make it easier for @luizirber in Rust/Python interaction.

On the other hand, this could break some 3rd party code (as #1508 did :). Admittedly we are the owners of most of that code, but 🤷 not sure if we should delay this 'til 5.0. On the other hand, could fit nicely with a 4.2.0 release.

TODO:

  • [x] update places that unnecessarily call flatten and downsample (b/c signatures are already flat and/or downsampled)
  • [x] check that SourmashSignature by default owns a frozen MinHash (I think it does, but confirm)
  • [x] check that SourmashSignature.to_frozen() also freezes the MinHash object.
  • [x] maybe make a new method that does the ss = ss.to_mutable(); ss.minhash = xyz; ss = ss.to_frozen() dance;
  • [x] OR, use a context manager? with sig.mutate() as sig:?
  • [x] revisit into_frozen() and into_mutable()
  • [x] add docstrings :)
  • [x] consider making error messages clearer & providing guidance around using to_mutable()

ctb avatar Jun 19 '21 15:06 ctb

Codecov Report

Merging #1610 (bee45ad) into latest (b1b4ef7) will increase coverage by 7.40%. The diff coverage is 91.34%.

@@            Coverage Diff             @@
##           latest    #1610      +/-   ##
==========================================
+ Coverage   84.70%   92.10%   +7.40%     
==========================================
  Files         131      100      -31     
  Lines       15544    11343    -4201     
  Branches     2219     2236      +17     
==========================================
- Hits        13166    10448    -2718     
+ Misses       2085      601    -1484     
- Partials      293      294       +1     
Flag Coverage Δ
python 92.10% <91.34%> (+0.04%) :arrow_up:
rust ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/signature.py 93.86% <84.09%> (+2.05%) :arrow_up:
src/sourmash/minhash.py 93.76% <90.90%> (-0.27%) :arrow_down:
src/sourmash/commands.py 90.78% <96.87%> (+0.06%) :arrow_up:
src/sourmash/index/__init__.py 96.71% <100.00%> (ø)
src/sourmash/lca/command_classify.py 81.48% <100.00%> (+0.71%) :arrow_up:
src/sourmash/lca/lca_db.py 93.29% <100.00%> (+0.01%) :arrow_up:
src/sourmash/search.py 97.94% <100.00%> (+<0.01%) :arrow_up:
src/sourmash/sig/__main__.py 94.02% <100.00%> (+0.02%) :arrow_up:
src/core/src/cmd.rs
... and 31 more

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

codecov-commenter avatar Jun 19 '21 15:06 codecov-commenter

@luizirber opinions?

ctb avatar Jun 20 '21 01:06 ctb

Ready for review & merge @sourmash-bio/devs !

ctb avatar Aug 03 '22 16:08 ctb

hi @luizirber I think this PR is best reviewed by you, if you have time. It's a quick read, I promise ;). And you might have a suggestion for how to improve, too - I'm not thrilled with the context manager syntax I use.

ctb avatar Aug 26 '22 13:08 ctb

I have a bunch of commits in #2230 that flip the MinHash to default to large ones (b-tree based) and expose the Frozen ones (vec-based), should I add the commits here or make a new PR merging into this one?

(I can also just merge them together in #2230, since that will be a breaking change version bump on the Rust side and I would like to avoid doing two in a row :grimacing: )

luizirber avatar Aug 28 '22 16:08 luizirber

I have a bunch of commits in #2230 that flip the MinHash to default to large ones (b-tree based) and expose the Frozen ones (vec-based), should I add the commits here or make a new PR merging into this one?

(I can also just merge them together in #2230, since that will be a breaking change version bump on the Rust side and I would like to avoid doing two in a row 😬 )

your call - if last option is easiest, go for it!

ctb avatar Aug 28 '22 16:08 ctb

I tested out with the pile of changes coming in #2230 and it works. In the future might want to avoid all the copying, but as you pointed in the comment for the context manager it is hard to track bugs otherwise...

Another future optimization is being more strict on the Mutable/Frozen on the Rust side, right now they map to

  • Mutable: KmerMinHashBTree
  • Frozen: KmerMinHash but they are both mutable (it's just that the Frozen one is much slower to mutate with large sketches).

luizirber avatar Aug 28 '22 23:08 luizirber

awesome, thanks!! should I merge, then?

ctb avatar Aug 28 '22 23:08 ctb

awesome, thanks!! should I merge, then?

Yup, go for it!

luizirber avatar Aug 29 '22 00:08 luizirber

🎉

ctb avatar Aug 29 '22 01:08 ctb

In the future might want to avoid all the copying, but as you pointed in the comment for the context manager it is hard to track bugs otherwise...

I would have put more effort in to this, but note that SourmashSignature really only contains name and filename. The .minhash property is a FrozenMinHash that is copied around by reference. So all the copying doesn't really matter for efficiency, or at least that's what I decided after reading through the code a bunch of times :)

ctb avatar Aug 29 '22 12:08 ctb