numcodecs icon indicating copy to clipboard operation
numcodecs copied to clipboard

Proof of concept for depending on external blosc python package

Open dstansby opened this issue 1 year ago • 2 comments

This is a proof of concept for depending on the blosc Python package for blosc compression, instead of using the sources we build ourselves. The goal here is to proove it's possible in as few as lines as possible.

If we go ahead with this we would need to first deprecate the following functions in numcodecs.blosc before the replacement could be done:

[
'destroy',
'init',
'compname_to_compcode',
'cbuffer_sizes',
'cbuffer_metainfo',
'err_bad_cname',
'decompress_partial'
]

TODO:

  • [ ] Unit tests and/or doctests in docstrings
  • [ ] Tests pass locally
  • [ ] Docstrings and API docs for any new/modified user-facing classes and functions
  • [ ] Changes documented in docs/release.rst
  • [ ] Docs build locally
  • [ ] GitHub Actions CI passes
  • [ ] Test coverage to 100% (Codecov passes)

dstansby avatar Aug 31 '24 19:08 dstansby

Codecov Report

:x: Patch coverage is 91.66667% with 7 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 90.77%. Comparing base (506c89b) to head (da13e9a). :warning: Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
numcodecs/blosc.py 91.02% 7 Missing :warning:

:x: Your patch status has failed because the patch coverage (91.66%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage. :x: Your project status has failed because the head coverage (90.77%) is below the target coverage (100.00%). You can increase the head coverage or adjust the target coverage.

:exclamation: There is a different number of reports uploaded between BASE (506c89b) and HEAD (da13e9a). Click for more details.

HEAD has 10 uploads less than BASE
Flag BASE (506c89b) HEAD (da13e9a)
15 5
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #571      +/-   ##
==========================================
- Coverage   99.96%   90.77%   -9.19%     
==========================================
  Files          64       65       +1     
  Lines        2789     2819      +30     
==========================================
- Hits         2788     2559     -229     
- Misses          1      260     +259     
Files with missing lines Coverage Δ
numcodecs/__init__.py 100.00% <ø> (ø)
numcodecs/tests/test_blosc.py 100.00% <100.00%> (ø)
numcodecs/blosc.py 91.02% <91.02%> (ø)

... and 2 files with indirect coverage changes

:rocket: New features to boost your workflow:
  • :snowflake: Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

codecov[bot] avatar Aug 31 '24 19:08 codecov[bot]

looking back at the commit history, it seems like blosc has been vendored in this repo for a very long time. @alimanfoo and / or @jakirkham what do you think about this PR?

d-v-b avatar Sep 01 '24 08:09 d-v-b

Since we are using the zstd and lz4 code from within the c-blosc repo, how would we deliver those codecs when using the blosc wheels?

normanrz avatar Nov 09 '24 11:11 normanrz

Hi @dstansby, this was a while ago now but IIRC part of the original motivation for vendoring was to be able to minimise memory copies. So maybe worth a check on whether using the Python blosc package would introduce any extra memory copies.

alimanfoo avatar Nov 21 '24 16:11 alimanfoo

Hey @dstansby, can you remind me why it is desirable to use the external blosc package instead of the submodule? I can see a number of downsides but I am not clear about the upsides.

normanrz avatar Jan 04 '25 20:01 normanrz

My idea here was to do a minimal proof of concept, to see if the tests still pass (ie, data is still encoded the same way) if we depend on the blosc Python package.

For me, the advantages would be:

  • Remove the need to compile c-blosc every time numcodecs is installed from source
  • Remove a bunch of wrapper code from numcodecs, reducing maintenance burden

I'm not really sure if this is desirable or not overall though, and from reading various PRs and issues I think there's still an open question about the future of numcodecs. I certainly don't intend to merge this any time soon and without wider discusssion (which unfortunately I don't have the time to make happen). Hopefully this shows that it's technically possible to switch to the Python blosc package though.

dstansby avatar Jan 04 '25 22:01 dstansby