CodecZstd.jl icon indicating copy to clipboard operation
CodecZstd.jl copied to clipboard

Add ZstdFrameCompressor and ZstdFrameDecompressor for non-streaming API

Open mkitti opened this issue 1 year ago • 13 comments

Also add ZstdError exception type for more descriptive errors. ZstdError is only implemented for the new compressors. They will be applied to the streaming compressors for a breaking version bump.

mkitti avatar May 14 '24 08:05 mkitti

Codecov Report

Attention: Patch coverage is 75.65217% with 28 lines in your changes are missing coverage. Please review.

Project coverage is 42.85%. Comparing base (ad7288a) to head (2f697b7).

Files Patch % Lines
src/frameCompression.jl 76.08% 11 Missing :warning:
src/frameDecompression.jl 80.00% 9 Missing :warning:
src/libzstd.jl 68.18% 7 Missing :warning:
src/compression.jl 50.00% 1 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##           master      #46      +/-   ##
==========================================
+ Coverage   33.01%   42.85%   +9.84%     
==========================================
  Files           5        7       +2     
  Lines         524      637     +113     
==========================================
+ Hits          173      273     +100     
- Misses        351      364      +13     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar May 14 '24 08:05 codecov[bot]

This is meant to support https://github.com/JuliaIO/JLD2.jl/pull/560 which requires the non-streaming compression API where blocks know their decompressed size.

mkitti avatar May 14 '24 08:05 mkitti

Could you try to dev this with your JLD2.jl changes, @milankl?

mkitti avatar May 14 '24 08:05 mkitti

Could the ZSTD_CCtx_setPledgedSrcSize function in https://facebook.github.io/zstd/zstd_manual.html#Chapter5 be used to set the frame size and also keep the streaming API? Also is there a reason for ZstdFrameDecompressor? IIUC the regular ZstdDecompressor should work with both formats.

nhz2 avatar May 14 '24 14:05 nhz2

Also, not directly related to this PR, but maybe there should be a new package for supporting non-streaming codecs with a simpler API like https://numcodecs.readthedocs.io/en/stable/abc.html

nhz2 avatar May 14 '24 14:05 nhz2

After installing your branch here and my branch from https://github.com/JuliaIO/JLD2.jl/pull/560, I get this

julia> using JLD2, CodecZstd, HDF5, H5Zzstd
 │ Package H5Zzstd not found, but a package named H5Zzstd is available from a
 │ registry.
 │ Install package?
 │   (@v1.10) pkg> add H5Zzstd
 └ (y/n/o) [y]: y
   Resolving package versions...
ERROR: Unsatisfiable requirements detected for package H5Zzstd [f6f2d980]:
 H5Zzstd [f6f2d980] log:
 ├─possible versions are: 0.1.0-0.1.1 or uninstalled
 ├─restricted to versions * by an explicit requirement, leaving only versions: 0.1.0-0.1.1
 └─restricted by compatibility requirements with CodecZstd [6b39b394] to versions: uninstalled — no versions left
   └─CodecZstd [6b39b394] log:
     ├─possible versions are: 0.8.2 or uninstalled
     └─CodecZstd [6b39b394] is fixed to version 0.8.2

Not sure I can make sense of this?

EDIT: This is because with

(@v1.10) pkg> st H5Zzstd
Status `~/.julia/environments/v1.10/Project.toml`
  [f6f2d980] H5Zzstd v0.1.1

The CodecZstd version is limited to


(@v1.10) pkg> status --outdated
Status `~/.julia/environments/v1.10/Project.toml`
⌅ [6b39b394] CodecZstd v0.7.2 (<v0.8.2): H5Zzstd

milankl avatar May 14 '24 18:05 milankl

IIUC the regular ZstdDecompressor should work with both formats.

I can test this @nhz2.

EDIT: Yes, works as before so doesn't matter whether setting ZstdDecompressor or ZstdFrameDecompressor in JLD2, see https://github.com/JuliaIO/JLD2.jl/pull/560

julia> using JLD2, CodecZstd

julia> A = zeros(1000,1000);

julia> A[1] = rand()
0.6101634539467033

julia> save("test_with_zstd_compression.jld2", "A", A, compress=ZstdFrameCompressor())

julia> A == load("test_with_zstd_compression.jld2", "A")
true

milankl avatar May 14 '24 18:05 milankl

@mkitti this works:

julia> using JLD2, CodecZstd

julia> A = zeros(1000,1000);

julia> A[1] = rand()
0.1653492903631878

julia> save("test_with_zstd_compression.jld2", "A", A, compress=ZstdFrameCompressor())

julia> A == load("test_with_zstd_compression.jld2", "A")
true

while compres=ZstdCompressor() throws an unsupported error as expected... And decompression via HDF5 too (in another environment due to the pkg conflicts above)

julia> using HDF5, H5Zzstd

julia> h5open("test_with_zstd_compression.jld2") do h5f
          h5f["A"][]
       end
1000×1000 Matrix{Float64}:
 0.165349  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 ⋮                             ⋮                   ⋱                      ⋮
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0       0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0

milankl avatar May 14 '24 18:05 milankl

Also, not directly related to this PR, but maybe there should be a new package for supporting non-streaming codecs with a simpler API like https://numcodecs.readthedocs.io/en/stable/abc.html

At the end of the day, I think the public API for transcoding streams is actually simpler and more flexible. I'm involved over in numcodecs as well actually.

Transcoding streams is simpler in that there is only transcode rather than encode and decode. However, we can also stream.

The implementation here is trickier initially, but ultimately but can be manipulated for a lot of circumstances.

You are correct that the non-streaming API is basically just a special case of the streaming API. The non-streaming "frame" API assumes

  1. We have all the bytes needed immediately available.
  2. We know the total length of the result.

An important consequence of the above is that we can do a single allocation for the result whereas for streaming we allow for multiple allocations to extend the buffers as needed.

mkitti avatar May 14 '24 18:05 mkitti

I don't have any opinion on the streaming vs non-streaming API... I just would like to get his here merged so that we can merge https://github.com/JuliaIO/JLD2.jl/pull/560 and have Zstd compression in JLD2 😉

milankl avatar May 14 '24 18:05 milankl

Can this be implemented as a new pledgedSrcSize keyword argument with a default value of ZSTD_CONTENTSIZE_UNKNOWN when constructing the ZstdCompressor? The docs for ZSTD_getFrameContentSize say decompressed size is an optional field.

nhz2 avatar May 14 '24 19:05 nhz2

Can this be implemented as a new pledgedSrcSize keyword argument with a default value of ZSTD_CONTENTSIZE_UNKNOWN when constructing the ZstdCompressor? The docs for ZSTD_getFrameContentSize say decompressed size is an optional field.

If we pursue this path, we would need to figure out how to modify JLD2.jl as well to work with this.

mkitti avatar May 14 '24 22:05 mkitti

Okay, I took a look at https://github.com/JuliaIO/JLD2.jl/pull/560 and why it would be nicer to have a special ZstdFrameCompressor that matches what HDF5 is doing. Maybe adding an internal buffer to build up the frame before calling ZSTD_compress2 when input size is zero would work well. I think the same approach could be done to add Blosc and LZ4.

nhz2 avatar May 14 '24 23:05 nhz2

I'm starting to favor the approach in https://github.com/JuliaIO/CodecZstd.jl/pull/49 where we might be able to get the streaming API to save the encoded size in the frame by providing endOp = :end. I'm not sure how to access that keyword via the TranscodingStreams API yet.

We may end using a similar API that exists here. For example, ZstdFrameCompressor() would create a ZstdCompressor that would initially pass endOp = :end.

Alternatively, we try the approach in https://github.com/JuliaIO/TranscodingStreams.jl/pull/215.

mkitti avatar May 19 '24 20:05 mkitti

Closing in favor of #52 .

mkitti avatar May 29 '24 03:05 mkitti