Add ZstdFrameCompressor and ZstdFrameDecompressor for non-streaming API
Also add ZstdError exception type for more descriptive errors. ZstdError is only implemented for the new compressors. They will be applied to the streaming compressors for a breaking version bump.
Codecov Report
Attention: Patch coverage is 75.65217% with 28 lines in your changes are missing coverage. Please review.
Project coverage is 42.85%. Comparing base (
ad7288a) to head (2f697b7).
| Files | Patch % | Lines |
|---|---|---|
| src/frameCompression.jl | 76.08% | 11 Missing :warning: |
| src/frameDecompression.jl | 80.00% | 9 Missing :warning: |
| src/libzstd.jl | 68.18% | 7 Missing :warning: |
| src/compression.jl | 50.00% | 1 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #46 +/- ##
==========================================
+ Coverage 33.01% 42.85% +9.84%
==========================================
Files 5 7 +2
Lines 524 637 +113
==========================================
+ Hits 173 273 +100
- Misses 351 364 +13
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
This is meant to support https://github.com/JuliaIO/JLD2.jl/pull/560 which requires the non-streaming compression API where blocks know their decompressed size.
Could you try to dev this with your JLD2.jl changes, @milankl?
Could the ZSTD_CCtx_setPledgedSrcSize function in https://facebook.github.io/zstd/zstd_manual.html#Chapter5 be used to set the frame size and also keep the streaming API? Also is there a reason for ZstdFrameDecompressor? IIUC the regular ZstdDecompressor should work with both formats.
Also, not directly related to this PR, but maybe there should be a new package for supporting non-streaming codecs with a simpler API like https://numcodecs.readthedocs.io/en/stable/abc.html
After installing your branch here and my branch from https://github.com/JuliaIO/JLD2.jl/pull/560, I get this
julia> using JLD2, CodecZstd, HDF5, H5Zzstd
│ Package H5Zzstd not found, but a package named H5Zzstd is available from a
│ registry.
│ Install package?
│ (@v1.10) pkg> add H5Zzstd
└ (y/n/o) [y]: y
Resolving package versions...
ERROR: Unsatisfiable requirements detected for package H5Zzstd [f6f2d980]:
H5Zzstd [f6f2d980] log:
├─possible versions are: 0.1.0-0.1.1 or uninstalled
├─restricted to versions * by an explicit requirement, leaving only versions: 0.1.0-0.1.1
└─restricted by compatibility requirements with CodecZstd [6b39b394] to versions: uninstalled — no versions left
└─CodecZstd [6b39b394] log:
├─possible versions are: 0.8.2 or uninstalled
└─CodecZstd [6b39b394] is fixed to version 0.8.2
Not sure I can make sense of this?
EDIT: This is because with
(@v1.10) pkg> st H5Zzstd
Status `~/.julia/environments/v1.10/Project.toml`
[f6f2d980] H5Zzstd v0.1.1
The CodecZstd version is limited to
(@v1.10) pkg> status --outdated
Status `~/.julia/environments/v1.10/Project.toml`
⌅ [6b39b394] CodecZstd v0.7.2 (<v0.8.2): H5Zzstd
IIUC the regular
ZstdDecompressorshould work with both formats.
I can test this @nhz2.
EDIT: Yes, works as before so doesn't matter whether setting ZstdDecompressor or ZstdFrameDecompressor in JLD2, see https://github.com/JuliaIO/JLD2.jl/pull/560
julia> using JLD2, CodecZstd
julia> A = zeros(1000,1000);
julia> A[1] = rand()
0.6101634539467033
julia> save("test_with_zstd_compression.jld2", "A", A, compress=ZstdFrameCompressor())
julia> A == load("test_with_zstd_compression.jld2", "A")
true
@mkitti this works:
julia> using JLD2, CodecZstd
julia> A = zeros(1000,1000);
julia> A[1] = rand()
0.1653492903631878
julia> save("test_with_zstd_compression.jld2", "A", A, compress=ZstdFrameCompressor())
julia> A == load("test_with_zstd_compression.jld2", "A")
true
while compres=ZstdCompressor() throws an unsupported error as expected...
And decompression via HDF5 too (in another environment due to the pkg conflicts above)
julia> using HDF5, H5Zzstd
julia> h5open("test_with_zstd_compression.jld2") do h5f
h5f["A"][]
end
1000×1000 Matrix{Float64}:
0.165349 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
⋮ ⋮ ⋱ ⋮
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 … 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Also, not directly related to this PR, but maybe there should be a new package for supporting non-streaming codecs with a simpler API like https://numcodecs.readthedocs.io/en/stable/abc.html
At the end of the day, I think the public API for transcoding streams is actually simpler and more flexible. I'm involved over in numcodecs as well actually.
Transcoding streams is simpler in that there is only transcode rather than encode and decode. However, we can also stream.
The implementation here is trickier initially, but ultimately but can be manipulated for a lot of circumstances.
You are correct that the non-streaming API is basically just a special case of the streaming API. The non-streaming "frame" API assumes
- We have all the bytes needed immediately available.
- We know the total length of the result.
An important consequence of the above is that we can do a single allocation for the result whereas for streaming we allow for multiple allocations to extend the buffers as needed.
I don't have any opinion on the streaming vs non-streaming API... I just would like to get his here merged so that we can merge https://github.com/JuliaIO/JLD2.jl/pull/560 and have Zstd compression in JLD2 😉
Can this be implemented as a new pledgedSrcSize keyword argument with a default value of ZSTD_CONTENTSIZE_UNKNOWN when constructing the ZstdCompressor? The docs for ZSTD_getFrameContentSize say decompressed size is an optional field.
Can this be implemented as a new
pledgedSrcSizekeyword argument with a default value ofZSTD_CONTENTSIZE_UNKNOWNwhen constructing theZstdCompressor? The docs forZSTD_getFrameContentSizesay decompressed size is an optional field.
If we pursue this path, we would need to figure out how to modify JLD2.jl as well to work with this.
Okay, I took a look at https://github.com/JuliaIO/JLD2.jl/pull/560 and why it would be nicer to have a special ZstdFrameCompressor that matches what HDF5 is doing. Maybe adding an internal buffer to build up the frame before calling ZSTD_compress2 when input size is zero would work well. I think the same approach could be done to add Blosc and LZ4.
I'm starting to favor the approach in https://github.com/JuliaIO/CodecZstd.jl/pull/49 where we might be able to get the streaming API to save the encoded size in the frame by providing endOp = :end. I'm not sure how to access that keyword via the TranscodingStreams API yet.
We may end using a similar API that exists here. For example, ZstdFrameCompressor() would create a ZstdCompressor that would initially pass endOp = :end.
Alternatively, we try the approach in https://github.com/JuliaIO/TranscodingStreams.jl/pull/215.
Closing in favor of #52 .