TranscodingStreams.jl

universal in-place decompression interface

Open Moelf opened this issue 2 years ago • 9 comments

By "in-place" I mean the user would pre-allocate a Vector{UInt8} or something similar as the sink.

Sometimes decompression is needed for low-level work, such as handling "buffers" in a file spec, where several decompressed chunks together make up the entire data blob.

It would be nice if there were an in-place interface that supports multiple algorithms through this central package.

Moelf avatar Feb 26 '23 02:02 Moelf

I've drafted a PR around Buffers to support faster decompression when loading Arrow files.

Would that support your use case as well? Or would you need the sink to be ByteData (instead of buffer)?

svilupp avatar Mar 09 '23 12:03 svilupp

Closed by #132 and released as 0.9.12

mkitti avatar Apr 11 '23 09:04 mkitti

I'm reopening this issue because I think there are some issues to work on with the current interface.

  1. The Buffer type used in #136 is internal. See #202
  2. For use in Zarr.jl it would be helpful to be able to decompress directly into for example a Vector{Float64} to avoid an extra copy.
  3. The underlying codec should be informed somehow that it is doing a fully in-place operation, so it can internally avoid extra buffering and copies. Ref: https://github.com/JuliaIO/CodecZstd.jl/pull/52

nhz2 avatar Jul 05 '24 14:07 nhz2

Somehow I missed @mkitti's original comment. Since the "closed by" refers to this very issue, what was the PR that supposedly fixed this?

Moelf avatar Jul 05 '24 17:07 Moelf

> For use in Zarr.jl it would be helpful to be able to decompress directly into for example a Vector{Float64} to avoid an extra copy.

This can't work directly. There are two possibilities: either you have buffer = reinterpret(UInt8, ...) and give this buffer to TranscodingStreams,

or you have data = reinterpret(Float64, buffer) and give the buffer to TranscodingStreams.
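
A minimal sketch of the two options, using only Base. The decompressor is replaced by a plain copyto! as a stand-in, and the names sink, payload and view64 are made up for illustration. Note that reinterpret(UInt8, data) returns a ReinterpretArray rather than a Vector{UInt8}, so an API that requires a Vector{UInt8} sink would probably only accept the second form.

```julia
# Bytes that a decompressor would normally produce (stand-in payload).
payload = collect(reinterpret(UInt8, [1.0, 2.0, 3.0, 4.0]))

# Option 1: start from the typed array and expose its bytes as the sink.
data = zeros(Float64, 4)
sink = reinterpret(UInt8, data)   # shares memory with `data`, no copy
copyto!(sink, payload)            # stand-in for "decompress into sink"

# Option 2: start from a byte buffer and view it as Float64 afterwards.
buffer = Vector{UInt8}(undef, 4 * sizeof(Float64))
copyto!(buffer, payload)          # stand-in for "decompress into buffer"
view64 = reinterpret(Float64, buffer)   # shares memory with `buffer`
```

In both cases the Float64 values are visible without a second copy of the decompressed bytes.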

Moelf avatar Jul 05 '24 17:07 Moelf

Yes, I think this would require a new unsafe_transcode! function that works directly with pointers.
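
A rough sketch of what a pointer-based call could look like from the caller's side. Everything here is an assumption for illustration: unsafe_copy_decode! is a hypothetical stand-in that just copies bytes, not the proposed unsafe_transcode! and not an existing TranscodingStreams.jl function. The point is only that a raw Ptr{UInt8} lets the output be a Vector{Float64}, and that the caller has to keep both arrays alive with GC.@preserve.

```julia
# Hypothetical stand-in for a pointer-based codec entry point.
# It "decodes" by copying bytes and returns the number of bytes written.
function unsafe_copy_decode!(dst::Ptr{UInt8}, dst_size::Int,
                             src::Ptr{UInt8}, src_size::Int)
    src_size <= dst_size || error("output buffer too small")
    unsafe_copyto!(dst, src, src_size)
    return src_size
end

encoded = collect(reinterpret(UInt8, [1.5, 2.5, 3.5]))  # stand-in input bytes
out = Vector{Float64}(undef, 3)                         # typed output array

# Keep both arrays rooted while raw pointers to them are in use.
nbytes = GC.@preserve encoded out begin
    unsafe_copy_decode!(convert(Ptr{UInt8}, pointer(out)), sizeof(out),
                        pointer(encoded), sizeof(encoded))
end
```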

nhz2 avatar Jul 05 '24 17:07 nhz2

Also, a general unsafe_transcode! interface could be useful for other packages that don't support or need a streaming API, such as Blosc.jl, LibDeflate.jl, JLD2.jl, HDF5.jl, Zarr.jl... so maybe it should go in a separate LosslessChunkCompressors.jl package and be added as a dependency here.

nhz2 avatar Jul 05 '24 17:07 nhz2

@Moelf @mkitti I have a draft interface for in-place encoding and decoding defined in https://github.com/nhz2/ChunkCodecs.jl/blob/main/ChunkCodecCore/src/interface.jl

The interface currently doesn't directly use pointers, which is nice for avoiding GC issues, but sometimes things don't work as expected, for example when decoding into a view of a PyArray: https://github.com/JuliaPy/PythonCall.jl/issues/579

nhz2 avatar Dec 10 '24 04:12 nhz2

That's interesting. I will try to take a closer look next week.

mkitti avatar Dec 10 '24 04:12 mkitti