TranscodingStreams.jl

Get length of compressed stream so far without closing stream?

Open robertfeldt opened this issue 6 years ago • 2 comments

I would like to get the length of the compressed stream so far, without closing the stream or affecting continued compression. I understand that most codecs might not support this, given their internal block lengths etc., but maybe there are ways to get close to this behavior?

The use case is something like this:

  • We have a very long string/stream which has been compressed already, C(s_long)
  • We now have a set of N shorter strings S_shorts = [s1, ..., sN] and we want to calculate map(length, [C(s_long * s1), ..., C(s_long * sN)]) but without having to redo the whole C(s_long) compression for each of the shorter strings si (since calculating C(s_long) might be costly in time).
  • Note that we only need the lengths of all the C(s_long * si), not their actual bytes.

Any ideas how this can be done as fast as possible? :)

Currently I basically do Huffman coding or dictionary-based compression by hand, so I can save the intermediate tree/dictionary and reuse it for each of the short strings. It would be nice if there were a way to use more advanced compressors, like the CodecX ones in the TranscodingStreams framework.

robertfeldt, Dec 04 '19 11:12

This seems similar to https://stackoverflow.com/questions/11662745/how-can-one-copy-the-internal-state-of-zlib-compressor-object-in-python

I think a potential solution would be to add deepcopy support for Codecs.

nhz2, Mar 17 '24 20:03
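
For illustration, the state-copy idea from the linked Stack Overflow question can be sketched with Python's standard `zlib` module, whose compressor objects have a `copy()` method. This is only a sketch of the technique in Python, not a TranscodingStreams.jl API: the function name `suffix_lengths` and its parameters are hypothetical, and whether a given Julia codec could support an equivalent copy is exactly what is being discussed here.

```python
import zlib


def suffix_lengths(s_long, shorts, level=6):
    """Return the compressed length of s_long + s for each s in shorts.

    s_long is compressed only once; each short string is then compressed
    on a copy of the compressor state. (Hypothetical helper, for
    illustration only.)
    """
    base = zlib.compressobj(level)
    # Bytes emitted so far for the prefix. More data may still be
    # buffered inside the compressor, which is why this number alone
    # is not the final length.
    prefix_len = len(base.compress(s_long))

    lengths = []
    for s in shorts:
        fork = base.copy()  # duplicate the internal state; base is untouched
        tail = fork.compress(s) + fork.flush()  # finish only the copy
        lengths.append(prefix_len + len(tail))
    return lengths


if __name__ == "__main__":
    s_long = b"abracadabra " * 5000
    print(suffix_lengths(s_long, [b"abra", b"xyz123"]))
```

Each copy duplicates the compressor's internal state, including its sliding window, so the cost per short string is the copy plus compressing the short string, rather than recompressing `s_long`. Whether that copy is cheap enough depends on the codec and its window size.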

Yes, deepcopy would really solve this. I'm not sure it would be very performant (which is crucial in my case), but it's worth trying if there is a general use case for supporting deepcopy (at least for some codecs).

robertfeldt, Mar 18 '24 09:03