
Supporting serde serialization and deserialization of the encoder object

Open Some1and2-XC opened this issue 10 months ago • 8 comments

Hey, I'm not really sure how difficult this would be but would it be at all possible to make the ParCompress<Zlib> encoder support serialization?

The use case for this is to allow saving the encoder state to disk so that the application can recover from errors.

Some1and2-XC avatar Mar 04 '25 16:03 Some1and2-XC

Hi!

I think that would be pretty tricky on the full object: https://github.com/sstadick/gzp/blob/f31a4dd622d61e3ba9139935fc71644a3f4f1d42/src/par/compress.rs#L150. Specifically, I don't think the channels and JoinHandle will serialize nicely.

You could theoretically grab the dictionary though. Can you describe more about how application errors would be recovered? The threads can crash, but you should still be able to get errors out of the JoinHandles. What kind of error are you seeing? Or is it an error elsewhere in your app that causes the failure, and you want the encoder dictionary to pick up where it left off?
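
To illustrate the dictionary-only idea, here is a minimal sketch of serializing just the recoverable piece of the state, assuming the dictionary can be pulled out as raw bytes. The SavedState struct is hypothetical, not something in gzp:

```rust
use serde::{Deserialize, Serialize};

// Only the recoverable piece of the compressor state; the channels and
// JoinHandles stay behind and get rebuilt on restart.
#[derive(Serialize, Deserialize, PartialEq, Debug)]
struct SavedState {
    dictionary: Option<Vec<u8>>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let state = SavedState {
        dictionary: Some(vec![0x78, 0x9c]),
    };
    // Round-trip through JSON as a stand-in for whatever on-disk
    // format you'd actually use.
    let encoded = serde_json::to_string(&state)?;
    let decoded: SavedState = serde_json::from_str(&encoded)?;
    assert_eq!(state, decoded);
    Ok(())
}
```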

sstadick avatar Mar 04 '25 16:03 sstadick

Yeah, I was referring more to errors in other sections of the application, specifically a long-running application that does compression. I want to be able to recreate the state of the compressor even if the entire application crashes and I need to relaunch the binary.

I suppose this question of yours is a good explanation of what I had in mind:

Or is it an error elsewhere in your app that causes the failure, and you want the encoder dictionary to pick up where it left off?

Do you think serde is even the right tool for this? Maybe making some of the attributes like the dictionary public, and adding something to ParCompressBuilder to allow setting the dictionary? I'd love to hear what you think.

Some1and2-XC avatar Mar 04 '25 17:03 Some1and2-XC

Making the dictionary accessible, and also allowing a dictionary to be passed in, does seem like the way to go for that.

You may have to just dump the dictionary at some interval, but the Bytes crate has a serde feature, so that should all wire together okay.

I don't think that I'm going to be able to implement this myself anytime soon, but would be totally open to PRs! I think you described the work already (a rough sketch follows the list):

  • A getter returning an Option<&Bytes> from the ParCompress.
  • An update to the ParCompressBuilder to accept some Option<Bytes> as the dictionary value, defaulting to None.
  • A top-level serde feature that in turn enables the serde feature on the Bytes crate.
  • A few tests, gated by the serde feature.
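
Roughly, the first two items could look something like this. This is only a sketch: the struct and method names are illustrative stand-ins, not gzp's actual types.

```rust
use bytes::Bytes;

// Stand-in for the real struct, carrying only the field relevant here.
struct ParCompressSketch {
    dictionary: Option<Bytes>,
}

impl ParCompressSketch {
    /// Getter returning the current dictionary, if any.
    fn dictionary(&self) -> Option<&Bytes> {
        self.dictionary.as_ref()
    }
}

#[derive(Default)]
struct ParCompressBuilderSketch {
    dictionary: Option<Bytes>,
}

impl ParCompressBuilderSketch {
    /// Builder-style setter; the dictionary defaults to `None`.
    fn dictionary(mut self, dict: Option<Bytes>) -> Self {
        self.dictionary = dict;
        self
    }
}

// The feature wiring in Cargo.toml would look roughly like:
//   [features]
//   serde = ["bytes/serde"]
```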

sstadick avatar Mar 04 '25 17:03 sstadick

Okay, sounds great! I'll see what I can do.

Some1and2-XC avatar Mar 04 '25 18:03 Some1and2-XC

@sstadick sorry it took so long. I've had lots of other things to focus on. Here is a link to the diff between my fork and the current repo.

I've implemented several functions on the ParCompressBuilder struct and just wanted to get your thoughts on some API decisions. To be able to partially compress data and come back to it, some flags and other things needed to be added to ParCompressBuilder. Namely:

  • a way of setting the dictionary.
  • a flag for whether the header should be written.
  • a flag for whether the footer should be written.
  • some way of passing in a previously computed checksum.
  • some way of getting access to the inner checksum before the compressor is finished and dropped.

Most of these were relatively trivial except for returning the inner checksum. I used a flume unbounded channel, since I saw one used elsewhere in the codebase as a "oneshot" channel, and added a member function on ParCompressBuilder for obtaining such a channel.
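
As a rough illustration of that oneshot pattern (the names and the u32 checksum type here are assumptions, not the actual gzp internals):

```rust
use flume::{unbounded, Receiver, Sender};
use std::thread;

fn main() {
    // The builder would hand the caller the receiving half and keep
    // the sending half inside the compressor.
    let (tx, rx): (Sender<u32>, Receiver<u32>) = unbounded();

    // The compressor thread sends the final checksum exactly once on
    // exit, which is why an unbounded channel works as a oneshot here.
    let compressor = thread::spawn(move || {
        let final_checksum: u32 = 0xDEAD_BEEF; // stand-in value
        let _ = tx.send(final_checksum);
    });
    compressor.join().unwrap();

    // The caller can then pull out the single value after the
    // compressor has finished.
    let checksum = rx.recv().unwrap();
    assert_eq!(checksum, 0xDEAD_BEEF);
}
```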

I haven't touched the serde functionality quite yet; however, I think this will become a very nice feature that doesn't really exist in any other library. Also, the only information needed to serialize a compressor seems to be the dictionary and the checksum, which are attributes that are relatively easy to get access to, so long as the new ParCompress::write_footer_on_exit is set to false.

I also haven't tested anything beyond the Zlib implementation, even though the changes were made to the generic compressor. I wouldn't feel comfortable merging anything until more testing is done, both on that implementation and on the other compressors, to see how they handle the new additions.

Let me know your thoughts!

Thanks, Some1and2

Some1and2-XC avatar Apr 17 '25 01:04 Some1and2-XC

Thanks @Some1and2-XC I'll take a look over the weekend and get back to you 👍

sstadick avatar Apr 17 '25 21:04 sstadick

@Some1and2-XC I got a chance to go over what you have so far. I think it seems like an okay direction. Agreed that we'd want to test it on all the different feature flag combos. I'd also like to do some benchmarking when it's closer to final form (nothing crazy, we could use crabz to do it) to make sure there isn't some unforeseen overhead.

Can you tell me more about the overall use case? Even though your changes are really pretty straightforward, I'm hesitant to add complexity where things have been pretty stable overall. I'd like to make sure this use case is broadly applicable enough to justify adding more code.

sstadick avatar Apr 21 '25 16:04 sstadick

@sstadick The idea is that you can start/stop compression jobs.

I built a proof-of-concept compression utility that you can ctrl+c halfway through compression and then continue from where you left off. With this implementation, I found you probably don't actually need access to the dictionary from an outside API perspective (the zlib stream broke when I initialized the compressor with the dictionary set to anything other than None).

I also finished implementing serde on top of the check trait here, though I might revise the way I did it, because I didn't require serde as a trait bound and instead just used additional functions. The way it is now is sufficient, though.

My real use case is that I want to make the largest image in the world, which I generated, even bigger. I've had problems with Linux GPU drivers randomly failing, and I want a way to recover a compression job without completely starting over (if other failure recovery mechanisms don't work).

This is important because PNG requires a continuous zlib stream.
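
To sketch the start/stop flow I have in mind (the ResumePoint struct and both function names are hypothetical, not part of the fork):

```rust
use serde::{Deserialize, Serialize};
use std::fs::{File, OpenOptions};

// The minimal state needed to pick a compression job back up.
#[derive(Serialize, Deserialize)]
struct ResumePoint {
    bytes_written: u64,
    checksum: u32,
}

/// Persist the resume point alongside the partial output.
fn save_resume_point(point: &ResumePoint) -> serde_json::Result<String> {
    serde_json::to_string(point)
}

/// Reopen the partial output for appending; appending keeps the zlib
/// stream continuous, which PNG requires. A real resume would then
/// rebuild the compressor seeded with the saved checksum.
fn reopen_for_resume(path: &str) -> std::io::Result<File> {
    OpenOptions::new().append(true).open(path)
}
```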

Also, I was personally thinking that some of the functions that change the compression state might be justified in being marked unsafe. They're helpful, but they seem prone to error if external state isn't handled correctly or if you don't know what you're doing. I'd love to hear your thoughts on that.

Thanks, @Some1and2-XC

Some1and2-XC avatar Apr 24 '25 03:04 Some1and2-XC