zstd-rs icon indicating copy to clipboard operation
zstd-rs copied to clipboard

Implement rudimentary compression support

Open zleyyij opened this issue 1 year ago • 5 comments

Pretty much what the title says.

This is far from complete, right now I only have

  • A bitwriter
  • Frame header encoding
  • Partial (nonfunctional) block header support

I don't know what the final interface is going to look like, but my series of goals looks roughly like:

  • [x] Finish up block header

  • [x] Implement raw blocks

  • [ ] At this point, add tests that integrate with the original zstd library to make sure the frame will decompress properly

  • ~~At this point, either reverse engineer the algorithms for the various compression levels (using the original and Go implementations as a reference), or add support for creating the rest of the block kinds.~~

This is mostly so that you can track my progress if you wish

zleyyij avatar Aug 01 '24 21:08 zleyyij

Very exciting to see some activity in this direction!

I realize this is pretty early but I have some experience in efficient encoding/decoding stuff:

I see you have a few functions following this pattern

pub fn serialize(self) -> Result<Vec<u8>, FrameHeaderError>

I would suggest to do it this way:

pub fn serialize(self, buf: &mut Vec<u8>) -> Result<(), FrameHeaderError>

The constant small allocations of serializing to a vec, returning it and concatenating them in one big buffer can get very expensive.

Aside from the small nitpick:

I think the last point in your series of goals is in itself a series of goals. The algorithms and especially the tuning for the different compression levels are a science on their own.

What I'd define as a good goal for this PR is up to and including raw blocks + testing/fuzzing against original zstd. This would allow full albeit inefiicient interop with other systems using the original/other zstd implementations.

KillingSpark avatar Aug 02 '24 11:08 KillingSpark

Thanks for your feedback :)

  • The creation of vectors was just sort of a quick and dirty way to work towards a functional example, after everything was working, I figured I could begin experimenting with passing mutable references and/or arena allocation with something like bumpalo. You're definitely correct though, and I probably should have just started with passing mutable vecs
  • Yeah, that should definitely be a separate PR, I'm expecting it to be significantly more work. I suck at reading the zstd code base, and I was never able to gain a great understanding of FSE encoding. I was originally planning to try and write my own algorithms for reaching various compression levels, but when I saw that the Go re-implementation of zstd just replicated the original source code, I figured I should do the same.

I'll aim for that goal, and try to keep you updated

zleyyij avatar Aug 02 '24 13:08 zleyyij

@KillingSpark

Just an update, I've discovered a distinction between the original decoder and yours when using my encoder.

When decoding with the original implementation, it fails on ./decodecorpus_files/z000002.

---- tests::encode_corpus::test_encode_corpus_files_uncompressed stdout ----
Trying file: "./decodecorpus_files/abc.txt"
Trying file: "./decodecorpus_files/z000000"
Trying file: "./decodecorpus_files/z000001"
Trying file: "./decodecorpus_files/z000002"
thread 'tests::encode_corpus::test_encode_corpus_files_uncompressed' panicked at src/tests/encode_corpus.rs:51:89:

Whereas your implementation makes it all the way to ./decodecorpus_files/z000062.

running 1 test
test tests::encode_corpus::test_encode_corpus_files_uncompressed ... FAILED

failures:

---- tests::encode_corpus::test_encode_corpus_files_uncompressed stdout ----
Trying file: "./decodecorpus_files/abc.txt"
Trying file: "./decodecorpus_files/z000000"
Trying file: "./decodecorpus_files/z000001"
Trying file: "./decodecorpus_files/z000002"
Trying file: "./decodecorpus_files/z000003"
Trying file: "./decodecorpus_files/z000004"
Trying file: "./decodecorpus_files/z000005"
Trying file: "./decodecorpus_files/z000006"
Trying file: "./decodecorpus_files/z000007"
Trying file: "./decodecorpus_files/z000008"
Trying file: "./decodecorpus_files/z000009"
Trying file: "./decodecorpus_files/z000010"
Trying file: "./decodecorpus_files/z000011"
Trying file: "./decodecorpus_files/z000012"
Trying file: "./decodecorpus_files/z000013"
Trying file: "./decodecorpus_files/z000014"
Trying file: "./decodecorpus_files/z000015"
Trying file: "./decodecorpus_files/z000016"
Trying file: "./decodecorpus_files/z000017"
Trying file: "./decodecorpus_files/z000018"
Trying file: "./decodecorpus_files/z000019"
Trying file: "./decodecorpus_files/z000020"
Trying file: "./decodecorpus_files/z000021"
Trying file: "./decodecorpus_files/z000022"
Trying file: "./decodecorpus_files/z000023"
Trying file: "./decodecorpus_files/z000024"
Trying file: "./decodecorpus_files/z000025"
Trying file: "./decodecorpus_files/z000026"
Trying file: "./decodecorpus_files/z000027"
Trying file: "./decodecorpus_files/z000028"
Trying file: "./decodecorpus_files/z000029"
Trying file: "./decodecorpus_files/z000030"
Trying file: "./decodecorpus_files/z000031"
Trying file: "./decodecorpus_files/z000032"
Trying file: "./decodecorpus_files/z000033"
Trying file: "./decodecorpus_files/z000034"
Trying file: "./decodecorpus_files/z000035"
Trying file: "./decodecorpus_files/z000036"
Trying file: "./decodecorpus_files/z000037"
Trying file: "./decodecorpus_files/z000038"
Trying file: "./decodecorpus_files/z000039"
Trying file: "./decodecorpus_files/z000040"
Trying file: "./decodecorpus_files/z000041"
Trying file: "./decodecorpus_files/z000042"
Trying file: "./decodecorpus_files/z000043"
Trying file: "./decodecorpus_files/z000044"
Trying file: "./decodecorpus_files/z000045"
Trying file: "./decodecorpus_files/z000046"
Trying file: "./decodecorpus_files/z000047"
Trying file: "./decodecorpus_files/z000048"
Trying file: "./decodecorpus_files/z000049"
Trying file: "./decodecorpus_files/z000050"
Trying file: "./decodecorpus_files/z000051"
Trying file: "./decodecorpus_files/z000052"
Trying file: "./decodecorpus_files/z000053"
Trying file: "./decodecorpus_files/z000054"
Trying file: "./decodecorpus_files/z000055"
Trying file: "./decodecorpus_files/z000056"
Trying file: "./decodecorpus_files/z000057"
Trying file: "./decodecorpus_files/z000058"
Trying file: "./decodecorpus_files/z000059"
Trying file: "./decodecorpus_files/z000060"
Trying file: "./decodecorpus_files/z000061"
Trying file: "./decodecorpus_files/z000062"
thread 'tests::encode_corpus::test_encode_corpus_files_uncompressed' panicked at src/tests/encode_corpus.rs:53:55:
called `Result::unwrap()` on an `Err` value: Custom { kind: Other, error: FailedToReadBlockHeader(ReadError(Error { kind: UnexpectedEof, message: "failed to fill whole buffer" })) }

I'm yet to investigate this very far, I'll keep you updated

zleyyij avatar Aug 14 '24 15:08 zleyyij

That is definitely interesting! If you need any help I'd be glad to assist. These kinds of differences might indicate a bug

KillingSpark avatar Aug 14 '24 17:08 KillingSpark

https://www.reddit.com/r/programming/comments/1evhaco/popular_introduction_to_huffman_arithmetic_ans/ This is a personal note for me i would like to look at this later

Also this https://fastcompression.blogspot.com/2014/01/fse-decoding-how-it-works.html

zleyyij avatar Aug 20 '24 01:08 zleyyij

Small progress update, all of the encode corpus tests for the (uncompressed) compressor are passing with the rust implementation, but fail on the original implementation.

All that was required from the initial change to make every test pass with the rust decompressor was adding edge case handling for compressing a completely empty frame.

zleyyij avatar Aug 28 '24 17:08 zleyyij

After some test refactoring so I can better see every test that fails,

Decompression of the compressed file fails on the following files: [("./decodecorpus_files/z000002", "Decompressor threw an error: Custom { kind: Other, error: \"Data corruption detected\" }")]

It appears that decompression is only failing on a single file 🎊. I'll look into it and see if I can figure it out.

zleyyij avatar Aug 28 '24 18:08 zleyyij

That's great!

KillingSpark avatar Sep 01 '24 13:09 KillingSpark

Ok, at this point I feel like I've exhausted my knowledge of simpler things that may be causing the issue, any help would be greatly appreciated :)

zleyyij avatar Sep 19 '24 16:09 zleyyij

I'll take a look this weekend :)

KillingSpark avatar Sep 20 '24 06:09 KillingSpark

I think I found the error. Your code seems to ignore this bit of the standard (https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#frame_content_size):

Frame_Content_Size format is little-endian. When FCS_Field_Size is 1, 4 or 8 bytes, the value is read directly. **When FCS_Field_Size is 2, the offset of 256 is added**. It's allowed to represent a small size (for example 18) using any compatible variant.

This file is the first to hit the size range that would fit into two bytes.

KillingSpark avatar Sep 21 '24 13:09 KillingSpark

That looks like it's it, thank you so much for the help :)

I'll work on implementing a fix, although it's currently being a migraine and a half.

zleyyij avatar Sep 22 '24 03:09 zleyyij

yeah zstd has some weird edges like this to save a few more bytes here and there

KillingSpark avatar Sep 22 '24 08:09 KillingSpark

This is amazing! I've been wanting to see a zstd encoder in Rust for a while know, sometimes I thought at some point I would dive into it myself, but I figured that is not really my area of expertise.

marvin-j97 avatar Sep 23 '24 20:09 marvin-j97

Ok, so I've done everything I set out to do in the initial PR, @KillingSpark

Thanks for your feedback so far, it's been a tremendous help :)

zleyyij avatar Sep 23 '24 21:09 zleyyij

Very nice!

KillingSpark avatar Sep 24 '24 04:09 KillingSpark

Just one thing that came to my mind: I think it would be pretty cool to expand the interop fuzzer to additionally use this encoder

KillingSpark avatar Sep 24 '24 04:09 KillingSpark

Done :)

zleyyij avatar Sep 24 '24 18:09 zleyyij

I should note, the error handling is pretty awful and I don't quite know how to do it correctly

zleyyij avatar Sep 25 '24 02:09 zleyyij

I'm not sure what error handling you mean? If you mean the error from creating the frame descriptor I think you should check the descriptor before creating the encoder so the encoder can assume a valid descriptor?

KillingSpark avatar Sep 25 '24 15:09 KillingSpark

I don't really propagate errors through the code because I didn't really know how, EG if function A calls function B, I don't really propagate B's error out, I call .unwrap().

zleyyij avatar Sep 25 '24 15:09 zleyyij

Normally error propagation is done via the ? operator. If you define your own errors there need to be From/Into implementations for the error type that A returns that convert an error from B to that type so that the ? can work properly

KillingSpark avatar Sep 27 '24 09:09 KillingSpark

Oh, I just realized I think I could have one of the fields of the outer error enum be the inner error enum, or maybe a box/dyn Error

zleyyij avatar Sep 29 '24 19:09 zleyyij

Once this is done, I think I'd like to work on an organizational refactor of the whole library from the outside in. I think the file structure could be re-organized to better handle both compression/decompression, and I want to reorganize the externally facing module tree, because I think the current interface is a little confusing.

When I take this on, I think I'll approach the module re-org by opening the docs.rs page and seeing how it would look from a perspective of someone who knew nothing about zstd and just wanted to compress/decompress a file or a stream.

I don't quite know how I'd like to shuffle files around to clean things up, I'll need to think more about that.

zleyyij avatar Sep 29 '24 21:09 zleyyij

Better handling is done, that CI failing is from an unrelated thing

zleyyij avatar Sep 29 '24 22:09 zleyyij

So this looks like a solid first step towards including basic but valid encoding into this project. There is always room for improvement but I think the API for the FrameEncoder looks good. Do you have any changes you want to include in this PR? Otherwise I'd go ahead and merge this :)

KillingSpark avatar Oct 02 '24 16:10 KillingSpark

The readme needs to be updated, I'll do that then it should be good to go!

zleyyij avatar Oct 02 '24 16:10 zleyyij

@KillingSpark done, should be ready to merge :)

zleyyij avatar Oct 02 '24 16:10 zleyyij

Hm actually this just came to my mind:

The only place that currently creates an error is the conversion of the frame header to its serialized representation because the validity is checked there. But the validity could be checked before serialization. Maybe the frameencoder could get a frameheader as a parameter and it only accepts valid frameheaders? Then serialization would be able to always succeed

KillingSpark avatar Oct 02 '24 17:10 KillingSpark

Because I believe that more errors will pop up as more of compression is implemented, I'd like to leave serialization as fallible, so then any more errors won't break the external interface.

We could probably implement some form of constant assertion that verifies that though if the provided frame header is 'static (it should be most of the time).

I'd like to keep the external interface for compression as simple as possible, users of the library shouldn't need to understand any internals of zstandard, they can just ask it to compress a blob/stream, and it gives them a stream/blob in return.

zleyyij avatar Oct 02 '24 17:10 zleyyij