Implement rudimentary compression support
Pretty much what the title says.
This is far from complete; right now I only have:
- A bitwriter (see the sketch after this list)
- Frame header encoding
- Partial (nonfunctional) block header support
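For a rough idea of the shape I'm going for, a minimal little-endian bitwriter could look something like this (just a sketch, not the actual code in this PR):

```rust
// Sketch only: queue bits LSB-first and pack them into bytes.
struct BitWriter {
    buf: Vec<u8>,
    bit_pos: usize, // bits already used in the last (partial) byte, 0..8
}

impl BitWriter {
    fn new() -> Self {
        BitWriter { buf: Vec::new(), bit_pos: 0 }
    }

    /// Append the lowest `count` bits of `value`, least significant bit first.
    fn write_bits(&mut self, mut value: u64, mut count: usize) {
        while count > 0 {
            if self.bit_pos == 0 {
                self.buf.push(0);
            }
            let free = 8 - self.bit_pos;
            let take = free.min(count);
            let mask = (1u64 << take) - 1;
            *self.buf.last_mut().unwrap() |= ((value & mask) as u8) << self.bit_pos;
            value >>= take;
            count -= take;
            self.bit_pos = (self.bit_pos + take) % 8;
        }
    }

    /// Return the written bytes; unused bits in the last byte are already zero.
    fn finish(self) -> Vec<u8> {
        self.buf
    }
}
```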
I don't know what the final interface is going to look like, but my series of goals looks roughly like:
- [x] Finish up block header
- [x] Implement raw blocks
- [ ] At this point, add tests that integrate with the original zstd library to make sure the frame will decompress properly
- ~~At this point, either reverse engineer the algorithms for the various compression levels (using the original and Go implementations as a reference), or add support for creating the rest of the block kinds.~~
This is mostly so that you can track my progress if you wish
Very exciting to see some activity in this direction!
I realize this is pretty early but I have some experience in efficient encoding/decoding stuff:
I see you have a few functions following this pattern:
`pub fn serialize(self) -> Result<Vec<u8>, FrameHeaderError>`
I would suggest doing it this way:
`pub fn serialize(self, buf: &mut Vec<u8>) -> Result<(), FrameHeaderError>`
The constant small allocations from serializing into a vec, returning it, and concatenating everything into one big buffer can get very expensive.
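To illustrate what I mean (a sketch with placeholder types and bytes, not your actual code): the caller owns a single output buffer and every `serialize` call appends into it, so a whole frame can be assembled without intermediate `Vec`s.

```rust
// Sketch only: FrameHeader is a stand-in, and the bytes written are
// placeholders rather than a valid zstd header.
pub struct FrameHeader {
    pub frame_content_size: u64,
}

#[derive(Debug)]
pub enum FrameHeaderError {
    InvalidHeader,
}

impl FrameHeader {
    pub fn serialize(self, buf: &mut Vec<u8>) -> Result<(), FrameHeaderError> {
        buf.push(0b0000_0000); // placeholder frame header descriptor byte
        buf.extend_from_slice(&self.frame_content_size.to_le_bytes());
        Ok(())
    }
}

// The caller reuses one buffer for the whole frame.
fn encode_frame(header: FrameHeader, out: &mut Vec<u8>) -> Result<(), FrameHeaderError> {
    header.serialize(out)?;
    // ...block headers and block data get appended to `out` here as well...
    Ok(())
}
```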
Aside from the small nitpick:
I think the last point in your series of goals is in itself a series of goals. The algorithms and especially the tuning for the different compression levels are a science on their own.
What I'd define as a good goal for this PR is up to and including raw blocks + testing/fuzzing against the original zstd. This would allow full, albeit inefficient, interop with other systems using the original/other zstd implementations.
Thanks for your feedback :)
- The creation of vectors was just sort of a quick and dirty way to work towards a functional example; after everything was working, I figured I could begin experimenting with passing mutable references and/or arena allocation with something like bumpalo. You're definitely correct though, and I probably should have just started with passing mutable vecs.
- Yeah, that should definitely be a separate PR; I'm expecting it to be significantly more work. I suck at reading the zstd code base, and I was never able to gain a great understanding of FSE encoding. I was originally planning to try and write my own algorithms for reaching various compression levels, but when I saw that the Go re-implementation of zstd just replicated the original source code, I figured I should do the same.
I'll aim for that goal, and try to keep you updated
@KillingSpark
Just an update: I've discovered a discrepancy between the original decoder and yours when using my encoder.
When decoding with the original implementation, it fails on ./decodecorpus_files/z000002.
---- tests::encode_corpus::test_encode_corpus_files_uncompressed stdout ----
Trying file: "./decodecorpus_files/abc.txt"
Trying file: "./decodecorpus_files/z000000"
Trying file: "./decodecorpus_files/z000001"
Trying file: "./decodecorpus_files/z000002"
thread 'tests::encode_corpus::test_encode_corpus_files_uncompressed' panicked at src/tests/encode_corpus.rs:51:89:
Whereas your implementation makes it all the way to ./decodecorpus_files/z000062.
running 1 test
test tests::encode_corpus::test_encode_corpus_files_uncompressed ... FAILED
failures:
---- tests::encode_corpus::test_encode_corpus_files_uncompressed stdout ----
Trying file: "./decodecorpus_files/abc.txt"
Trying file: "./decodecorpus_files/z000000"
Trying file: "./decodecorpus_files/z000001"
Trying file: "./decodecorpus_files/z000002"
Trying file: "./decodecorpus_files/z000003"
Trying file: "./decodecorpus_files/z000004"
Trying file: "./decodecorpus_files/z000005"
Trying file: "./decodecorpus_files/z000006"
Trying file: "./decodecorpus_files/z000007"
Trying file: "./decodecorpus_files/z000008"
Trying file: "./decodecorpus_files/z000009"
Trying file: "./decodecorpus_files/z000010"
Trying file: "./decodecorpus_files/z000011"
Trying file: "./decodecorpus_files/z000012"
Trying file: "./decodecorpus_files/z000013"
Trying file: "./decodecorpus_files/z000014"
Trying file: "./decodecorpus_files/z000015"
Trying file: "./decodecorpus_files/z000016"
Trying file: "./decodecorpus_files/z000017"
Trying file: "./decodecorpus_files/z000018"
Trying file: "./decodecorpus_files/z000019"
Trying file: "./decodecorpus_files/z000020"
Trying file: "./decodecorpus_files/z000021"
Trying file: "./decodecorpus_files/z000022"
Trying file: "./decodecorpus_files/z000023"
Trying file: "./decodecorpus_files/z000024"
Trying file: "./decodecorpus_files/z000025"
Trying file: "./decodecorpus_files/z000026"
Trying file: "./decodecorpus_files/z000027"
Trying file: "./decodecorpus_files/z000028"
Trying file: "./decodecorpus_files/z000029"
Trying file: "./decodecorpus_files/z000030"
Trying file: "./decodecorpus_files/z000031"
Trying file: "./decodecorpus_files/z000032"
Trying file: "./decodecorpus_files/z000033"
Trying file: "./decodecorpus_files/z000034"
Trying file: "./decodecorpus_files/z000035"
Trying file: "./decodecorpus_files/z000036"
Trying file: "./decodecorpus_files/z000037"
Trying file: "./decodecorpus_files/z000038"
Trying file: "./decodecorpus_files/z000039"
Trying file: "./decodecorpus_files/z000040"
Trying file: "./decodecorpus_files/z000041"
Trying file: "./decodecorpus_files/z000042"
Trying file: "./decodecorpus_files/z000043"
Trying file: "./decodecorpus_files/z000044"
Trying file: "./decodecorpus_files/z000045"
Trying file: "./decodecorpus_files/z000046"
Trying file: "./decodecorpus_files/z000047"
Trying file: "./decodecorpus_files/z000048"
Trying file: "./decodecorpus_files/z000049"
Trying file: "./decodecorpus_files/z000050"
Trying file: "./decodecorpus_files/z000051"
Trying file: "./decodecorpus_files/z000052"
Trying file: "./decodecorpus_files/z000053"
Trying file: "./decodecorpus_files/z000054"
Trying file: "./decodecorpus_files/z000055"
Trying file: "./decodecorpus_files/z000056"
Trying file: "./decodecorpus_files/z000057"
Trying file: "./decodecorpus_files/z000058"
Trying file: "./decodecorpus_files/z000059"
Trying file: "./decodecorpus_files/z000060"
Trying file: "./decodecorpus_files/z000061"
Trying file: "./decodecorpus_files/z000062"
thread 'tests::encode_corpus::test_encode_corpus_files_uncompressed' panicked at src/tests/encode_corpus.rs:53:55:
called `Result::unwrap()` on an `Err` value: Custom { kind: Other, error: FailedToReadBlockHeader(ReadError(Error { kind: UnexpectedEof, message: "failed to fill whole buffer" })) }
I've yet to investigate this very far; I'll keep you updated.
That is definitely interesting! If you need any help, I'd be glad to assist. These kinds of differences might indicate a bug.
https://www.reddit.com/r/programming/comments/1evhaco/popular_introduction_to_huffman_arithmetic_ans/ (This is a personal note for myself; I would like to look at this later.)
Also this: https://fastcompression.blogspot.com/2014/01/fse-decoding-how-it-works.html
Small progress update: all of the encode corpus tests for the (uncompressed) compressor are passing with the Rust implementation, but fail on the original implementation.
All that was required, beyond the initial change, to make every test pass with the Rust decompressor was adding edge-case handling for compressing a completely empty frame.
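For context, the fix boils down to something like this (a simplified sketch, not the exact code): even with no input, the frame still has to contain one final block, so a zero-length raw block with the Last_Block bit set gets emitted.

```rust
// Sketch: write the data blocks for a frame, handling the empty-input case.
fn write_blocks(input: &[u8], out: &mut Vec<u8>) {
    if input.is_empty() {
        // The 3-byte block header is little-endian:
        //   bit 0     = Last_Block
        //   bits 1-2  = Block_Type (0 = Raw)
        //   bits 3-23 = Block_Size
        // Last_Block = 1, Raw, size 0  =>  value 1  =>  bytes [0x01, 0x00, 0x00].
        out.extend_from_slice(&[0x01, 0x00, 0x00]);
        return;
    }
    // ...otherwise split `input` into raw blocks as before...
}
```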
After some test refactoring so I can better see every test that fails:
Decompression of the compressed file fails on the following files: [("./decodecorpus_files/z000002", "Decompressor threw an error: Custom { kind: Other, error: \"Data corruption detected\" }")]
It appears that decompression is only failing on a single file 🎊. I'll look into it and see if I can figure it out.
That's great!
Ok, at this point I feel like I've exhausted my knowledge of simpler things that may be causing the issue; any help would be greatly appreciated :)
I'll take a look this weekend :)
I think I found the error. Your code seems to ignore this bit of the standard (https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#frame_content_size):
> Frame_Content_Size format is little-endian. When FCS_Field_Size is 1, 4 or 8 bytes, the value is read directly. **When FCS_Field_Size is 2, the offset of 256 is added**. It's allowed to represent a small size (for example 18) using any compatible variant.
This file is the first to hit the size range that would fit into two bytes.
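In concrete terms (a sketch with made-up function names, not your code), the 2-byte case works like this:

```rust
// When FCS_Field_Size is 2, the stored value is the real size minus 256,
// so the 2-byte field covers content sizes 256..=65791.
fn write_fcs_2_bytes(frame_content_size: u64, out: &mut Vec<u8>) {
    debug_assert!((256..=65791).contains(&frame_content_size));
    let stored = (frame_content_size - 256) as u16;
    out.extend_from_slice(&stored.to_le_bytes());
}

fn read_fcs_2_bytes(raw: [u8; 2]) -> u64 {
    u16::from_le_bytes(raw) as u64 + 256
}

// e.g. a frame content size of 300 is stored as 44 (bytes 0x2C 0x00)
// and read back as 44 + 256 = 300.
```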
That looks like it's it, thank you so much for the help :)
I'll work on implementing a fix, although it's currently being a migraine and a half.
Yeah, zstd has some weird edge cases like this to save a few more bytes here and there.
This is amazing! I've been wanting to see a zstd encoder in Rust for a while now. Sometimes I thought that at some point I would dive into it myself, but I figured that it's not really my area of expertise.
Ok, so I've done everything I set out to do in the initial PR, @KillingSpark
Thanks for your feedback so far, it's been a tremendous help :)
Very nice!
Just one thing that came to my mind: I think it would be pretty cool to expand the interop fuzzer to additionally use this encoder
Done :)
I should note, the error handling is pretty awful and I don't quite know how to do it correctly
I'm not sure what error handling you mean. If you mean the error from creating the frame descriptor, I think you should check the descriptor before creating the encoder, so the encoder can assume a valid descriptor.
I don't really propagate errors through the code because I didn't really know how; e.g. if function A calls function B, I don't propagate B's error out, I just call `.unwrap()`.
Normally error propagation is done via the `?` operator. If you define your own errors, there need to be `From`/`Into` implementations that convert an error from B into the error type that A returns, so that `?` can work properly.
Oh, I just realized I could have one of the variants of the outer error enum hold the inner error enum, or maybe a `Box<dyn Error>`.
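Something like this, I think (made-up error names, just a sketch):

```rust
// The inner error enum becomes a variant of the outer one, a From impl does
// the conversion, and `?` then propagates B's error out of A automatically.
#[derive(Debug)]
enum BlockError {
    BlockSizeTooLarge,
}

#[derive(Debug)]
enum EncodeError {
    Block(BlockError),
}

impl From<BlockError> for EncodeError {
    fn from(e: BlockError) -> Self {
        EncodeError::Block(e)
    }
}

// "Function B"
fn encode_block(data: &[u8]) -> Result<(), BlockError> {
    if data.len() > 128 * 1024 {
        // zstd blocks are capped at 128 KB
        return Err(BlockError::BlockSizeTooLarge);
    }
    Ok(())
}

// "Function A": `?` converts BlockError into EncodeError via the From impl.
fn encode_frame(data: &[u8]) -> Result<(), EncodeError> {
    encode_block(data)?;
    Ok(())
}
```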
Once this is done, I think I'd like to work on an organizational refactor of the whole library from the outside in. I think the file structure could be re-organized to better handle both compression/decompression, and I want to reorganize the externally facing module tree, because I think the current interface is a little confusing.
When I take this on, I think I'll approach the module re-org by opening the docs.rs page and seeing how it would look from a perspective of someone who knew nothing about zstd and just wanted to compress/decompress a file or a stream.
I don't quite know how I'd like to shuffle files around to clean things up, I'll need to think more about that.
Better error handling is done; the failing CI is from an unrelated thing.
So this looks like a solid first step towards including basic but valid encoding into this project. There is always room for improvement but I think the API for the FrameEncoder looks good. Do you have any changes you want to include in this PR? Otherwise I'd go ahead and merge this :)
The readme needs to be updated; I'll do that, then it should be good to go!
@KillingSpark done, should be ready to merge :)
Hm, actually this just came to my mind:
The only place that currently creates an error is the conversion of the frame header to its serialized representation, because the validity is checked there. But the validity could be checked before serialization. Maybe the `FrameEncoder` could take a `FrameHeader` as a parameter and only accept valid frame headers? Then serialization would always be able to succeed.
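Roughly this shape (a sketch with made-up fields and bounds, not a concrete proposal):

```rust
// If the only way to construct a FrameHeader is through a checked constructor,
// everything downstream can treat it as valid and serialize infallibly.
#[derive(Debug)]
pub enum FrameHeaderError {
    InvalidWindowSize,
}

pub struct FrameHeader {
    window_size: u64,
}

impl FrameHeader {
    /// Validation happens once, here.
    pub fn new(window_size: u64) -> Result<Self, FrameHeaderError> {
        // Example bound only, not the exact limits from the spec.
        if window_size < 1024 {
            return Err(FrameHeaderError::InvalidWindowSize);
        }
        Ok(FrameHeader { window_size })
    }

    /// Serialization can no longer fail, because invalid headers cannot exist.
    pub fn serialize(&self, buf: &mut Vec<u8>) {
        buf.extend_from_slice(&self.window_size.to_le_bytes()); // placeholder encoding
    }
}

pub struct FrameEncoder {
    header: FrameHeader,
}

impl FrameEncoder {
    /// Taking a FrameHeader by value means it has already been validated.
    pub fn new(header: FrameHeader) -> Self {
        FrameEncoder { header }
    }
}
```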
Because I believe more errors will pop up as more of compression is implemented, I'd like to leave serialization as fallible, so that any new errors won't break the external interface.
We could probably implement some form of compile-time assertion that verifies that, though, if the provided frame header is `'static` (it should be most of the time).
I'd like to keep the external interface for compression as simple as possible: users of the library shouldn't need to understand any internals of zstandard; they can just ask it to compress a blob/stream, and it gives them a stream/blob in return.