K4os.Compression.LZ4

Support for `ReadOnlySequence<byte>` and `IBufferWriter<byte>`

Open AArnott opened this issue 6 years ago • 27 comments

I'm happy to see support for ReadOnlySpan<byte> already in the library.

I want to be able to effectively stream compression/decompression, but without using the Stream class. That is, I don't want to have to allocate contiguous memory for the input and output of the algorithm.

If the algorithm would accept a ReadOnlySequence<byte> as input, and write its output to a given IBufferWriter<byte> instance, then I could, for instance, process 1GB of data without having to find two 1GB blocks of contiguous memory for the input and output arrays. Instead, these two types allow breaking the input/output up into much more conveniently sized blocks.

Can such support be added? I can already do this on top of your Stream APIs, but doing it natively would presumably be at least a bit more efficient, and possibly(?) avoid the extra dependency you have on your stream-supporting package.
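
For illustration only, the requested API shape might look something like this (hypothetical names, not anything the library currently exposes):

using System.Buffers;

// Hypothetical API shape: read from a possibly fragmented sequence and
// write into whatever buffers the caller's IBufferWriter hands out, so
// neither side needs a single contiguous allocation.
public interface ILZ4BufferCoder
{
    void Encode(ReadOnlySequence<byte> source, IBufferWriter<byte> target);
    void Decode(ReadOnlySequence<byte> source, IBufferWriter<byte> target);
}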

AArnott avatar Dec 27 '18 21:12 AArnott

I will take a look. I was thinking about adding some lightweight stream-like abstraction, but considered it not-so-urgent as Stream IS an abstraction.

It won't be trivial though. At any point in time, the last 64k of uncompressed data needs to be accessible at a fixed address in a contiguous block of memory.

MiloszKrajewski avatar Jan 02 '19 12:01 MiloszKrajewski

Thanks.

At any point in time, the last 64k of uncompressed data needs to be accessible at a fixed address in a contiguous block of memory.

Interesting. ReadOnlySequence<byte> certainly wouldn't satisfy that guarantee, as the sequence could be blocks of any size. If they weren't of the required size you'd have to copy the bytes into a (reusable) 64KB array. I'm curious though why this is a requirement. Is it something we can fix (in a performant way) or is it in some native library that can't change?

Another idea is to still support ReadOnlySequence<byte> but document that unless the blocks are at least 64KB each, extra copying will occur. At least with the types that I'm using to create these sequences, I can control the block size to ensure they are all at least 64KB.

AArnott avatar Jan 02 '19 16:01 AArnott

Please note that having non-contiguous blocks of any size will not guarantee that "at any point in time, the last 64k of uncompressed data is accessible at a fixed address in a contiguous block of memory". For example, even with 1MB blocks, after writing 1MB+1 byte of data the last 64k is no longer in a contiguous block of memory. So copying into a reusable sliding buffer is the best option, and that's what LZ4ChainDecoder, LZ4ChainFastEncoder and LZ4ChainHighEncoder are doing. The complexity is just hidden.

MiloszKrajewski avatar Jan 02 '19 17:01 MiloszKrajewski

I see. Well, documenting the anticipated perf hit from the extra memory copying when using a ReadOnlySequence<byte> backed by anything but a single array seems reasonable. I'd still prefer that to the requirement of allocating a single array of arbitrary length. But hopefully the perf impact can be relatively minor.

AArnott avatar Jan 02 '19 17:01 AArnott

Regarding IBufferWriter, see #47

geeknoid avatar Dec 29 '20 21:12 geeknoid

Since #47 was reverted, I've tried again with #50, hopefully it'll work out better this time around :-)

geeknoid avatar Jan 20 '21 15:01 geeknoid

I have a feeling that by saying "I want to be able to effectively stream compression/decompression, but without using the Stream class", @AArnott meant using ReadOnlySequence and IBufferWriter for a streaming protocol, with ReadOnlySequence being a kind of InputStream and IBufferWriter being an OutputStream.

@geeknoid So your changes to the pickler might improve performance (if used properly) but do not address the problem stated in this ticket.

The problem is baked into the design of "pickled" data - it is a self-sufficient, encapsulated datagram. I have a hard time seeing how the new APIs you wrote could be used effectively.

MiloszKrajewski avatar Jan 23 '21 22:01 MiloszKrajewski

IBufferWriter is the modern efficient way to produce data without requiring the callee to allocate a buffer. Look at how serialization code works in https://github.com/protobuf-net/protobuf-net, https://github.com/neuecc/MessagePack-CSharp, and https://github.com/protocolbuffers/protobuf/tree/master/csharp. The modern additions to these packages all leverage IBufferWriter for the serialization process, and either ReadOnlyMemory or ReadOnlySpan for the deserialization process.

I've composed layers on top of this new pickling code, combined with serialization and encryption. So in one call, the user does:

myCache.Set(myValue);

And internally Set will serialize the value, then compress it, then encrypt it. All without any allocations or data copying. It starts at the top with a pool of IBufferWriter implementations. The Set call takes a BufferWriter from the pool and passes it to the serializer, which outputs into the preallocated buffer. This buffer is given to the compressor, which outputs to another pooled BufferWriter. That buffer is then passed to the encryption logic, which also outputs to a pooled IBufferWriter. It all composes very well indeed.

So basically, any API where you are tempted to return an array of something to the caller is better implemented by accepting an IBufferWriter where the data is put, and then returning void to the caller.
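
As a rough sketch of that pattern (illustrative names only; the CopyTo call stands in for real compression):

using System;
using System.Buffers;

// The API accepts an IBufferWriter<byte> instead of returning a byte[].
// The caller controls where the memory comes from (e.g. a pooled writer).
static void Compress(ReadOnlySpan<byte> input, IBufferWriter<byte> output)
{
    Span<byte> destination = output.GetSpan(input.Length);  // rent space
    input.CopyTo(destination);                               // placeholder for real compression
    output.Advance(input.Length);                            // commit what was written
}

// Usage with the built-in ArrayBufferWriter<byte>:
var writer = new ArrayBufferWriter<byte>();
Compress(new byte[] { 1, 2, 3 }, writer);
ReadOnlyMemory<byte> compressed = writer.WrittenMemory;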

geeknoid avatar Jan 25 '21 19:01 geeknoid

Yes, @geeknoid has got the idea for writing. And @MiloszKrajewski is right on.

For reading, ReadOnlyMemory<byte> and ReadOnlySpan<byte> support is good, but ReadOnlySequence<byte> support is even better (perhaps in addition to ReadOnlySpan<byte> support) because then I can deserialize a huge data stream without having it be in one huge block in memory (which, especially in a 32-bit process, can be hard to come by). A Stream type would theoretically also work, in that it allows a streaming read, but it comes with allocations that we may be able to avoid. And unless your deserializer is async, the Stream's async functions (which might make it preferable over ReadOnlySequence<byte>) aren't useful.

From the foregoing discussion, it seems it would likely be easier to compress/decompress into an IBufferWriter<byte> than to compress/decompress from a ReadOnlySequence<byte>, precisely because there is no data contiguity guarantee. But if 64K blocks must be contiguous, it seems that a single 64K block could be allocated (and perhaps shared between uses) and initialized with something like seq.Slice(offset, 64000).CopyTo(_64block), then periodically refilled with later slices. In fact, you could even check and avoid the copy when seq.Slice(...).IsSingleSegment == true, since you already have contiguous memory you can point directly into.
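
A minimal sketch of that refill idea, assuming a reusable 64K scratch buffer (illustrative only; the library's chain encoders/decoders hide a similar sliding buffer internally):

using System;
using System.Buffers;

// Returns a contiguous view of seq[offset, offset + length), copying into
// the reusable scratch buffer only when the slice spans multiple segments.
static ReadOnlySpan<byte> GetContiguousWindow(
    ReadOnlySequence<byte> seq, long offset, int length, byte[] scratch)
{
    ReadOnlySequence<byte> slice = seq.Slice(offset, length);
    if (slice.IsSingleSegment)
        return slice.First.Span;          // already contiguous, no copy needed

    slice.CopyTo(scratch);                // fall back to the shared 64K buffer
    return scratch.AsSpan(0, length);
}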

AArnott avatar Jan 26 '21 01:01 AArnott

I was looking at this today. Making compression/decompression to/from IBufferWriter and ReadOnlySequence would require duplicating the whole streaming logic, or creating some streaming abstraction and having two implementations: one for Stream and one for ReadOnlySequence. That is a little bit too much for this stage.

I was thinking about creating an adapter to make a ReadOnlySequence look like a Stream. Yes, this solution comes with one level of indirection and additional memory allocations, but it would provide an interface which can be used while waiting for a better implementation.

This felt like a relatively simple thing to do, so I decided to check whether it already exists... And, yup, I found it - you wrote it :-)

For those who came here for a solution, here is a link to @AArnott's library: https://github.com/AArnott/Nerdbank.Streams

So I'm parking this ticket, for now. I do understand it could be much faster with native support but for now your library at least enables such usage.
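
For example, something along these lines should work in the meantime (assuming Nerdbank.Streams' AsStream() adapter for ReadOnlySequence<byte> and the existing LZ4Stream.Decode; GetCompressedSequence is a placeholder for wherever the data actually comes from):

using System.Buffers;
using System.IO;
using K4os.Compression.LZ4.Streams;
using Nerdbank.Streams;

// Present the (possibly fragmented) compressed input as a read-only Stream,
// then let the existing stream-based decoder do the rest.
ReadOnlySequence<byte> compressed = GetCompressedSequence();   // placeholder source
using Stream source = compressed.AsStream();                   // Nerdbank.Streams adapter
using Stream decoded = LZ4Stream.Decode(source);
using FileStream output = File.Create("output.bin");
decoded.CopyTo(output);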

MiloszKrajewski avatar Feb 06 '21 18:02 MiloszKrajewski

I have made a prototype of this here: https://gist.github.com/rmja/98dc7e0576c933faa0a75629b46af71c Let me know if you want this in a PR somewhere.

rmja avatar Sep 02 '22 04:09 rmja

@rmja I'd like to consume this in MessagePack-CSharp, perhaps among other places. I'm not sure at this point where the LZ4 implementation that's copied as source into messagepack came from. Maybe from this repo? Anyway, if you have the time to look it over and possibly send a PR there, I'd love to discuss it with you. Maybe open an issue against the messagepack github repo so we can talk details?

Ideally your code can merge into this repo as well to help many others.

AArnott avatar Sep 04 '22 15:09 AArnott

Yeah, it would be really nice to have it here. @MiloszKrajewski What would your proposal be - should we create a K4os.Compression.LZ4.Buffers project? Alternatively it could be put in the core project, but that would add a dependency on K4os.Hash.xxHash.

rmja avatar Sep 05 '22 06:09 rmja

I am working on a generic approach, which is taking some time.

  1. Generic, so it won't need a separate implementation for other stream-like objects (and anything remotely looking like .GimmeNextBytes(N) and .HereAreSomeBytes(B) is a stream, right?).

  2. Generic, so it might be slower than yours.

I will return to it this weekend, I guess; I will finish it and do some performance testing. Depending on how much slower my generic solution turns out to be (I assume it must be slower, just not sure by how much), I will make a call: generic or dedicated.

MiloszKrajewski avatar Sep 07 '22 18:09 MiloszKrajewski

@MiloszKrajewski An API that fits your two methods is PipeReader. If the LZ4 algorithm could operate based on that (and PipeWriter) it would be able to interop with many modern APIs and be truly 'streaming'. But these APIs include async calls (ReadAsync and FlushAsync), and you'd want to bubble that up to the top-level public APIs.

AArnott avatar Sep 07 '22 19:09 AArnott

I do not think that we need a home-made generic abstraction. The implementation on top of ILZ4Decoder, etc. is straightforward. Combine that with a Pipe, and it could maybe be the shim for the streams?

rmja avatar Sep 07 '22 19:09 rmja

@MiloszKrajewski An API that fits your two methods is PipeReader. If the LZ4 algorithm could operate based on that (and PipeWriter) it would be able to interop with many modern APIs and be truly 'streaming'. But these APIs include async calls (ReadAsync and FlushAsync), and you'd want to bubble that up to the top-level public APIs.

PipeReader sounds exactly like what I need, but with 3 "buts":

  • I need it in .NET Standard 1.6 (am I ready to drop compatibility? maybe... I'm not sure),
  • it is relatively big (lots of methods to fill in, while I need 1 or 2),
  • I need Stream to also be a PipeReader (and I guess PipeWriter)

As bad as it sounds, "homemade" is the current state of the game.

public interface IStreamWriter
{
   int Write(byte[] buffer, int offset, int length);
   ValueTask<int> WriteAsync(byte[] buffer, int offset, int length, CancellationToken token);
   void AfterWrite(int written);
}

NOTE: it is an interface, but designed to be implemented as a struct (hence AfterWrite, to allow post-WriteAsync updates, as structs and async don't play together well).

MiloszKrajewski avatar Sep 07 '22 21:09 MiloszKrajewski

I need it in .NET Standard 1.6

There is no Microsoft-supported runtime that supports .NET Standard 1.6 that doesn't also support .NET Standard 2.0 at this point. Check out this table. So AFAIK there is absolutely no reason to target anything less than .NET Standard 2.0.

I need Stream to also be a PipeReader (and I guess PipeWriter)

I don't understand this point. If you mean you need an adapter between the PipeReader and Stream, those can be made. And in fact the Nerdbank.Streams nuget package (that I own) already implements these adapters. System.IO.Pipelines itself comes with a stream-as-PipeReader adapter nowadays too.
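
For reference, a minimal sketch of that built-in adapter (PipeReader.Create over a Stream ships with System.IO.Pipelines; the file name is just an example):

using System.IO;
using System.IO.Pipelines;

// Built-in stream-to-PipeReader adapter from System.IO.Pipelines:
using FileStream file = File.OpenRead("input.lz4");
PipeReader reader = PipeReader.Create(file);
// ... hand 'reader' to anything that consumes a PipeReader ...
await reader.CompleteAsync();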

As bad as it sounds, "homemade" is the current state of the game.

How is your IStreamWriter better than IBufferWriter<byte>? IMO it must be substantially better to justify not reusing a type that is already prolific in the ecosystem. In MessagePack we use a struct as the main currency for writing, but internally that struct has a field that is an IBufferWriter<byte>. This allows reuse of memory for amortized alloc-free writing, while still leveraging a common interface type that folks are likely already using. I don't understand how AfterWrite is useful. And I'm not sure why you need Write and WriteAsync on the same interface when the interface is new, unless perhaps you'll have a caller that is synchronous. But that raises the question of how the same 'struct' that implements that interface can properly support both.
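
Sketched very roughly, that struct-over-IBufferWriter pattern looks like this (illustrative only, not MessagePack's actual writer):

using System;
using System.Buffers;

// A small writer struct that fronts an IBufferWriter<byte>: callers get
// span-based writes while the underlying writer supplies (pooled) memory.
public ref struct PayloadWriter
{
    private readonly IBufferWriter<byte> _output;

    public PayloadWriter(IBufferWriter<byte> output) => _output = output;

    public void WriteRaw(ReadOnlySpan<byte> bytes)
    {
        Span<byte> span = _output.GetSpan(bytes.Length);   // request space
        bytes.CopyTo(span);
        _output.Advance(bytes.Length);                      // commit
    }
}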

AArnott avatar Sep 07 '22 22:09 AArnott

@AArnott @MiloszKrajewski I have updated the gist to include the following method in the decoder:

ValueTask DecodeAsync(PipeReader compressed, IBufferWriter<byte> decompressed, CancellationToken cancellationToken = default)

It now fully supports streaming through a Pipe.
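
A usage sketch against that signature (the decodeAsync delegate stands in for the gist's method; everything else is standard BCL plumbing):

using System;
using System.Buffers;
using System.IO;
using System.IO.Pipelines;
using System.Threading;
using System.Threading.Tasks;

static async Task<ReadOnlyMemory<byte>> DecodeFileAsync(
    Func<PipeReader, IBufferWriter<byte>, CancellationToken, ValueTask> decodeAsync,
    string path,
    CancellationToken cancellationToken = default)
{
    // Feed compressed bytes in through a PipeReader and collect the
    // decompressed payload in an IBufferWriter<byte>.
    using FileStream file = File.OpenRead(path);
    PipeReader compressed = PipeReader.Create(file);
    var decompressed = new ArrayBufferWriter<byte>();

    await decodeAsync(compressed, decompressed, cancellationToken);
    await compressed.CompleteAsync();

    return decompressed.WrittenMemory;
}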

rmja avatar Sep 13 '22 09:09 rmja

@rmja, that looks terrific. Since we're already willing to be async, can we switch from IBufferWriter<byte> to PipeWriter and call await FlushAsync periodically? That way, we don't have to allocate memory for the entire decompressed data. Getting to IBufferWriter<byte> was a good start, since it means we don't have to allocate contiguous memory for the entire data, but allowing time to flush the decompressed data to disk (or the network) during decompression can be a win as well.
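
The general shape of that is something like the following (not library code, just the standard PipeWriter pattern with periodic flushing):

using System;
using System.IO.Pipelines;
using System.Threading.Tasks;

// Write one decompressed block into the PipeWriter, then flush so the
// consumer can drain it; the full payload never has to sit in memory.
static async Task<bool> WriteBlockAsync(PipeWriter output, ReadOnlyMemory<byte> block)
{
    Memory<byte> destination = output.GetMemory(block.Length);
    block.CopyTo(destination);
    output.Advance(block.Length);

    FlushResult result = await output.FlushAsync();
    return !result.IsCompleted;   // false means the reader has stopped listening
}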

AArnott avatar Sep 13 '22 15:09 AArnott

I will do that. Wouldn't it be a good idea to put this in the streams lib and then maybe wrap the stream implementation around it? Just to get an idea about how to proceed with a PR.

rmja avatar Sep 13 '22 15:09 rmja

put this in the streams lib

Are you talking about Nerdbank.Streams? If so, that library has no dependency on LZ4, and many users who wouldn't want that dependency. Getting this into MessagePack would be very interesting, and I'd like to do that regardless. But its LZ4 implementation will be internal, and anyway that lib is scoped to messagepack users. So the best place for your code IMO is either this library, or if that can't happen, maybe a new library?

AArnott avatar Sep 13 '22 16:09 AArnott

I was talking about https://github.com/MiloszKrajewski/K4os.Compression.LZ4/tree/master/src/K4os.Compression.LZ4.Streams 😁

rmja avatar Sep 13 '22 16:09 rmja

My idea is that this pipe work could maybe replace the current internals of that lib, and then expose both the pipe fundamentals and the current streams wrapped with a pipe.

rmja avatar Sep 13 '22 16:09 rmja

Ah, ok. That sounds great. I'll have to let @MiloszKrajewski decide on that.

the current streams wrapped with a pipe.

Or possibly vice versa: implement with a pipe at its core, and wrap with streams to support the existing Stream-based API. Pipes can be used with higher perf than streams, so this might allow you to maintain fewer copies of the code while keeping perf at or near max for everyone.
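
A small sketch of that layering (AsStream() on PipeReader/PipeWriter is part of System.IO.Pipelines):

using System.IO;
using System.IO.Pipelines;

// With a pipe at the core, the existing Stream API can stay as a thin wrapper:
var pipe = new Pipe();
Stream writeSide = pipe.Writer.AsStream();   // Stream-based callers write here
PipeReader readSide = pipe.Reader;           // pipe-native code reads directly
// (PipeReader.AsStream() covers the reading direction the same way.)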

AArnott avatar Sep 13 '22 16:09 AArnott

Yeah, exactly. That was my point. @MiloszKrajewski is it OK if I come up with a proposal for this that we can use as a basis for a discussion? I think that work from here on is better done in a draft PR than in a gist.

rmja avatar Sep 13 '22 16:09 rmja

@AArnott @MiloszKrajewski I have now added https://github.com/MiloszKrajewski/K4os.Compression.LZ4/pull/74 which should be somewhat feature complete. Let me know what you think.

rmja avatar Sep 15 '22 03:09 rmja

I think a lot of it is addressed in 1.3.0-beta. I know you did a lot of good work in #74, but you got caught in the middle of a very long feature branch and major refactor. Check how the new APIs fit your purpose. There are a lot of changes around streaming from different sources and to different targets (Span, Sequence, Memory, BufferWriter, Pipe, etc.).

MiloszKrajewski avatar Nov 04 '22 22:11 MiloszKrajewski

I'm closing this issue as addressed with 1.3.3. This ticket was very valuable, and I guess it can spawn a discussion about a rewrite. But for now: the library does support ReadOnlySequence<byte> and IBufferWriter<byte> as source/target for encoding/decoding, so the goal has been achieved.

MiloszKrajewski avatar Nov 06 '22 21:11 MiloszKrajewski

@MiloszKrajewski I cannot really verify your work in my project for now, as I rely on support for content size, block checksum, and content checksum. I will try to evaluate it when support for those gets in.

rmja avatar Nov 07 '22 06:11 rmja