lz4_stream Some points I found while reading the code

Some points I found while reading the code

Open nh2 opened this issue 4 years ago • 3 comments

Hi, thanks for your project!

I considered using it but decided to pipe out to the lz4 utility instead (for now). Yet, I wanted to write downs some points so that the time I spent reading its code does not go unused.

(I maintain the Haskell binding to lz4frame.h and thus am familiar with the fact that the lz4 API can be tricky, and I'm also interested in double-checking whether my understanding is correct to improve my own library if they aren't.)

These are the points/questions:

Potential memory leak: The constructor output_buffer(std::ostream &sink) seems to be able to throw in the last function it calls, write_header();. When that happens, LZ4F_createCompressionContext() will have malloc'ed a ctx_ and the destructor will not be called to free it. Now I'm not very sure whether write_header() will throw beyond the explicit throw that the library does, which should not happen unless there's a bug. Is it possible for sink_.write() to throw in any situation, or will failures (such as bad_alloc, disk full, TCP client closed the connection of the downstream stream) always be silent? In any case, it may make sense to make a guaranteed non-throwing write_header() for use in the constructor, and ensure all control flow paths out of the constructor past the ctx_ creation call LZ4F_freeCompressionContext(ctx_).
The implementation seems to do a multiple function calls and a stream yield for each incoming byte (overflow() uses pbump(1)), which seems extremely inefficient given that overflow() is virtual. Is this correct, or am I missing something?
The 256 bytes buffer by default seems quite small. What motivated that value?
In this line https://github.com/laudrup/lz4_stream/blob/6b015cbe786291733b6f3aa03e45d307bc4ae527/include/lz4_stream.h#L120 it's using sink_.write() even though the output buffer may be empty (somewhat inefficient), because the lz4 context has an internal buffer. Would it make sene to skip the write() in that case?
In this line https://github.com/laudrup/lz4_stream/blob/6b015cbe786291733b6f3aa03e45d307bc4ae527/include/lz4_stream.h#L65 it might be beneficial to + LZ4F_HEADER_SIZE_MAX for efficiency, to ensure both the header and the first chunk fit into the allocated buffer (like I do in Haskell here).
The examples like lz4_compress.cpp don't do any error handling, like complaining when the input file does not exist or when the output file cannot be written to. They exhibit somewhat funny behaviour, such as that ./lz4_compress nonexistent outfile generates some bytes into out even though the input file does not exist, and it starts reading the input file even after the openat() syscall has already failed (checked with strace). While that's not a real problem, it can be confusing if you're playing the the example and thinking that it's working, when silent failure is actually happening.

Thanks for taking a look!

Feb 11 '20 23:02 nh2

Hi Niklas,

Thanks a lot for your comments. It's very nice to have someone look at your code especially when they can come with such useful comments like you do.

That's is correct and something I remember I wanted to look into but never got around to it. It might be possible to fix by wrapping ctx_ in a unique_ptr, but I would have to look into the details.
I don't think you're missing anything. It does indeed sound very inefficient which is especially problematic considering lz4 primary design goal i performance. That should definitely be fixed.
Nothing motivated that value. I have no idea why I chose that size, no benchmarking or anything has been done to chose that.
Sounds reasonable. Would be extra nice if the unittest could be extended to handle that case.
Also sounds very reasonable.
I guess I just wanted the example programs to be as simple as possible, but you are quite right that they might be a bit too simple :-) Failing silently is never good.

Thanks once again for some very, very useful comments and thanks even more for providing a pull request (which I'll look into shortly).

I cannot promise when I'll have time to look into it, so I'll be very happy to receive more pull requests.

Kind regards,

Kasper

Feb 15 '20 10:02 laudrup

Thanks for you reply!

It might be possible to fix by wrapping ctx_ in a unique_ptr, but I would have to look into the details.

When I made PR #6 I also first tried with unique_ptr, but I realised it does not really work nicely because the LZ4F_decompressionContext_t is already the pointer type and the T in unique_ptr<T> would have to contain what that's pointed to (struct LZ4F_dctx_s), which would mean operating on things that don't look like part of the familar public LZ4 API. So I decided to do the simple explicit RAII approach instead with the added decompression_context.

I don't think you're missing anything. It does indeed sound very inefficient which is especially problematic considering lz4 primary design goal i performance. That should definitely be fixed.

Do you have a good resource to learn exhaustively about how C++ streams work? I understand the part that lz4_stream currently uses, but in a somewhat related project, where we try to combine streams of boost_archive and boost_process to pipe a serialisation stream into a stream that spawns the lz4 command line utility, we get odd exceptions in sync() and strace shows a write(fd=-1, "", count=0) = EBADF syscall, which seems clearly wrong. Most resources I've found so far just talk about stream basics and do not fully describe how one writes a very good, high-performant stream implementation.

Nothing motivated that value. I have no idea why I chose that size, no benchmarking or anything has been done to chose that.

OK. That may be interesting to look into in the future.

There are 2 things to consider:

The buffer size interfacing with the C++ streaming framework.
The buffer size interfacing with lz4.

You currently use SrcBufSize = 256 for both; it determines the src_buf_.size() which is then given to both setp() and dest_buf_(LZ4F_compressBound(src_buf_.size(), ...)). It would theoretically also be possible to decouple the two, using different sizes for those two concepts, but I don't know whether it would be beneficial.

For (1), the main data I've found so far is https://stackoverflow.com/questions/12997131/stdfstream-buffering-vs-manual-buffering-why-10x-gain-with-manual-buffering, where the second graph shows that < 1024 is very fast. One of the answers explains that this is due to hardcoded constants and the syscall implementaiton in the fstream implementation; though I do not fully follow the second answer (it suggests that the way writev() is used changes but doesn't really seem clear to me, nor could I find it in the code).

For (2), I don't know yet how the lz4 performance changes with buffer size and what the call overhead is, but I imagine that call overhead can be significant at 256 bytes buffers because the function usually lives in liblz4.so so no inlining is done and it's a real jump.

Sounds reasonable. Would be extra nice if the unittest could be extended to handle that case.

:heavy_check_mark:

Also sounds very reasonable.

:heavy_check_mark:

I guess I just wanted the example programs to be as simple as possible, but you are quite right that they might be a bit too simple :-) Failing silently is never good.

PR #6 fixes the silent failures by use of stream.exceptions(badbit).

It's still not great because C++'s stream exception handling is pretty poor on error reporting, e.g. a no space left on device manifests only as:

 % ./lz4_dropin < /bigfile > ramdisk/out.lz4 
terminate called after throwing an instance of 'std::__ios_failure'
  what():  basic_ios::clear: iostream error

It seems quite difficult to reliably get errno out of C++, and many functions don't even document whether or not it is set.

Since #6 addresses 2 of the 6 points, I propose we keep this issue open and post updates/info on the other points as we find them.

Feb 15 '20 15:02 nh2

The 256 bytes buffer by default seems quite small.

I've done a super quick unsophisticated test with /usr/bin/time and lz4_dropin, which suggests that

compression is 10% slower than lz4
decompression is 2x slower than lz4

If I set the buffers to 16 KiB like

template <size_t SrcBufSize = (4096*4)>
class basic_ostream : public std::ostream

template <size_t SrcBufSize = (4096*4), size_t DestBufSize = (4096*4)>
class basic_istream : public std::istream

then the performance difference is almost completely gone.

Feb 15 '20 18:02 nh2

lz4_stream lz4_stream copied to clipboard

Some points I found while reading the code

lz4_stream
lz4_stream copied to clipboard