lz4_stream
lz4_stream copied to clipboard
Some points I found while reading the code
Hi, thanks for your project!
I considered using it but decided to pipe out to the lz4
utility instead (for now). Yet, I wanted to write downs some points so that the time I spent reading its code does not go unused.
(I maintain the Haskell binding to lz4frame.h
and thus am familiar with the fact that the lz4 API can be tricky, and I'm also interested in double-checking whether my understanding is correct to improve my own library if they aren't.)
These are the points/questions:
- Potential memory leak:
The constructor
output_buffer(std::ostream &sink)
seems to be able to throw in the last function it calls,write_header();
. When that happens,LZ4F_createCompressionContext()
will have malloc'ed actx_
and the destructor will not be called to free it. Now I'm not very sure whetherwrite_header()
will throw beyond the explicitthrow
that the library does, which should not happen unless there's a bug. Is it possible forsink_.write()
to throw in any situation, or will failures (such asbad_alloc
, disk full, TCP client closed the connection of the downstream stream) always be silent? In any case, it may make sense to make a guaranteed non-throwingwrite_header()
for use in the constructor, and ensure all control flow paths out of the constructor past thectx_
creation callLZ4F_freeCompressionContext(ctx_)
. - The implementation seems to do a multiple function calls and a stream yield for each incoming byte (
overflow()
usespbump(1)
), which seems extremely inefficient given thatoverflow()
is virtual. Is this correct, or am I missing something? - The 256 bytes buffer by default seems quite small. What motivated that value?
- In this line https://github.com/laudrup/lz4_stream/blob/6b015cbe786291733b6f3aa03e45d307bc4ae527/include/lz4_stream.h#L120 it's using
sink_.write()
even though the output buffer may be empty (somewhat inefficient), because the lz4 context has an internal buffer. Would it make sene to skip thewrite()
in that case? - In this line https://github.com/laudrup/lz4_stream/blob/6b015cbe786291733b6f3aa03e45d307bc4ae527/include/lz4_stream.h#L65 it might be beneficial to
+ LZ4F_HEADER_SIZE_MAX
for efficiency, to ensure both the header and the first chunk fit into the allocated buffer (like I do in Haskell here). - The examples like
lz4_compress.cpp
don't do any error handling, like complaining when the input file does not exist or when the output file cannot be written to. They exhibit somewhat funny behaviour, such as that./lz4_compress nonexistent outfile
generates some bytes intoout
even though the input file does not exist, and it starts reading the input file even after theopenat()
syscall has already failed (checked withstrace
). While that's not a real problem, it can be confusing if you're playing the the example and thinking that it's working, when silent failure is actually happening.
Thanks for taking a look!
Hi Niklas,
Thanks a lot for your comments. It's very nice to have someone look at your code especially when they can come with such useful comments like you do.
-
That's is correct and something I remember I wanted to look into but never got around to it. It might be possible to fix by wrapping
ctx_
in aunique_ptr
, but I would have to look into the details. -
I don't think you're missing anything. It does indeed sound very inefficient which is especially problematic considering lz4 primary design goal i performance. That should definitely be fixed.
-
Nothing motivated that value. I have no idea why I chose that size, no benchmarking or anything has been done to chose that.
-
Sounds reasonable. Would be extra nice if the unittest could be extended to handle that case.
-
Also sounds very reasonable.
-
I guess I just wanted the example programs to be as simple as possible, but you are quite right that they might be a bit too simple :-) Failing silently is never good.
Thanks once again for some very, very useful comments and thanks even more for providing a pull request (which I'll look into shortly).
I cannot promise when I'll have time to look into it, so I'll be very happy to receive more pull requests.
Kind regards,
Kasper
Thanks for you reply!
- It might be possible to fix by wrapping
ctx_
in aunique_ptr
, but I would have to look into the details.
When I made PR #6 I also first tried with unique_ptr
, but I realised it does not really work nicely because the LZ4F_decompressionContext_t
is already the pointer type and the T
in unique_ptr<T>
would have to contain what that's pointed to (struct LZ4F_dctx_s
), which would mean operating on things that don't look like part of the familar public LZ4 API. So I decided to do the simple explicit RAII approach instead with the added decompression_context
.
- I don't think you're missing anything. It does indeed sound very inefficient which is especially problematic considering lz4 primary design goal i performance. That should definitely be fixed.
Do you have a good resource to learn exhaustively about how C++ streams work? I understand the part that lz4_stream currently uses, but in a somewhat related project, where we try to combine streams of boost_archive
and boost_process
to pipe a serialisation stream into a stream that spawns the lz4
command line utility, we get odd exceptions in sync()
and strace
shows a write(fd=-1, "", count=0) = EBADF
syscall, which seems clearly wrong. Most resources I've found so far just talk about stream basics and do not fully describe how one writes a very good, high-performant stream implementation.
- Nothing motivated that value. I have no idea why I chose that size, no benchmarking or anything has been done to chose that.
OK. That may be interesting to look into in the future.
There are 2 things to consider:
- The buffer size interfacing with the C++ streaming framework.
- The buffer size interfacing with lz4.
You currently use SrcBufSize = 256
for both; it determines the src_buf_.size()
which is then given to both setp()
and dest_buf_(LZ4F_compressBound(src_buf_.size(), ...))
. It would theoretically also be possible to decouple the two, using different sizes for those two concepts, but I don't know whether it would be beneficial.
For (1), the main data I've found so far is https://stackoverflow.com/questions/12997131/stdfstream-buffering-vs-manual-buffering-why-10x-gain-with-manual-buffering, where the second graph shows that < 1024 is very fast. One of the answers explains that this is due to hardcoded constants and the syscall implementaiton in the fstream
implementation; though I do not fully follow the second answer (it suggests that the way writev()
is used changes but doesn't really seem clear to me, nor could I find it in the code).
For (2), I don't know yet how the lz4 performance changes with buffer size and what the call overhead is, but I imagine that call overhead can be significant at 256 bytes buffers because the function usually lives in liblz4.so
so no inlining is done and it's a real jump.
- Sounds reasonable. Would be extra nice if the unittest could be extended to handle that case.
:heavy_check_mark:
- Also sounds very reasonable.
:heavy_check_mark:
- I guess I just wanted the example programs to be as simple as possible, but you are quite right that they might be a bit too simple :-) Failing silently is never good.
PR #6 fixes the silent failures by use of stream.exceptions(badbit)
.
It's still not great because C++'s stream exception handling is pretty poor on error reporting, e.g. a no space left on device
manifests only as:
% ./lz4_dropin < /bigfile > ramdisk/out.lz4
terminate called after throwing an instance of 'std::__ios_failure'
what(): basic_ios::clear: iostream error
It seems quite difficult to reliably get errno
out of C++, and many functions don't even document whether or not it is set.
Since #6 addresses 2 of the 6 points, I propose we keep this issue open and post updates/info on the other points as we find them.
- The 256 bytes buffer by default seems quite small.
I've done a super quick unsophisticated test with /usr/bin/time
and lz4_dropin
, which suggests that
- compression is 10% slower than
lz4
- decompression is 2x slower than
lz4
If I set the buffers to 16 KiB like
template <size_t SrcBufSize = (4096*4)>
class basic_ostream : public std::ostream
template <size_t SrcBufSize = (4096*4), size_t DestBufSize = (4096*4)>
class basic_istream : public std::istream
then the performance difference is almost completely gone.