c-blosc
c-blosc copied to clipboard
allow 64-bit buffer sizes
blosc_compress
returns the size of compressed block according to blosc.h
. However, the return type is int
, which has an obvious susceptibility to overflow in the common case of a 64-bit machine where int
is 32 bits.
Update: Upon looking through blosc.c, it seems that 32-bit limits are found throughout the code, not just in the API. At least on 64-bit machines, it would be highly desirable to support 64-bit buffer sizes.
Ideally, you would return size_t
. But this is unsigned, and I see that you want to use negative return values to indicate errors. Two possibilities:
- Return
ptrdiff_t
(C90) or (in C99)intptr_t
. This is a 64-bit signed value on 64-bit machines, and a 32-bit signed value on 32-bit machines. - Return
size_t
, and return an error code in some other way.errno
would be the obvious way forblosc_compress
. Forblosc_compress_ctx
one option would be to add anint *errno
parameter.
IIRC the size of the input buffer is is subject to BLOSC_MAX_BUFFERSIZE
. This is defined as:
/* Maximum buffer size to be compressed */
#define BLOSC_MAX_BUFFERSIZE (INT_MAX - BLOSC_MAX_OVERHEAD)
Where BLOSC_MAX_OVERHEAD
is the size of the header at the start of the compressed buffer.
So I guess the initial thought was to use int
since the size of the compressed buffer will never exceed the maximum integer value. In fact, in the worst case the input buffer will be of size BLOSC_MAX_BUFFERSIZE
and not compressible so the size of the output buffer will be exactly of size INT_MAX
.
Do you think that will be enough or should we be ideally be programming defensively?
This just raises the question: why are you using INT_MAX
and not SIZE_MAX
for the maximum buffer size?
The world is 64 bits nowadays...it seems short-sighted to limit yourself by design to 32-bit buffers.
On a cursory glance through the source code, I'm not seeing any fundamental reason why you need to use int32_t
everywhere. Why not use intptr_t
for all of the counters and sizes (or size_t
if you don't need signed values) and expand your header to 64-bit lengths?
Are you worried about binary compatibility with 32-bit machines? Wouldn't it be sufficient to define the binary format as storing 64-bit lengths in the compression header, and read these 64-bit lengths on both 32-bit and 64-bit machines, but return an error on a 32-bit machine if the user requests decompression of a buffer whose header indicates a size bigger than SIZE_MAX
?
It seems like the only reason that Google's Snappy limited itself to 32-bit sizes was backwards compatibility.
The main reason for choosing 32-bit buffer sizes was to save space for the header (see README_HEADER.rst
in the home dir). Blosc is meant to compress small to medium buffers, so keeping that under a minimum was deemed important.
For Blosc2 we might consider a new header, and I am open to evaluate a variation that allows 64-bit buffers. Added the idea in rev 49b4240.
Thanks. (Shouldn't you leave the issue open to track work on that feature, if you are open to implementing it?)
You could always use the first few bytes in the header (the version, or the flags) to decide whether it is a short header or a long header (with the latter only for buffers whose size overflows 32 bits). That way you wouldn't sacrifice any space in the common case, and could remain backwards compatible for 32-bit buffers.
This is a very important thing and will not be overlooked for sure, so do not worry :)
2014-11-11 20:10 GMT+01:00 Steven G. Johnson [email protected]:
(Shouldn't you leave the issue open to track work on that feature, if you are open to implementing it?)
You could always use the first word in the header to decide whether it is a short header or a long header (with the latter only for buffers whose size overflows 32 bits.)
— Reply to this email directly or view it on GitHub https://github.com/Blosc/c-blosc/issues/67#issuecomment-62599462.
Francesc Alted
The advantage of leaving the issue open is that it makes it easier to follow progress; e.g. I will get a notification on Github when the issue is closed by a patch. Moreover, anyone clicking on the issue from a link (e.g. from a mailing list) will instantly see whether it has been resolved yet. (You could mark it with an "enhancement" label to make it clear that this is a feature request and not a bug per se.)
Yeah, that's a good point. Reopening.
BTW, after the discussions you still think that a 64-bit header is a good idea for c-blosc itself?
Yes. It makes the library quite awkward to use for in memory buffers---like a version of memcpy that has a 32-bit limit. And there seems likely to be no performance/memory penalty to support 64-bit buffers on 64-bit machines.
I would also like to use 64-bit blosc. 32-bit is no longer the widely available default.
@GlenHertz just to be clear, c-blosc supported 64-bit compilation mode since the beginning. What we are talking about here is to possibility to have buffers larger than 2 GB (32-bit). As said, this will take place in Blosc2.
Is someone working on 64bit support. I plan to integrate blocs as transformation layer to ADIOS but without 64bit buffers I need to split large buffers to smaller one.
@psychocoderHPC Sorry but there are no plans for suppoting 64bit buffers in Blosc 1. You might be interested in Blosc2 which has full 64bit support for buffers. And although Blosc2 is not ready for production yet, I'd be grateful to see people having a try at it and reporting back possible issues.
You can also look at https://github.com/Blosc/bloscpack which has a framing format and chunking code for larger Blosc buffers.