c-blosc icon indicating copy to clipboard operation
c-blosc copied to clipboard

allow 64-bit buffer sizes

Open stevengj opened this issue 9 years ago • 16 comments

blosc_compress returns the size of compressed block according to blosc.h. However, the return type is int, which has an obvious susceptibility to overflow in the common case of a 64-bit machine where int is 32 bits.

Update: Upon looking through blosc.c, it seems that 32-bit limits are found throughout the code, not just in the API. At least on 64-bit machines, it would be highly desirable to support 64-bit buffer sizes.

Ideally, you would return size_t. But this is unsigned, and I see that you want to use negative return values to indicate errors. Two possibilities:

  • Return ptrdiff_t (C90) or (in C99) intptr_t. This is a 64-bit signed value on 64-bit machines, and a 32-bit signed value on 32-bit machines.
  • Return size_t, and return an error code in some other way. errno would be the obvious way for blosc_compress. For blosc_compress_ctx one option would be to add an int *errno parameter.

stevengj avatar Nov 10 '14 20:11 stevengj

IIRC the size of the input buffer is is subject to BLOSC_MAX_BUFFERSIZE. This is defined as:

/* Maximum buffer size to be compressed */
#define BLOSC_MAX_BUFFERSIZE (INT_MAX - BLOSC_MAX_OVERHEAD)

Where BLOSC_MAX_OVERHEAD is the size of the header at the start of the compressed buffer.

So I guess the initial thought was to use int since the size of the compressed buffer will never exceed the maximum integer value. In fact, in the worst case the input buffer will be of size BLOSC_MAX_BUFFERSIZE and not compressible so the size of the output buffer will be exactly of size INT_MAX.

Do you think that will be enough or should we be ideally be programming defensively?

esc avatar Nov 10 '14 20:11 esc

This just raises the question: why are you using INT_MAX and not SIZE_MAX for the maximum buffer size?

The world is 64 bits nowadays...it seems short-sighted to limit yourself by design to 32-bit buffers.

stevengj avatar Nov 10 '14 22:11 stevengj

On a cursory glance through the source code, I'm not seeing any fundamental reason why you need to use int32_t everywhere. Why not use intptr_t for all of the counters and sizes (or size_t if you don't need signed values) and expand your header to 64-bit lengths?

Are you worried about binary compatibility with 32-bit machines? Wouldn't it be sufficient to define the binary format as storing 64-bit lengths in the compression header, and read these 64-bit lengths on both 32-bit and 64-bit machines, but return an error on a 32-bit machine if the user requests decompression of a buffer whose header indicates a size bigger than SIZE_MAX?

stevengj avatar Nov 11 '14 01:11 stevengj

It seems like the only reason that Google's Snappy limited itself to 32-bit sizes was backwards compatibility.

stevengj avatar Nov 11 '14 01:11 stevengj

The main reason for choosing 32-bit buffer sizes was to save space for the header (see README_HEADER.rst in the home dir). Blosc is meant to compress small to medium buffers, so keeping that under a minimum was deemed important.

For Blosc2 we might consider a new header, and I am open to evaluate a variation that allows 64-bit buffers. Added the idea in rev 49b4240.

FrancescAlted avatar Nov 11 '14 17:11 FrancescAlted

Thanks. (Shouldn't you leave the issue open to track work on that feature, if you are open to implementing it?)

You could always use the first few bytes in the header (the version, or the flags) to decide whether it is a short header or a long header (with the latter only for buffers whose size overflows 32 bits). That way you wouldn't sacrifice any space in the common case, and could remain backwards compatible for 32-bit buffers.

stevengj avatar Nov 11 '14 19:11 stevengj

This is a very important thing and will not be overlooked for sure, so do not worry :)

2014-11-11 20:10 GMT+01:00 Steven G. Johnson [email protected]:

(Shouldn't you leave the issue open to track work on that feature, if you are open to implementing it?)

You could always use the first word in the header to decide whether it is a short header or a long header (with the latter only for buffers whose size overflows 32 bits.)

— Reply to this email directly or view it on GitHub https://github.com/Blosc/c-blosc/issues/67#issuecomment-62599462.

Francesc Alted

FrancescAlted avatar Nov 12 '14 07:11 FrancescAlted

The advantage of leaving the issue open is that it makes it easier to follow progress; e.g. I will get a notification on Github when the issue is closed by a patch. Moreover, anyone clicking on the issue from a link (e.g. from a mailing list) will instantly see whether it has been resolved yet. (You could mark it with an "enhancement" label to make it clear that this is a feature request and not a bug per se.)

stevengj avatar Nov 13 '14 16:11 stevengj

Yeah, that's a good point. Reopening.

FrancescAlted avatar Nov 14 '14 12:11 FrancescAlted

BTW, after the discussions you still think that a 64-bit header is a good idea for c-blosc itself?

FrancescAlted avatar Nov 14 '14 12:11 FrancescAlted

Yes. It makes the library quite awkward to use for in memory buffers---like a version of memcpy that has a 32-bit limit. And there seems likely to be no performance/memory penalty to support 64-bit buffers on 64-bit machines.

stevengj avatar Nov 14 '14 13:11 stevengj

I would also like to use 64-bit blosc. 32-bit is no longer the widely available default.

GlenHertz avatar Jan 08 '15 03:01 GlenHertz

@GlenHertz just to be clear, c-blosc supported 64-bit compilation mode since the beginning. What we are talking about here is to possibility to have buffers larger than 2 GB (32-bit). As said, this will take place in Blosc2.

FrancescAlted avatar Jul 05 '15 10:07 FrancescAlted

Is someone working on 64bit support. I plan to integrate blocs as transformation layer to ADIOS but without 64bit buffers I need to split large buffers to smaller one.

psychocoderHPC avatar Nov 26 '16 17:11 psychocoderHPC

@psychocoderHPC Sorry but there are no plans for suppoting 64bit buffers in Blosc 1. You might be interested in Blosc2 which has full 64bit support for buffers. And although Blosc2 is not ready for production yet, I'd be grateful to see people having a try at it and reporting back possible issues.

FrancescAlted avatar Nov 28 '16 10:11 FrancescAlted

You can also look at https://github.com/Blosc/bloscpack which has a framing format and chunking code for larger Blosc buffers.

esc avatar Nov 28 '16 11:11 esc