zstd icon indicating copy to clipboard operation
zstd copied to clipboard

seekable_format no zstd header compressing empty string to stream

Open yhoogstrate opened this issue 2 years ago • 1 comments

Dear zstd maintainers,

I believe I ran into a small bug in zstd_seekable. When compressing an empty string, the output of the compressed data does not include the zstd seekable header:

// Example 1, compressing "test"

char *data_in = (char *) malloc_or_die(255 * sizeof(char));
data_in = "test\0";

char *data_out = (char *) malloc_or_die(255 * 255 * sizeof(char));

ZSTD_seekable_CStream *s = ZSTD_seekable_createCStream();
size_t const initResult = ZSTD_seekable_initCStream(s, 1, 1, 1024 * 1024);

ZSTD_inBuffer input = { data_in, strlen(data_in), 0 };
ZSTD_outBuffer output = { data_out, 255*255, 0 };

size_t toRead = ZSTD_seekable_compressStream(s, &output, &input);
size_t const remainingToFlush = ZSTD_seekable_endStream(s, &output);

printf("size compressed output: %i\n",output.pos);

ZSTD_seekable_freeCStream(s);
// Example 2, compressing "" (empty string)

char *data_in = (char *) malloc_or_die(255 * sizeof(char));
data_in = "\0";

char *data_out = (char *) malloc_or_die(255 * 255 * sizeof(char));

ZSTD_seekable_CStream *s = ZSTD_seekable_createCStream();
size_t const initResult = ZSTD_seekable_initCStream(s, 1, 1, 1024 * 1024);

ZSTD_inBuffer input = { data_in, strlen(data_in), 0 };
ZSTD_outBuffer output = { data_out, 255*255, 0 };

size_t toRead = ZSTD_seekable_compressStream(s, &output, &input);
size_t const remainingToFlush = ZSTD_seekable_endStream(s, &output);

printf("size compressed output: %i\n",output.pos);

ZSTD_seekable_freeCStream(s);

results in the following two compressed strings:

test 1:

00000000: 28b5 2ffd 0048 2100 0074 6573 745e 2a4d  (./..H!..test^*M
00000010: 1815 0000 000d 0000 0004 0000 0039 8167  .............9.g
00000020: db01 0000 0080 b1ea 928f                 ..........
test 2:

00000000: 5e2a 4d18 0900 0000 0000 0000 80b1 ea92  ^*M.............
00000010: 8f                                       .

In case an empty string is compressed (test 2), the compressed output does not start with the ZSTD MAGIC but with a skippable frame (5e2a 4d18) directly.

I think I have found the issue and resolved it in this PR. After the patch, it outputs the following:

00000000: 28b5 2ffd 2000 0100 005e 2a4d 1815 0000  (./. ....^*M....
00000010: 0009 0000 0000 0000 0099 e9d8 5101 0000  ............Q...
00000020: 0080 b1ea 928f                           ......

The PR also contains a test case testing for the ZSTD MAGIC in the compressed output of an empty string.

all best,

Youri

yhoogstrate avatar Feb 08 '22 13:02 yhoogstrate

Hi @iburinoc @Cyan4973,

From looking into the git history, I think you are the most technically grounded committers regarding this part of the codebase. Would you like to take a look at it? I believe it's a rather simple commit, but the patch would be really helpful in proceeding with our bioinformatics DNA sequence compressor.

kind regards,

Youri

yhoogstrate avatar Mar 23 '22 15:03 yhoogstrate

Closed and re-opened the PR in order to trigger the CI tests to run again.

daniellerozenblit avatar Dec 12 '22 21:12 daniellerozenblit

Closing this in favor of a superseding PR by @daniellerozenblit

yhoogstrate avatar Dec 14 '22 12:12 yhoogstrate