zstd
zstd copied to clipboard
seekable_format no zstd header compressing empty string to stream
Dear zstd maintainers,
I believe I ran into a small bug in zstd_seekable. When compressing an empty string, the output of the compressed data does not include the zstd seekable header:
// Example 1, compressing "test"
char *data_in = (char *) malloc_or_die(255 * sizeof(char));
data_in = "test\0";
char *data_out = (char *) malloc_or_die(255 * 255 * sizeof(char));
ZSTD_seekable_CStream *s = ZSTD_seekable_createCStream();
size_t const initResult = ZSTD_seekable_initCStream(s, 1, 1, 1024 * 1024);
ZSTD_inBuffer input = { data_in, strlen(data_in), 0 };
ZSTD_outBuffer output = { data_out, 255*255, 0 };
size_t toRead = ZSTD_seekable_compressStream(s, &output, &input);
size_t const remainingToFlush = ZSTD_seekable_endStream(s, &output);
printf("size compressed output: %i\n",output.pos);
ZSTD_seekable_freeCStream(s);
// Example 2, compressing "" (empty string)
char *data_in = (char *) malloc_or_die(255 * sizeof(char));
data_in = "\0";
char *data_out = (char *) malloc_or_die(255 * 255 * sizeof(char));
ZSTD_seekable_CStream *s = ZSTD_seekable_createCStream();
size_t const initResult = ZSTD_seekable_initCStream(s, 1, 1, 1024 * 1024);
ZSTD_inBuffer input = { data_in, strlen(data_in), 0 };
ZSTD_outBuffer output = { data_out, 255*255, 0 };
size_t toRead = ZSTD_seekable_compressStream(s, &output, &input);
size_t const remainingToFlush = ZSTD_seekable_endStream(s, &output);
printf("size compressed output: %i\n",output.pos);
ZSTD_seekable_freeCStream(s);
results in the following two compressed strings:
test 1:
00000000: 28b5 2ffd 0048 2100 0074 6573 745e 2a4d (./..H!..test^*M
00000010: 1815 0000 000d 0000 0004 0000 0039 8167 .............9.g
00000020: db01 0000 0080 b1ea 928f ..........
test 2:
00000000: 5e2a 4d18 0900 0000 0000 0000 80b1 ea92 ^*M.............
00000010: 8f .
In case an empty string is compressed (test 2), the compressed output does not start with the ZSTD MAGIC but with a skippable frame (5e2a 4d18
) directly.
I think I have found the issue and resolved it in this PR. After the patch, it outputs the following:
00000000: 28b5 2ffd 2000 0100 005e 2a4d 1815 0000 (./. ....^*M....
00000010: 0009 0000 0000 0000 0099 e9d8 5101 0000 ............Q...
00000020: 0080 b1ea 928f ......
The PR also contains a test case testing for the ZSTD MAGIC in the compressed output of an empty string.
all best,
Youri
Hi @iburinoc @Cyan4973,
From looking into the git history, I think you are the most technically grounded committers regarding this part of the codebase. Would you like to take a look at it? I believe it's a rather simple commit, but the patch would be really helpful in proceeding with our bioinformatics DNA sequence compressor.
kind regards,
Youri
Closed and re-opened the PR in order to trigger the CI tests to run again.
Closing this in favor of a superseding PR by @daniellerozenblit