seekable_format/examples/parallel_compression.c is not parallel
Describe the bug
The parallel compression example does not seem to be working in parallel. It runs on a single thread only, regardless of how many threads are requested.
To Reproduce
- Build the example parallel_compression
- Run: `./parallel_compression data.txt $((2*1024*1024)) 4`, where `data.txt` can be any file large enough that compression takes a handful of seconds (in my case, ASCII data, 2.8 GB).
- While the compression is running, look at e.g. `top`.
- Just a single thread is used. This is usually slower than the single-threaded seekable_compression example.
Expected behavior
Parallel execution should use multiple cores and be faster than single threaded.
The full zstd executable works as expected; given the -T option it is faster than without and saturates all cores.
Also, I'd love for --seekable to become a standard option for zstd, but that should probably be a separate issue.
Desktop:
- OS: Linux
- Version: Ubuntu 22.04
- Compiler: gcc 11.4
- Flags: -O3
- System: Quad-core Intel Core i7-1165G7
- Build system: Makefile
- Built from git version: commit 559762da12f54712d44f619098aa4a7e7bc5727b, Mar 14 2024
This is because the parallel_compression example builds against the static library, which builds by default with multi-threading off.
You can get it working like so, from the examples directory:

    make -j12 -C ../../../lib clean libzstd.a CPPFLAGS="-DZSTD_MULTITHREAD"
    make parallel_compression
    ./parallel_compression some_large_file 100000 12
There's honestly a lot of problems with this example:
- We're using POOL_* functions in such a way that the entire output is stored in memory!
- If the POOL_* functions were working as the author seems to intend, it would be racy, since there's no barrier between writing `job->done` and checking `jobs[i].done`.
- You can't use a non-seekable input file (e.g. stdin), since we allocate an array of jobs based on the file size.
- The CLI for parallel_compression isn't great. You can't specify an output filename, or a compression level, for example. And you have to understand what is a reasonable frame size.
- You can't build by just passing `-j12` or `CPPFLAGS` to `make parallel_compression`, because the makefile builds the library using an entirely different make invocation.
- You can't link against the dynamic library (which does have multi-threading enabled by default), because we use the internal POOL_* functions.
To fix these:
- We could fix issue 5 by using `$(MAKE)` instead of `make`.
- We could fix issue 6 by adding pool.c to the parallel_compression build, I guess? Or a cleaner fix might be adding an internal convenience library for the pool functions.
- We could fix issue 4 by expanding on this example, though I'm not sure whether this is welcome in the zstd repo. Maybe the solution is just building an external program like t2sz.
- Fixing issues 1-3 really needs a change in how this example's I/O works. Ideally you'd have one thread that just stuffs frames into the pool, but then when a frame is finished it has to be written relatively quickly.
@Cyan4973 I'd be happy to look into this further, but I'd like guidance on what is/isn't welcome in this repo before I start if you don't mind. (I've written one of the most performant parallel-xz implementations, if you need proof I can do this.) Is there willingness to investigate adding --seekable to the standard zstd binary? If not, what sorts of changes to the example would feel ok to you, and which wouldn't?
Thanks!
Hey, thanks for the detailed analysis.
Having --seekable be part of default zstd would be very nice indeed.
Meanwhile, I'm using exactly the tool you linked to, t2sz.
As far as this example goes - it would be good to have working code of course.
Yes, I think it's a good idea to start working on this extension.
--seekable seems like a nice way to express it.
I'm slightly concerned about the general complexity of the complete solution, but as long as it can be broken down into smaller incremental steps of tractable complexity, I'm sure it will be possible to progress regularly towards the goal.
Yes, I'm a little concerned about the complexity as well. A few reasons:
- Blocks are limited to 128K I think, so we'll realistically need multiple frames, each of which contains multiple blocks.
- Preserving concurrency while adding multiple frames seems difficult, since our concurrency is currently internal to `ZSTD_compressStream2()`, which the docs at least imply compresses just one frame.
The options I see for expressing seekable mode are:
- Add seekable mode internal to libzstd, as a parameter. We add a parameter like `ZSTD_c_frameSize=BYTES`, then make `ZSTD_compressStream2()` and friends just write frame headers/footers at the right places. We just re-use our existing concurrency model. The docs change to make it clear that `compressStream2()` can output multiple frames.
  a. There are a few potential variants of this, e.g. making the parameter "number of blocks in a frame", or reusing jobSize as the frame size.
  b. `getFrameProgression()` becomes very poorly named now, ugh. Maybe we add an alias `getProgression()`?
- Add seekable mode internal to libzstd, but with new API functions to call. This could prevent changes to the implied single-frame behavior of functions like `compressStream2()`. But it may complicate the API needlessly.
- Add seekable mode only in the zstd program, with no changes to libzstd. I really don't like this, it doesn't seem fun to add concurrency to zstd in a way that doesn't interfere with the concurrency already present in libzstd.
I'm partial to option 1.
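Under option 1, caller code might look like the sketch below. Note that `ZSTD_c_frameSize` is the hypothetical parameter proposed above and does not exist in libzstd today, so this is pseudocode, not a working example:

```c
ZSTD_CCtx* cctx = ZSTD_createCCtx();
ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, 8);        /* existing parameter */
ZSTD_CCtx_setParameter(cctx, ZSTD_c_frameSize, 2 << 20);  /* hypothetical: start a
                                                             new frame every 2 MiB
                                                             of input */
/* ZSTD_compressStream2() would then emit a sequence of independent frames,
 * with the seekable jump table appended when flushing with ZSTD_e_end. */
```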
The current seekable format is designed around the idea that each independent chunk is a frame. So yes, by design, it's a multi-frames topic.
Changing ZSTDMT_* to output multiple frames instead of a single one is technically possible.
It's not even that difficult to imagine: just close and restart a frame at decided breakpoints.
It introduces a few more questions about the nature of the interface, because additional information (boundaries between frames, or possibly the content size within each frame) now becomes useful to transmit (for example, to write the jump table at the end).
But it's also a high risk operation, because this code is used to compress everything from the zstd CLI. So a subtle vulnerability introduced here would have a pretty huge blast radius.
Option 3 has the advantage of containing the risks to the newly added code only, with less chance of impacting normal operations.
A possible intermediate step could be to duplicate zstdmt_compress and make all the modifications there, possibly taking over the original one later if it manages to become a clear and safe superset. Maybe that's what you called Option 2.
Ok, I will try that approach. I'm going on vacation soon, so maybe in 6 weeks I'll be back with more :)
Sounds promising. I would still be happy with --seekable even without parallelism, but it sounds like you have it mostly figured out already. Thanks!
I'm also interested in this; am curious if there is any working code to try?
No, I haven't looked at this in detail yet. Let me know if you discover anything!
I did add parallelism to the Go seekable-zst implementation, which is an ok option for now, but you lose some of the fancy features of the zstd tool: https://github.com/SaveTheRbtz/zstd-seekable-format-go
Well, my C isn't what it was 35 years ago but as you asked, I'd start from completely the opposite approach. Looking at https://github.com/facebook/zstd/blob/f0937b83d9a32cb2b59f99bbc4db717ae6e83c9b/lib/compress/zstd_compress.c#L6392 I would add seekable support first, with just a command line switch. After all, we'd be on a branch, not main. Once working, that first version could even be merged given the obvious interest (and utility, e.g. in a command pipeline). The idea being that unless stdin works, I doubt a parallel implementation would ever be merged.
Then, the parallel work. Even if the first seekable design is suboptimal for the parallel case, there's no reason it couldn't be adjusted once the deficit is identified.