
Multithreaded improvements

Open j-colburn opened this issue 4 years ago • 2 comments

It looks like the multithreaded implementation in base zstd is a bit slower than the other available threading implementations:

zstd-mt: https://github.com/mcmilk/zstdmt/releases
pzstd: https://github.com/facebook/zstd/tree/dev/contrib/pzstd

One issue is CPU detection: with -T0, zstd doesn't count hyperthreaded cores, but at higher compression levels they can provide additional speedup.
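For the library side of this, an application can already sidestep the CLI's detection by counting logical cores itself and passing that as the worker count. A minimal sketch, assuming libzstd >= 1.4.0 built with multithreading support; the function name `compress_all_cores` is just illustrative:

```c
#include <unistd.h>
#include <zstd.h>

size_t compress_all_cores(void *dst, size_t dstCap,
                          const void *src, size_t srcSize, int level)
{
    /* sysconf counts online logical processors, i.e. hyperthread siblings too */
    long logical = sysconf(_SC_NPROCESSORS_ONLN);

    ZSTD_CCtx *cctx = ZSTD_createCCtx();
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, level);
    /* nbWorkers is a no-op error if libzstd was built without ZSTD_MULTITHREAD */
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_nbWorkers, (int)(logical > 0 ? logical : 1));

    size_t const written = ZSTD_compress2(cctx, dst, dstCap, src, srcSize);
    ZSTD_freeCCtx(cctx);
    return written;   /* caller should check with ZSTD_isError() */
}
```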

Another issue is that decompression doesn't use threads; even adding a couple of threads dedicated to reading and writing would speed up single-block streams. Perhaps decompression is fast enough that true per-block multithreaded decompression (like plzip, lbzip2, or pbzip2) would not provide much benefit.
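To illustrate the dedicated-I/O-thread idea, here is a minimal sketch (not zstd's implementation): one pthread stays a chunk ahead of a single-threaded ZSTD_decompressStream() consumer, so file reads overlap with decompression. The two-slot ping-pong queue is deliberately simplistic and error handling is omitted:

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

#define CHUNK (1u << 20)                 /* 1 MiB read granularity (arbitrary) */

static struct {
    FILE  *in;
    char   buf[2][CHUNK];
    size_t len[2];
    int    ready[2];                     /* slot filled by reader, not yet consumed */
    pthread_mutex_t mu;
    pthread_cond_t  cv;
} q = { .mu = PTHREAD_MUTEX_INITIALIZER, .cv = PTHREAD_COND_INITIALIZER };

/* Reader thread: keeps one compressed chunk ahead of the decompressor. */
static void *reader(void *arg)
{
    (void)arg;
    for (int s = 0;; s ^= 1) {
        pthread_mutex_lock(&q.mu);
        while (q.ready[s]) pthread_cond_wait(&q.cv, &q.mu);   /* wait for free slot */
        pthread_mutex_unlock(&q.mu);

        size_t n = fread(q.buf[s], 1, CHUNK, q.in);           /* read next chunk */

        pthread_mutex_lock(&q.mu);
        q.len[s] = n;
        q.ready[s] = 1;                                       /* publish it */
        pthread_cond_broadcast(&q.cv);
        pthread_mutex_unlock(&q.mu);
        if (n == 0) return NULL;                              /* EOF sentinel */
    }
}

int main(void)
{
    q.in = stdin;
    pthread_t t;
    pthread_create(&t, NULL, reader, NULL);

    ZSTD_DStream *ds = ZSTD_createDStream();
    char *out = malloc(ZSTD_DStreamOutSize());

    for (int s = 0;; s ^= 1) {
        pthread_mutex_lock(&q.mu);
        while (!q.ready[s]) pthread_cond_wait(&q.cv, &q.mu);  /* wait for chunk */
        pthread_mutex_unlock(&q.mu);

        if (q.len[s] == 0) break;                             /* reader hit EOF */

        ZSTD_inBuffer ib = { q.buf[s], q.len[s], 0 };
        while (ib.pos < ib.size) {                            /* drain the chunk */
            ZSTD_outBuffer ob = { out, ZSTD_DStreamOutSize(), 0 };
            ZSTD_decompressStream(ds, &ob, &ib);
            fwrite(out, 1, ob.pos, stdout);
        }

        pthread_mutex_lock(&q.mu);
        q.ready[s] = 0;                                       /* hand slot back */
        pthread_cond_broadcast(&q.cv);
        pthread_mutex_unlock(&q.mu);
    }

    pthread_join(t, NULL);
    ZSTD_freeDStream(ds);
    free(out);
    return 0;
}
```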

Jon

j-colburn avatar Feb 16 '21 23:02 j-colburn

This is a correct description.

Multi-threaded decompression is in our task list, although there is no release date set yet.

One reason is indeed that decompression speed is generally fast enough with a single thread (faster than SSD).

Another reason is that it's complex. The core issue here is that zstd -T# produces a single compact frame, as opposed to pzstd and the mcmilk variant, which produce multiple independent payloads. Decompressing multiple independent payloads in parallel can be done fairly easily, while untangling dependencies within a single frame is more complex.
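To make the "independent payloads" point concrete, here is a hedged sketch: a pzstd/zstdmt-style file is just several complete zstd frames back to back, so frame boundaries can be found without decompressing anything, and each frame could then be handed to its own worker. This assumes the whole file is already in memory, and error handling is omitted:

```c
#include <stdio.h>
#include <zstd.h>

void list_frames(const void *data, size_t size)
{
    const char *p = data;
    size_t remaining = size;
    unsigned idx = 0;
    while (remaining > 0) {
        /* Compressed size of the frame starting at p, determined from the
         * frame and block headers only. */
        size_t const frameSize = ZSTD_findFrameCompressedSize(p, remaining);
        if (ZSTD_isError(frameSize)) break;
        printf("frame %u: %zu compressed bytes\n", idx++, frameSize);
        /* ...dispatch (p, frameSize) to a worker for independent decompression... */
        p += frameSize;
        remaining -= frameSize;
    }
}
```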

So why generate a single frame? One consequence is that zstd -T# doesn't lose compression ratio as the number of threads increases. In contrast, these "separate payloads" strategies come at a cost: each new payload starts by compressing less well, then catches up later on. Another consequence is that zstd -T0 generates the same compressed payload whatever the number of threads on the target platform, which is nice for reproducibility. Finally, there is a header benefit: the entire content is described in a single header at the beginning, while the multiple-payloads strategy has to discover it payload after payload, and therefore doesn't know upfront how much data it encapsulates.
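As a small illustration of the header point (a sketch added here, not part of the original comment): ZSTD_getFrameContentSize() reads the decompressed size, when it was recorded, from the first bytes of a single frame, whereas for a multi-frame file it only describes the first frame:

```c
#include <stdio.h>
#include <zstd.h>

void print_content_size(const void *src, size_t srcSize)
{
    /* src is assumed to point at the start of a zstd frame */
    unsigned long long const sz = ZSTD_getFrameContentSize(src, srcSize);
    if (sz == ZSTD_CONTENTSIZE_UNKNOWN)
        printf("content size not recorded in the frame header\n");
    else if (sz == ZSTD_CONTENTSIZE_ERROR)
        printf("not a valid zstd frame\n");
    else
        printf("frame will decompress to %llu bytes\n", sz);
}
```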

Anyway, we expect some progress on this topic in the future, just not in the short term.

Cyan4973 avatar Feb 17 '21 18:02 Cyan4973

> One reason is indeed that decompression speed is generally fast enough with a single thread (faster than SSD).

That is becoming less true as time marches on and people are switching to modern NVMe.

The fastest CPU on openbenchmarking.org can do level 3 decompression at 2.4 GB/s, but my current SSD does 7.3 GB/s reads.

I don't own the fastest CPU to reach those speeds, nor the fastest SSD for that matter; both get better in the higher price ranges.

Is there any news on this topic from the last couple of years?
Thanks in advance!

C0rn3j avatar Nov 26 '24 15:11 C0rn3j