Requests for adding internal multithreading for compression libraries
The new version of lzbench in the master branch introduces the -I# option, which enables compression using # internal (built-in) threads when supported. It is currently available only for the following compressors: fast-lzma2, kanzi, lzham, lzma, xz, and zstd.
Supported compressors must provide a memory-to-memory compression API that lets the caller specify the number of threads, e.g.:
int64_t lzbench_fastlzma2_compress(char *inbuf, size_t insize, char *outbuf, size_t outsize, codec_options_t *codec_options)
{
    size_t ret = FL2_compressMt(outbuf, outsize, inbuf, insize, codec_options->level, codec_options->threads);
    if (FL2_isError(ret)) return 0;
    return ret;
}
The results with 8 threads (-I8) on tarred Silesia Corpus:
lzbench 2.1.1 | GCC 14.2.0 | 64-bit Linux | AMD EPYC 9554 64-Core Processor
Compressor           I_Threads   Compress.  Decompress.  Compr. size   Ratio  Filename
memcpy                       1  16265 MB/s   16259 MB/s    211947520  100.00  ../silesia.tar
fastlzma2 1.0.1 -1           8   34.1 MB/s     408 MB/s     59573883   28.11  ../silesia.tar
fastlzma2 1.0.1 -5           8   37.2 MB/s     569 MB/s     51327104   24.22  ../silesia.tar
fastlzma2 1.0.1 -10          8   24.0 MB/s     108 MB/s     48668818   22.96  ../silesia.tar
lzham 1.0 -d26 -1            8   3.71 MB/s     373 MB/s     54899526   25.90  ../silesia.tar
lzham 1.0 -d26 -4            8   2.43 MB/s     407 MB/s     51177148   24.15  ../silesia.tar
lzma 25.01 -4                8   37.7 MB/s     259 MB/s     55968602   26.41  ../silesia.tar
lzma 25.01 -9                8   4.46 MB/s     103 MB/s     48683275   22.97  ../silesia.tar
xz 5.8.1 -4                  8   35.7 MB/s     142 MB/s     52458204   24.75  ../silesia.tar
xz 5.8.1 -9                  8   3.11 MB/s     146 MB/s     48766532   23.01  ../silesia.tar
zstd 1.5.7 -9                8    384 MB/s    1154 MB/s     59179159   27.92  ../silesia.tar
zstd 1.5.7 -14               8   56.8 MB/s    1180 MB/s     57396270   27.08  ../silesia.tar
zstd 1.5.7 -18               8   19.1 MB/s    1070 MB/s     53313874   25.15  ../silesia.tar
zstd 1.5.7 -22               8   2.27 MB/s     999 MB/s     52288411   24.67  ../silesia.tar
kanzi 2.4.0 -5               8    125 MB/s     261 MB/s     54013491   25.48  ../silesia.tar
kanzi 2.4.0 -6               8   81.1 MB/s     150 MB/s     49517551   23.36  ../silesia.tar
kanzi 2.4.0 -7               8   55.2 MB/s    79.0 MB/s     47308156   22.32  ../silesia.tar
bsc 3.3.11 -m0 -e2           1   14.9 MB/s    16.8 MB/s     48743676   23.00  ../silesia.tar
bsc 3.3.11 -m0 -e1           1   20.0 MB/s    28.1 MB/s     49295208   23.26  ../silesia.tar
bsc 3.3.11 -m5 -e1           1   34.6 MB/s    15.3 MB/s     49609096   23.41  ../silesia.tar
bsc 3.3.11 -m6 -e1           1   33.3 MB/s    14.1 MB/s     49246610   23.24  ../silesia.tar
bzip3 1.5.2 -1               1   13.5 MB/s    16.4 MB/s     50325695   23.74  ../silesia.tar
bzip3 1.5.2 -3               1   13.6 MB/s    16.2 MB/s     48318294   22.80  ../silesia.tar
bzip3 1.5.2 -6               1   12.9 MB/s    13.1 MB/s     47360186   22.35  ../silesia.tar
bzip3 1.5.2 -9               1   11.7 MB/s    11.6 MB/s     48753972   23.00  ../silesia.tar
@IlyaGrebnov Currently bsc supports MT with compression_omp(). How hard would it be to add, for example:
int bsc_compress_mt(const unsigned char * input, unsigned char * output, int n, int lzpHashSize, int lzpMinLen, int blockSorter, int coder, int features, int threads);
@iczelia I see that MT in bzip3 is achieved using bz3_encode_blocks() in main.c. Please consider adding, for example:
BZIP3_API int bz3_compress_mt(u32 block_size, const u8 * const in, u8 * out, size_t in_size, size_t * out_size, int threads);
Bzip3, like lbzip2, supports multithreading in the client. Nothing wrong with that. In the lzbench context it would be equivalent to the `-T' option, with no internal threading. I think that's a perfectly logical solution for BWT compressors, as they operate on blocks anyway. Plzip does exactly the same: it splits the input into blocks and then compresses them separately, with no internal threading in lzlib.
@tansy The problem shows up when one wants to compare bsc to other compressors. With -T, some ST-only compressors artificially show MT compression/decompression numbers, which is unfair because it does not represent real performance outside of the bench. With the -I option, bsc is currently ST only, so it is unfair to compare it to other MT compressors that support -I.
@IlyaGrebnov Currently bsc supports MT with compression_omp(). How hard would it be to add, for example:
int bsc_compress_mt(const unsigned char * input, unsigned char * output, int n, int lzpHashSize, int lzpMinLen, int blockSorter, int coder, int features, int threads);
It’s possible, but not trivial. In bsc I cap the thread count with omp_set_num_threads(numThreads).
As @flanglet noted, we need to compare apples to apples. bsc supports two parallelism models:
- Block-level parallelism: compress many independent blocks in parallel; each block runs single-threaded. You control concurrency by the number of blocks in flight.
- Intra-block parallelism: compress a single block using multiple threads; the output is identical to single-threaded compression, but it can use multiple cores. Only a few compressors support this mode, and even fewer scale efficiently.
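To make the first model concrete, here is a minimal C++ sketch of block-level parallelism. It is not bsc or lzbench code: `toy_compress` is a hypothetical run-length stand-in for any single-threaded codec call, and `compress_blocks` is an illustrative name. The point is only that each worker handles whole blocks on its own and the codec itself stays single-threaded.

```cpp
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a single-threaded codec: simple run-length encoding.
std::string toy_compress(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size();) {
        std::size_t run = 1;
        while (i + run < in.size() && in[i + run] == in[i] && run < 255) ++run;
        out.push_back(static_cast<char>(run));  // run length (1..255)
        out.push_back(in[i]);                   // repeated byte
        i += run;
    }
    return out;
}

// Block-level parallelism: the input is split into independent blocks and
// each worker thread compresses whole blocks with the single-threaded codec.
std::vector<std::string> compress_blocks(const std::string& input,
                                         std::size_t block_size,
                                         unsigned threads) {
    if (block_size == 0) block_size = 1;
    if (threads == 0) threads = 1;
    const std::size_t n_blocks = (input.size() + block_size - 1) / block_size;
    std::vector<std::string> out(n_blocks);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threads; ++t) {
        workers.emplace_back([&, t] {
            // Static round-robin assignment: worker t takes blocks t, t+threads, ...
            // Each worker writes only to its own slots of `out`, so no locking is needed.
            for (std::size_t b = t; b < n_blocks; b += threads)
                out[b] = toy_compress(input.substr(b * block_size, block_size));
        });
    }
    for (auto& w : workers) w.join();
    return out;
}
```

Because blocks are independent, the per-block output does not depend on the number of workers; only the wall-clock time changes.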
which is unfair because it does not represent real performance outside of the bench
@flanglet: Then you use single threaded and calculate multi threaded yourself. That's only fair.
Block-level parallelism: compress many independent blocks in parallel;
Is that equivalent to `-T'?
This is implemented in bsc.cpp (not in the libbsc library) and is controlled by the host. For block-level parallelism, launch N threads however you prefer (ideally using a method consistent across other codecs) and keep libbsc single-threaded by omitting the LIBBSC_FEATURE_MULTITHREADING flag. For intra-block parallelism, treat the input as a single block, set the thread cap with omp_set_num_threads(numThreads) and enable multi-threading by passing LIBBSC_FEATURE_MULTITHREADING to the library.
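As a rough illustration of the intra-block model, the sketch below caps the OpenMP thread count with omp_set_num_threads() and then splits the work on a single block across threads. It does not call libbsc: the byte histogram is a hypothetical stand-in for the real per-block work, and `block_histogram` is an illustrative name. The result is the same for any thread count, which mirrors the "identical output" property described above.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

using Histogram = std::array<std::uint64_t, 256>;

// Intra-block parallelism on ONE block: threads share the work on the same
// input, and the result does not depend on how many threads were used.
Histogram block_histogram(const std::vector<std::uint8_t>& block, int threads) {
#ifdef _OPENMP
    omp_set_num_threads(threads > 0 ? threads : 1);  // host-side thread cap
#else
    (void)threads;  // built without OpenMP: the pragmas below are ignored, code runs serially
#endif
    Histogram total{};
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(block.size());
    #pragma omp parallel
    {
        Histogram local{};  // per-thread partial result
        #pragma omp for nowait
        for (std::ptrdiff_t i = 0; i < n; ++i)
            ++local[block[static_cast<std::size_t>(i)]];
        #pragma omp critical
        for (std::size_t s = 0; s < 256; ++s) total[s] += local[s];  // merge under a lock
    }
    return total;
}
```

Note that omp_set_num_threads() changes a process-wide default, so a host that mixes this with its own worker threads has to decide which layer owns the thread budget.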
"Then you use single threaded and calculate multi threaded yourself".
I don't think it works because 1) some compressors are ST for decompression and MT for compression, some others are MT-MT and others ST-ST, which means that the user has to know how each compressor works 2) more importantly, not all MT compressors scale the same way, so one cannot infer MT numbers by simple multiplication.
The 2 parallelism models look like they match the -T and -I options.
launch N threads however you prefer (ideally using a method consistent across other codecs) and keep libbsc single-threaded
This is our -T option and works for all compressors except those using global variables.
For intra-block parallelism, treat the input as a single block, set the thread cap with omp_set_num_threads(numThreads) and enable multi-threading by passing LIBBSC_FEATURE_MULTITHREADING to the library.
Implemented at https://github.com/inikep/lzbench/pull/239
So 'block-level parallelism' == the `-T' option in lzbench, and `-I' would therefore be 'intra-block parallelism'. All that is needed is the ability to pass the number of threads from `-I' to libbsc, so that it is clear how many threads the user wants to use. It would be good if libbsc respected that and used it, but that's not absolutely necessary.
I created https://github.com/iczelia/bzip3/issues/171