
Requests for adding internal multithreading for compression libraries

Open inikep opened this issue 2 months ago • 11 comments

The new version of lzbench in the master branch introduces the -I# option, which enables compression using # internal (built-in) threads when supported. It is currently available only for the following compressors: fast-lzma2, kanzi, lzham, lzma, xz, and zstd.

Supported compressors must provide a memory-to-memory compression API that lets the caller specify the number of threads, e.g.:

int64_t lzbench_fastlzma2_compress(char *inbuf, size_t insize, char *outbuf, size_t outsize, codec_options_t *codec_options)
{
    size_t ret = FL2_compressMt(outbuf, outsize, inbuf, insize, codec_options->level, codec_options->threads);
    if (FL2_isError(ret)) return 0;
    return ret;
}

The results with 8 threads (-I8) on tarred Silesia Corpus:

lzbench 2.1.1 | GCC 14.2.0 | 64-bit Linux | AMD EPYC 9554 64-Core Processor                

Compressor      I_Threads Compress. Decompress. Compr. size  Ratio Filename
memcpy                  1 16265 MB/s 16259 MB/s   211947520 100.00 ../silesia.tar
fastlzma2 1.0.1 -1      8  34.1 MB/s   408 MB/s    59573883  28.11 ../silesia.tar
fastlzma2 1.0.1 -5      8  37.2 MB/s   569 MB/s    51327104  24.22 ../silesia.tar
fastlzma2 1.0.1 -10     8  24.0 MB/s   108 MB/s    48668818  22.96 ../silesia.tar
lzham 1.0 -d26 -1       8  3.71 MB/s   373 MB/s    54899526  25.90 ../silesia.tar
lzham 1.0 -d26 -4       8  2.43 MB/s   407 MB/s    51177148  24.15 ../silesia.tar
lzma 25.01 -4           8  37.7 MB/s   259 MB/s    55968602  26.41 ../silesia.tar
lzma 25.01 -9           8  4.46 MB/s   103 MB/s    48683275  22.97 ../silesia.tar
xz 5.8.1 -4             8  35.7 MB/s   142 MB/s    52458204  24.75 ../silesia.tar
xz 5.8.1 -9             8  3.11 MB/s   146 MB/s    48766532  23.01 ../silesia.tar
zstd 1.5.7 -9           8   384 MB/s  1154 MB/s    59179159  27.92 ../silesia.tar
zstd 1.5.7 -14          8  56.8 MB/s  1180 MB/s    57396270  27.08 ../silesia.tar
zstd 1.5.7 -18          8  19.1 MB/s  1070 MB/s    53313874  25.15 ../silesia.tar
zstd 1.5.7 -22          8  2.27 MB/s   999 MB/s    52288411  24.67 ../silesia.tar

kanzi 2.4.0 -5          8   125 MB/s   261 MB/s    54013491  25.48 ../silesia.tar
kanzi 2.4.0 -6          8  81.1 MB/s   150 MB/s    49517551  23.36 ../silesia.tar
kanzi 2.4.0 -7          8  55.2 MB/s  79.0 MB/s    47308156  22.32 ../silesia.tar
bsc 3.3.11 -m0 -e2      1  14.9 MB/s  16.8 MB/s    48743676  23.00 ../silesia.tar
bsc 3.3.11 -m0 -e1      1  20.0 MB/s  28.1 MB/s    49295208  23.26 ../silesia.tar
bsc 3.3.11 -m5 -e1      1  34.6 MB/s  15.3 MB/s    49609096  23.41 ../silesia.tar
bsc 3.3.11 -m6 -e1      1  33.3 MB/s  14.1 MB/s    49246610  23.24 ../silesia.tar
bzip3 1.5.2 -1          1  13.5 MB/s  16.4 MB/s    50325695  23.74 ../silesia.tar
bzip3 1.5.2 -3          1  13.6 MB/s  16.2 MB/s    48318294  22.80 ../silesia.tar
bzip3 1.5.2 -6          1  12.9 MB/s  13.1 MB/s    47360186  22.35 ../silesia.tar
bzip3 1.5.2 -9          1  11.7 MB/s  11.6 MB/s    48753972  23.00 ../silesia.tar

inikep avatar Oct 13 '25 07:10 inikep

@IlyaGrebnov Currently bsc supports MT with compression_omp(). How hard would it be to add, for example:

int bsc_compress_mt(const unsigned char * input, unsigned char * output, int n, int lzpHashSize, int lzpMinLen, int blockSorter, int coder, int features, int threads);

inikep avatar Oct 13 '25 07:10 inikep

@iczelia I see that MT in bzip3 is achieved using bz3_encode_blocks() in main.c. Please consider adding, for example:

BZIP3_API int bz3_compress_mt(u32 block_size, const u8 * const in, u8 * out, size_t in_size, size_t * out_size, int threads);

inikep avatar Oct 13 '25 07:10 inikep

Bzip3, like lbzip2, supports multithreading in the client. Nothing wrong with that. In the lzbench context it would be equivalent to the `-T' option, with no internal threading. I think that's a perfectly logical solution for BWT compressors, as they operate on blocks anyway. Plzip does exactly the same: it splits the input into blocks and then compresses them separately, with no internal threading in lzlib.

tansy avatar Oct 13 '25 10:10 tansy

@tansy. The problem shows up when one wants to compare bsc to other compressors. With -T, some ST-only compressors artificially show MT comp/decomp numbers, which is unfair because it does not represent real performance outside of the bench. With the -I option, bsc is currently ST-only, so it is unfair to compare it to other MT compressors that support -I.

flanglet avatar Oct 13 '25 14:10 flanglet

@IlyaGrebnov Currently bsc supports MT with compression_omp(). How hard would it be to add, for example:

int bsc_compress_mt(const unsigned char * input, unsigned char * output, int n, int lzpHashSize, int lzpMinLen, int blockSorter, int coder, int features, int threads);

It’s possible, but not trivial. In bsc I cap the thread count with omp_set_num_threads(numThreads).

As @flanglet noted, we need to compare apples to apples. bsc supports two parallelism models:

  • Block-level parallelism: compress many independent blocks in parallel; each block runs single-threaded. You control concurrency by the number of blocks in flight.
  • Intra-block parallelism: compress a single block using multiple threads; the output is identical to single-threaded compression, but it can use multiple cores. Only a few compressors support this mode, and even fewer scale efficiently.
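The block-level model above can be sketched roughly as follows. This is a minimal illustration, not code from bsc or lzbench: `compress_block_st` is a hypothetical placeholder codec that just copies its input, and `compress_blocks_parallel` is a made-up name. The OpenMP pragma is one possible way to run the blocks concurrently; without `-fopenmp` it is ignored and the loop runs serially.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical single-threaded codec: a "store" placeholder that copies the
   block unchanged. A real block codec would be called here instead. */
size_t compress_block_st(const unsigned char *in, size_t n, unsigned char *out)
{
    memcpy(out, in, n);
    return n;
}

/* Block-level parallelism: every fixed-size block is compressed independently,
   and each block is handled by exactly one thread. For this placeholder codec
   `out` must hold at least `insize` bytes. Returns the total output size. */
size_t compress_blocks_parallel(const unsigned char *in, size_t insize,
                                unsigned char *out, size_t block_size,
                                int threads)
{
    assert(block_size > 0 && threads > 0);
    long nblocks = (long)((insize + block_size - 1) / block_size);
    size_t total = 0;

    /* Concurrency is bounded by the number of blocks in flight. */
    #pragma omp parallel for num_threads(threads) reduction(+:total)
    for (long b = 0; b < nblocks; b++) {
        size_t off = (size_t)b * block_size;
        size_t len = insize - off < block_size ? insize - off : block_size;
        /* Each block writes to its own output slot, so no locking is needed. */
        total += compress_block_st(in + off, len, out + off);
    }
    return total;
}
```

A real codec produces variable-size output, so each slot would have to be sized for the worst case and the results packed together afterwards; the sketch skips that step.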

IlyaGrebnov avatar Oct 13 '25 16:10 IlyaGrebnov

which is unfair because it does not represent real performance outside of the bench

@flanglet: Then you use single threaded and calculate multi threaded yourself. That's only fair.


Block-level parallelism: compress many independent blocks in parallel;

Is that equivalent to `-T'?

tansy avatar Oct 13 '25 16:10 tansy

Block-level parallelism: compress many independent blocks in parallel;

Is that equivalent to `-T'?

This is implemented in bsc.cpp (not in the libbsc library) and is controlled by the host. For block-level parallelism, launch N threads however you prefer (ideally using a method consistent across other codecs) and keep libbsc single-threaded by omitting the LIBBSC_FEATURE_MULTITHREADING flag. For intra-block parallelism, treat the input as a single block, set the thread cap with omp_set_num_threads(numThreads), and enable multithreading by passing LIBBSC_FEATURE_MULTITHREADING to the library.
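As a rough sketch of how a host might encode these two configurations, assuming hypothetical names throughout (`thread_plan_t`, `make_thread_plan`, and `FEATURE_MULTITHREADING`, which only stands in for LIBBSC_FEATURE_MULTITHREADING and is not the library's definition):

```c
#include <assert.h>

/* Hypothetical stand-in for libbsc's LIBBSC_FEATURE_MULTITHREADING flag. */
#define FEATURE_MULTITHREADING 0x2

typedef enum { BLOCK_LEVEL, INTRA_BLOCK } parallel_mode_t;

typedef struct {
    int host_threads;   /* threads launched by the host (benchmark) */
    int codec_threads;  /* cap handed to the library, e.g. omp_set_num_threads() */
    int features;       /* feature flags passed to the library */
} thread_plan_t;

/* Maps a requested thread count n onto one of the two parallelism models. */
thread_plan_t make_thread_plan(parallel_mode_t mode, int n, int base_features)
{
    assert(n >= 1);
    thread_plan_t plan;
    if (mode == BLOCK_LEVEL) {
        /* N host threads, library kept single-threaded. */
        plan.host_threads = n;
        plan.codec_threads = 1;
        plan.features = base_features & ~FEATURE_MULTITHREADING;
    } else {
        /* One block, library allowed to use up to N threads. */
        plan.host_threads = 1;
        plan.codec_threads = n;
        plan.features = base_features | FEATURE_MULTITHREADING;
    }
    return plan;
}
```

The host would then launch `host_threads` workers itself and pass `codec_threads` and `features` on to the library.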

IlyaGrebnov avatar Oct 13 '25 17:10 IlyaGrebnov

"Then you use single threaded and calculate multi threaded yourself".

I don't think it works, because 1) some compressors are ST for decompression and MT for compression, some are MT-MT, and others are ST-ST, which means that the user has to know how each compressor works, and 2) more importantly, not all MT compressors scale the same way, so one cannot infer MT numbers by simple multiplication.
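To illustrate point 2 with a simplified model (Amdahl's law, which is my assumption here and not something measured by lzbench): if only a fraction of a compressor's work runs in parallel, the speedup with N threads falls well short of N. The function name `amdahl_speedup` is made up for this sketch.

```c
#include <assert.h>

/* Amdahl's law: ideal speedup when only `parallel_fraction` of the work
   (between 0.0 and 1.0) can be spread across `threads` threads. */
double amdahl_speedup(double parallel_fraction, int threads)
{
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(threads >= 1);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / threads);
}
```

With 8 threads, a codec that is 90% parallel would get about 4.7x under this model, while one that is 50% parallel would get about 1.8x, so two compressors with the same ST speed can end up with very different MT numbers.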

The 2 parallelism models look like they match the -T and -I options.

flanglet avatar Oct 13 '25 19:10 flanglet

launch N threads however you prefer (ideally using a method consistent across other codecs) and keep libbsc single-threaded

This is our -T option, and it works for all compressors except those using global variables.

For intra-block parallelism, treat the input as a single block, set the thread cap with omp_set_num_threads(numThreads) and enable multi-threading by passing LIBBSC_FEATURE_MULTITHREADING to the library.

Implemented at https://github.com/inikep/lzbench/pull/239

inikep avatar Oct 13 '25 19:10 inikep

So, 'block-level parallelism' == the `-T' option in lzbench. Therefore `-I' would be 'intra-block parallelism'. All that is needed is the ability to pass the number of threads from `-I' to libbsc, so that it is clear how many threads the user wants to use. It would be good if libbsc respected that and used it, but that's not absolutely necessary.

tansy avatar Oct 14 '25 08:10 tansy

I created https://github.com/iczelia/bzip3/issues/171

inikep avatar Nov 12 '25 11:11 inikep