I added stream & multi-thread support for libdeflate, need help!

Open sisong opened this issue 1 year ago • 1 comments

Thank you for sharing libdeflate, it's great!
My project needs to run on phones, so I added some APIs to libdeflate to support stream-based compression & decompression (ref #19) and multi-threaded compression (ref #40),
while trying to keep it simple and fast.
With these modifications in stream_mt, I rewrote gzip.c as pgzip.c to test streaming and multi-threaded parallel compression.

It runs fine when compression_level<=9, but gets a bad compression ratio when compression_level>=10, because I don't know how to rebuild the hash dictionary for bt_matchfinder.
I added a function bt_matchfinder_skip_bytes(), but it simply calls bt_matchfinder_skip_byte() in a loop, so it fails.
I need some help: how should bt_matchfinder_skip_bytes() be implemented? It should be similar to ht_matchfinder_skip_bytes() or hc_matchfinder_skip_bytes().
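
For context on why the dictionary matters for the ratio, here is a minimal sketch of chunked parallel deflate in which each chunk is primed with the 32 KiB that precede it. It is not the stream_mt or libdeflate code: it uses Python's standard zlib module, and the names compress_chunk, parallel_deflate, and the use_dict switch are hypothetical, chosen only for illustration.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

WINDOW = 32 * 1024  # deflate's maximum match distance


def compress_chunk(data, start, end, level, is_last, use_dict):
    """Compress data[start:end] as one fragment of a single raw deflate stream."""
    kwargs = {}
    if use_dict and start > 0:
        # Prime the match finder with the bytes that precede this chunk.
        kwargs["zdict"] = data[max(0, start - WINDOW):start]
    comp = zlib.compressobj(level, zlib.DEFLATED, -15, **kwargs)
    body = comp.compress(data[start:end])
    # Z_SYNC_FLUSH ends on a byte boundary without setting the final-block bit,
    # so fragments can be concatenated; only the last fragment uses Z_FINISH.
    tail = comp.flush(zlib.Z_FINISH if is_last else zlib.Z_SYNC_FLUSH)
    return body + tail


def parallel_deflate(data, chunk_size=256 * 1024, level=6, threads=4, use_dict=True):
    """Return one raw deflate stream built from independently compressed chunks."""
    starts = list(range(0, len(data), chunk_size)) or [0]

    def job(start):
        end = min(start + chunk_size, len(data))
        return compress_chunk(data, start, end, level, end >= len(data), use_dict)

    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order, so the fragments join in sequence.
        return b"".join(pool.map(job, starts))
```

The output can be checked with zlib.decompress(stream, -15). Calling the function with use_dict=False shows the effect described above: every chunk then starts with an empty match history, which usually costs compression ratio.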

Current progress; the files used for compression testing:

| # | file name | file original size |
|---|---|---|
| 1 | Chrome_107.0.5304.122-x64-Stable.win.tar | 278658560 |
| 2 | Emacs_28.2-universal.mac.tar | 196380160 |
| 3 | gcc_12.2.0.src.tar | 865884160 |
| 4 | jdk_x64_mac_openj9_16.0.1_9_openj9-0.26.0.tar | 363765760 |
| 5 | linux_5.19.9.src.tar | 1269637120 |

Test PC: Windows 11, CPU R9-7945HX, SSD PCIe4.0x4 4T, DDR5 5200MHz 32Gx2
Program versions: zlib v1.2.13, gzip in libdeflate v1.19, pgzip in stream_mt based on libdeflate v1.19
Only deflate compression & decompression are tested, no CRC; built with vc2022; the measured time includes reading & writing the file data; -p-16 means the compressor runs with 16 threads.

| Program | C ratio | C ave. mem | C ave. speed | D ave. mem | D max mem | D ave. speed |
|---|---|---|---|---|---|---|
| zlib-1 | 29.981% | 2M | 163MB/s | 1M | 1M | 531MB/s |
| zlib-2 | 29.077% | 2M | 151MB/s | 1M | 1M | 548MB/s |
| zlib-3 | 28.402% | 2M | 124MB/s | 1M | 1M | 558MB/s |
| zlib-4 | 27.147% | 2M | 108MB/s | 1M | 1M | 556MB/s |
| zlib-5 | 26.442% | 2M | 84MB/s | 1M | 1M | 570MB/s |
| zlib-6 | 26.077% | 2M | 58MB/s | 1M | 1M | 574MB/s |
| zlib-7 | 25.972% | 2M | 48MB/s | 1M | 1M | 576MB/s |
| zlib-8 | 25.879% | 2M | 32MB/s | 1M | 1M | 589MB/s |
| zlib-9 | 25.852% | 2M | 27MB/s | 1M | 1M | 589MB/s |
| gzip -1 | 28.325% | 571M | 342MB/s | 569M | 1214M | 692MB/s |
| gzip -2 | 27.465% | 571M | 254MB/s | 569M | 1214M | 703MB/s |
| gzip -3 | 27.030% | 571M | 234MB/s | 569M | 1214M | 714MB/s |
| gzip -4 | 26.740% | 571M | 217MB/s | 569M | 1214M | 708MB/s |
| gzip -5 | 26.390% | 571M | 193MB/s | 569M | 1214M | 719MB/s |
| gzip -6 | 26.096% | 571M | 156MB/s | 569M | 1214M | 723MB/s |
| gzip -7 | 25.956% | 571M | 113MB/s | 569M | 1214M | 718MB/s |
| gzip -8 | 25.861% | 571M | 69MB/s | 569M | 1214M | 721MB/s |
| gzip -9 | 25.847% | 571M | 55MB/s | 569M | 1214M | 722MB/s |
| pgzip -1 -p-1 | 28.325% | 5M | 380MB/s | 33M | 33M | 999MB/s |
| pgzip -2 -p-1 | 27.466% | 5M | 274MB/s | 33M | 33M | 1009MB/s |
| pgzip -3 -p-1 | 27.030% | 5M | 252MB/s | 33M | 33M | 1022MB/s |
| pgzip -4 -p-1 | 26.740% | 5M | 233MB/s | 33M | 33M | 1026MB/s |
| pgzip -5 -p-1 | 26.390% | 5M | 201MB/s | 33M | 33M | 1028MB/s |
| pgzip -6 -p-1 | 26.096% | 5M | 161MB/s | 33M | 33M | 1032MB/s |
| pgzip -7 -p-1 | 25.956% | 5M | 115MB/s | 33M | 33M | 1040MB/s |
| pgzip -8 -p-1 | 25.861% | 5M | 71MB/s | 33M | 33M | 1024MB/s |
| pgzip -9 -p-1 | 25.846% | 5M | 56MB/s | 33M | 33M | 1034MB/s |
| pgzip -1 -p-4 | 28.326% | 26M | 1415MB/s | 33M | 33M | 999MB/s |
| pgzip -2 -p-4 | 27.466% | 28M | 1045MB/s | 33M | 33M | 1011MB/s |
| pgzip -3 -p-4 | 27.030% | 28M | 948MB/s | 33M | 33M | 1011MB/s |
| pgzip -4 -p-4 | 26.740% | 28M | 878MB/s | 33M | 33M | 1023MB/s |
| pgzip -5 -p-4 | 26.390% | 28M | 763MB/s | 33M | 33M | 1033MB/s |
| pgzip -6 -p-4 | 26.097% | 28M | 611MB/s | 33M | 33M | 1034MB/s |
| pgzip -7 -p-4 | 25.956% | 28M | 442MB/s | 33M | 33M | 1041MB/s |
| pgzip -8 -p-4 | 25.861% | 28M | 272MB/s | 33M | 33M | 1012MB/s |
| pgzip -9 -p-4 | 25.847% | 28M | 216MB/s | 33M | 33M | 1010MB/s |
| pgzip -1 -p-16 | 28.326% | 101M | 3833MB/s | 33M | 33M | 968MB/s |
| pgzip -2 -p-16 | 27.466% | 108M | 2995MB/s | 33M | 33M | 977MB/s |
| pgzip -3 -p-16 | 27.030% | 108M | 2859MB/s | 33M | 33M | 984MB/s |
| pgzip -4 -p-16 | 26.740% | 108M | 2646MB/s | 33M | 33M | 978MB/s |
| pgzip -5 -p-16 | 26.390% | 108M | 2344MB/s | 33M | 33M | 999MB/s |
| pgzip -6 -p-16 | 26.097% | 108M | 2005MB/s | 33M | 33M | 1006MB/s |
| pgzip -7 -p-16 | 25.956% | 108M | 1480MB/s | 33M | 33M | 1006MB/s |
| pgzip -8 -p-16 | 25.861% | 108M | 899MB/s | 33M | 33M | 992MB/s |
| pgzip -9 -p-16 | 25.847% | 108M | 696MB/s | 33M | 33M | 993MB/s |
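
One way to read the scaling in the table above is as speedup and parallel efficiency relative to the single-thread pgzip run. The short sketch below does that arithmetic for three levels; the speeds are copied from the table, while the helper names speedup and efficiency are only illustrative.

```python
def speedup(parallel_mb_s, single_mb_s):
    """How many times faster the multi-threaded run is than the 1-thread run."""
    return parallel_mb_s / single_mb_s


def efficiency(parallel_mb_s, single_mb_s, threads):
    """Speedup divided by thread count; 1.0 would be perfect scaling."""
    return speedup(parallel_mb_s, single_mb_s) / threads


# pgzip compression speeds in MB/s from the table: {level: {threads: speed}}
speeds = {
    1: {1: 380, 4: 1415, 16: 3833},
    6: {1: 161, 4: 611, 16: 2005},
    9: {1: 56, 4: 216, 16: 696},
}

for level, by_threads in speeds.items():
    base = by_threads[1]
    for threads in (4, 16):
        s = speedup(by_threads[threads], base)
        e = efficiency(by_threads[threads], base, threads)
        print(f"level {level}, {threads} threads: speedup {s:.2f}x, efficiency {e:.0%}")
```

Because the measured time includes file reads and writes, the efficiency at 16 threads reflects I/O as well as the compressor itself.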

sisong avatar Jan 03 '24 05:01 sisong

Now I support compression_level>=10 ~~, but I haven't implemented the bt_matchfinder_skip_bytes() I wanted;
it only reuses your block-compression code to compress the dictionary data as the first block, without outputting its deflate stream;
so this compression implementation will be a little slower.~~

Added test results:

| Program | C ratio | C ave. mem | C ave. speed | D ave. mem | D max mem | D ave. speed |
|---|---|---|---|---|---|---|
| gzip -10 | 25.073% | 579M | 18.0MB/s | 569M | 1214M | 705MB/s |
| gzip -11 | 24.985% | 579M | 10.2MB/s | 569M | 1214M | 710MB/s |
| gzip -12 | 24.960% | 579M | 7.3MB/s | 569M | 1214M | 710MB/s |
| pgzip -10 -p-1 | 25.075% | 13M | 17.6MB/s | 33M | 33M | 1001MB/s |
| pgzip -11 -p-1 | 24.987% | 13M | 10.1MB/s | 33M | 33M | 987MB/s |
| pgzip -12 -p-1 | 24.962% | 13M | 7.2MB/s | 33M | 33M | 977MB/s |
| pgzip -10 -p-4 | 25.075% | 60M | 67.7MB/s | 33M | 33M | 962MB/s |
| pgzip -11 -p-4 | 24.987% | 60M | 38.5MB/s | 33M | 33M | 952MB/s |
| pgzip -12 -p-4 | 24.962% | 60M | 27.7MB/s | 33M | 33M | 936MB/s |
| pgzip -10 -p-16 | 25.075% | 236M | 220.4MB/s | 33M | 33M | 931MB/s |
| pgzip -11 -p-16 | 24.987% | 236M | 125.6MB/s | 33M | 33M | 931MB/s |
| pgzip -12 -p-16 | 24.962% | 236M | 90.4MB/s | 33M | 33M | 932MB/s |

sisong avatar Jan 06 '24 13:01 sisong

@sisong How do I compile your pgzip binary? I don't see it exposed in the CMake config

ghuls avatar Mar 21 '24 10:03 ghuls

libdeflate-pgzip streaming decompression beats igzip (from ISA-L) in decompression speed (the latter is/was the fastest gzip streaming decompressor that I know of). Quite impressive!

$ /software/libdeflate_streaming/build/programs/libdeflate-pgzip -d -c fragments.tsv.gz > /dev/null

Time output:
------------

  * Command: /software/libdeflate_streaming/build/programs/libdeflate-pgzip -d -c fragments.tsv.gz
  * Elapsed wall time: 0:05.60 = 5.60 seconds
  * Elapsed CPU time:
     - User: 5.34
     - Sys: 0.24
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 5
     - Involuntarily (time slice expired): 27
  * Maximum resident set size (RSS: memory) (kiB): 29284
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0


$ timeit /software/libdeflate_streaming/build/programs/libdeflate-gzip -d -c fragments.tsv.gz > /dev/null

Time output:
------------

  * Command: /software/libdeflate_streaming/build/programs/libdeflate-gzip -d -c fragments.tsv.gz
  * Elapsed wall time: 0:12.22 = 12.22 seconds
  * Elapsed CPU time:
     - User: 10.81
     - Sys: 1.38
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 3
     - Involuntarily (time slice expired): 77
  * Maximum resident set size (RSS: memory) (kiB): 5744956
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0


$ timeit /software/isa-l/programs/igzip -d -c fragments.tsv.gz > /dev/null

Time output:
------------

  * Command: /software/isa-l/programs/igzip -d -c fragments.tsv.gz
  * Elapsed wall time: 0:07.32 = 7.32 seconds
  * Elapsed CPU time:
     - User: 7.10
     - Sys: 0.20
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 78
     - Involuntarily (time slice expired): 62
  * Maximum resident set size (RSS: memory) (kiB): 4176
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 888
     - # of outputs: 0
  * Exit status: 0

One thing that doesn't work is decompressing gzip files with multiple gzip headers: only the first gzip stream is decompressed. bgzip from HTSlib produces block-gzipped files like this, and it is used a lot in bioinformatics to compress files.

echo "1" | gzip > multiple_gzip_header.gz
echo "2" | gzip >> multiple_gzip_header.gz
echo "3" | gzip >> multiple_gzip_header.gz

$ /software/libdeflate_streaming/build/programs/libdeflate-pgzip -cd multiple_gzip_header.gz
1

# Standard gzip does not have problems with those files.
$ zcat multiple_gzip_header.gz
1
2
3


# igzip of ISA-L works too.
$ /software/isa-l/programs/igzip -cd multiple_gzip_header.gz
1
2
3
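
To make the expected behaviour concrete, here is a minimal sketch of multi-member gzip decompression using Python's standard zlib module. It is not pgzip's code; decompress_all_members is a hypothetical helper name. The idea is that a decompressor stops at the end of one member, so the caller has to restart on whatever bytes remain.

```python
import gzip
import zlib


def decompress_all_members(data):
    """Decompress every gzip member in `data` and return the joined output."""
    out = []
    while data:
        d = zlib.decompressobj(wbits=31)  # 31 = expect a gzip header and trailer
        out.append(d.decompress(data))
        out.append(d.flush())
        if not d.eof:
            raise ValueError("truncated gzip member")
        data = d.unused_data  # bytes that follow this member's trailer, if any
    return b"".join(out)


if __name__ == "__main__":
    # Same layout as the shell example: three members appended to one file.
    blob = gzip.compress(b"1\n") + gzip.compress(b"2\n") + gzip.compress(b"3\n")
    print(decompress_all_members(blob).decode(), end="")
```

A single decompressobj call without the loop returns only the first member's data, which matches the pgzip output shown above.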

ghuls avatar Mar 21 '24 14:03 ghuls

@ghuls
I couldn't find "libdeflate-pgzip streaming decompression";
stream-mt is based on libdeflate, adds support for stream-based compression & decompression, and adds a base API for parallel compression.

> How do I compile your pgzip binary? I don't see it exposed in the CMake config

I haven't used CMake.
Maybe you can try modifying the libdeflate\programs\CMakeLists.txt file, replacing the gzip.c source file with pgzip.c and adding the new source files:
libdeflate\programs\gzip_compress_by_stream_mt.cpp
libdeflate\programs\gzip_decompress_by_stream_mt.cpp

sisong avatar Mar 21 '24 23:03 sisong

stream-mt has now updated its libdeflate base to v1.20,
and the stream decompressor is optimized: memory requests are greatly reduced, and pgzip decompression supports multi-threading (I/O only).
Some new benchmark results:
Note: C ratio=average(gzfile/srcfile)

| Program | C ratio | C ave. mem | C ave. speed | D ave. mem | D max mem | D ave. speed |
|---|---|---|---|---|---|---|
| filecopy | 100.000% | 1.0M | 2954.7MB/s | 1.0M | 1.0M | 2862MB/s |
| zlib-1 | 38.628% | 1.4M | 157.9MB/s | 1.1M | 1.1M | 565MB/s |
| zlib-6 | 35.112% | 1.3M | 56.9MB/s | 1.1M | 1.1M | 618MB/s |
| zlib-9 | 34.923% | 1.3M | 26.7MB/s | 1.1M | 1.1M | 634MB/s |
| gzip -1 | 37.558% | 570.9M | 331.3MB/s | 569.3M | 1214.2M | 763MB/s |
| gzip -6 | 34.819% | 571.3M | 151.9MB/s | 569.3M | 1214.2M | 780MB/s |
| gzip -9 | 34.589% | 571.3M | 55.0MB/s | 569.3M | 1214.2M | 771MB/s |
| gzip -12 | 33.863% | 579.3M | 7.3MB/s | 569.3M | 1214.2M | 786MB/s |
| pgzip -1 -p-1 | 37.558% | 4.9M | 346.1MB/s | 1.3M | 1.3M | 1110MB/s |
| pgzip -6 -p-1 | 34.820% | 5.3M | 156.4MB/s | 1.6M | 2.4M | 1122MB/s |
| pgzip -9 -p-1 | 34.589% | 5.3M | 55.8MB/s | 1.7M | 2.4M | 1132MB/s |
| pgzip -12 -p-1 | 33.865% | 13.3M | 6.5MB/s | 1.5M | 2.4M | 1156MB/s |
| pgzip -1 -p-4 | 37.558% | 25.9M | 1309.6MB/s | 2.4M | 2.4M | 1428MB/s |
| pgzip -6 -p-4 | 34.820% | 27.7M | 598.4MB/s | 2.6M | 3.0M | 1449MB/s |
| pgzip -9 -p-4 | 34.589% | 27.7M | 209.2MB/s | 2.7M | 3.0M | 1438MB/s |
| pgzip -12 -p-4 | 33.866% | 59.6M | 25.0MB/s | 2.5M | 3.0M | 1449MB/s |
| pgzip -1 -p-16 | 37.558% | 101.4M | 3854.0MB/s | 2.4M | 2.4M | 1459MB/s |
| pgzip -6 -p-16 | 34.820% | 108.5M | 2024.7MB/s | 2.6M | 3.0M | 1438MB/s |
| pgzip -9 -p-16 | 34.589% | 108.5M | 692.7MB/s | 2.7M | 3.0M | 1416MB/s |
| pgzip -12 -p-16 | 33.865% | 236.1M | 89.9MB/s | 2.5M | 3.0M | 1429MB/s |
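
The note "C ratio=average(gzfile/srcfile)" reads as a mean of per-file ratios, which is not the same as dividing total compressed size by total source size. The sketch below contrasts the two definitions; the sizes in it are hypothetical and are not measurements from this benchmark, and the function names are illustrative.

```python
def mean_of_ratios(pairs):
    """Average of per-file ratios: every file counts equally."""
    return sum(gz / src for src, gz in pairs) / len(pairs)


def ratio_of_totals(pairs):
    """Total compressed bytes over total source bytes: large files dominate."""
    return sum(gz for _, gz in pairs) / sum(src for src, _ in pairs)


# Hypothetical (source bytes, gzip bytes) pairs, for illustration only.
pairs = [(1000, 200), (100, 80)]

print(f"mean of ratios:  {mean_of_ratios(pairs):.3%}")
print(f"ratio of totals: {ratio_of_totals(pairs):.3%}")
```

With files of very different sizes and compressibility, the two numbers can differ a lot, so it matters which definition a table uses when comparing results.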

sisong avatar Jul 22 '24 05:07 sisong

Hello,

Can you add this zlib version too? It does not have stream functionality, but it is fast on ryzen4 7950x

https://dougallj.wordpress.com/2022/08/20/faster-zlib-deflate-decompression-on-the-apple-m1-and-x86/

osevan avatar Aug 05 '24 12:08 osevan

@osevan zlib-dougallj builds with the latest version of vc2022, and it is easy to swap in for the zlib library. Only one change is needed, at line 1267 of deflate.c:
int match_byte = __builtin_ctzl(xor) / 8;
becomes
unsigned long bi; _BitScanForward64(&bi, xor); int match_byte = bi>>3;

I ran zlib-dougallj through the same performance tests as the previous programs; the other programs' results were almost unchanged,
so only the zlib-dougallj results are listed below:
(In this test, zlib-dougallj's compression ratio and performance were unremarkable.)

| Program | C ratio | C ave. mem | C ave. speed | D ave. mem | D max mem | D ave. speed |
|---|---|---|---|---|---|---|
| dougallj-1 | 38.616% | 1.4M | 197.7MB/s | 1.1M | 1.1M | 569MB/s |
| dougallj-6 | 35.657% | 1.4M | 91.2MB/s | 1.1M | 1.1M | 622MB/s |
| dougallj-9 | 35.520% | 1.4M | 44.2MB/s | 1.1M | 1.1M | 634MB/s |

sisong avatar Aug 06 '24 04:08 sisong

Ok big thx

osevan avatar Aug 07 '24 10:08 osevan

And what is with rapidgzip?

osevan avatar Aug 07 '24 10:08 osevan

> And what is with rapidgzip?

@osevan rapidgzip mainly implements multi-threaded parallel decompression: a producer finds block boundary positions, with a certain error probability, and then hands the split data to a two-stage group of decompression threads. It uses a lot of resources. I'm not too interested in that, because decompression hasn't been much of a bottleneck.

sisong avatar Aug 08 '24 05:08 sisong

@sisong Could you share exactly how you compile and build pgzip?

ghuls avatar Aug 08 '24 12:08 ghuls

@ghuls
I use vc, Xcode, and makefiles to compile the programs. You can add the *.c files in the lib directory to your project and compile & link them together with pgzip.c, prog_util.c, tgetopt.c, gzip_compress_by_stream_mt.cpp, and gzip_decompress_by_stream_mt.cpp.
I'm not familiar with CMake; you can refer to https://github.com/ebiggers/libdeflate/issues/335#issuecomment-2014020744

sisong avatar Aug 08 '24 13:08 sisong

@ghuls

I have now submitted my build environment, including vc, Xcode, and a Makefile. Under Linux, you can download the source code and compile the pgzip executable like this:

git clone https://github.com/sisong/libdeflate.git  libdeflate
cd libdeflate/pgzip
make -j

You can also download the pgzip executable that I compiled directly: https://github.com/sisong/libdeflate/releases

sisong avatar Aug 11 '24 01:08 sisong

@sisong Thanks for adding the makefile. I managed to compile it successfully.

As mentioned earlier, pgzip does not support decompressing concatenated gzip files (like the BGZF format, commonly used in bioinformatics). It would be great if pgzip could support decompressing concatenated gzip files, as that would make pgzip very useful in pipelines as it supports streaming.

# Create file with multiple concatenated gzip archives.
$ echo "1" | gzip > multiple_gzip_header.gz
$ echo "2" | gzip >> multiple_gzip_header.gz
$ echo "3" | gzip >> multiple_gzip_header.gz

# gzip can decompress the full file.
$  zcat multiple_gzip_header.gz
1
2
3

# libdeflate gzip decompressed the full file.
$  ./build/programs/libdeflate-gzip -c -d multiple_gzip_header.gz 
1
2
3

# pgzip only decompresses the first gzip archive.
$ ./pgzip/pgzip -c -d multiple_gzip_header.gz 
1

ghuls avatar Aug 11 '24 22:08 ghuls

@ghuls pgzip now supports concatenated gzip files.

sisong avatar Aug 12 '24 02:08 sisong

@sisong Thanks! It now works for the BGZF compressed files I have tried so far.
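
For readers who have not met BGZF: each block is an ordinary gzip member whose header carries an extra subfield "BC" holding BSIZE (total block size minus 1), which is how a reader can hop from block to block. The sketch below walks blocks that way with Python's standard library. It is an illustration of the layout as I understand it from the SAM/BAM specification, not code from pgzip or HTSlib, and the function names are hypothetical.

```python
import struct
import zlib


def bgzf_block_sizes(data):
    """Yield (offset, total_size) for each BGZF block found in `data`."""
    pos = 0
    while pos < len(data):
        # Fixed 12-byte gzip header: ID1 ID2 CM FLG MTIME(4) XFL OS XLEN(2)
        id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack_from(
            "<BBBBIBBH", data, pos)
        if (id1, id2, cm) != (0x1F, 0x8B, 8) or not flg & 4:
            raise ValueError("not a BGZF block at offset %d" % pos)
        extra = data[pos + 12:pos + 12 + xlen]
        size = None
        i = 0
        while i + 4 <= len(extra):
            si1, si2, slen = struct.unpack_from("<BBH", extra, i)
            if (si1, si2, slen) == (66, 67, 2):  # 'B', 'C': the BSIZE subfield
                size = struct.unpack_from("<H", extra, i + 4)[0] + 1
            i += 4 + slen
        if size is None:
            raise ValueError("missing BC subfield at offset %d" % pos)
        yield pos, size
        pos += size


def bgzf_decompress_block(data, offset, size):
    """Inflate one block and verify the CRC32 and length stored in its trailer."""
    xlen = struct.unpack_from("<H", data, offset + 10)[0]
    payload = data[offset + 12 + xlen:offset + size - 8]
    crc, isize = struct.unpack_from("<II", data, offset + size - 8)
    out = zlib.decompress(payload, -15)  # the payload is raw deflate
    if zlib.crc32(out) != crc or len(out) != isize:
        raise ValueError("corrupt BGZF block at offset %d" % offset)
    return out
```

Since the block boundaries are known up front, blocks could in principle be handed to separate threads; whether pgzip does anything like that is not stated in this thread.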

Some timings:

# BGZF compressed file (concatenated gzipped files).
$ file bgzipped.fastq.gz
bgzipped.fastq.gz: gzip compressed data, extra field, original size 10822



# Decompression time with standard gzip.
module load gzip/1.12

$ timeit gzip -cd bgzipped.fastq.gz | wc -l

Time output:
------------

  * Command: gzip -cd bgzipped.fastq.gz
  * Elapsed wall time: 1:15.27 = 75.27 seconds
  * Elapsed CPU time:
     - User: 72.81
     - Sys: 2.29
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 9
     - Involuntarily (time slice expired): 417
  * Maximum resident set size (RSS: memory) (kiB): 1920
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 216
     - # of outputs: 0
  * Exit status: 0

384498832


# Decompression time with pigz with standard zlib.
module load pigz/2.7

$ timeit pigz -cd bgzipped.fastq.gz | wc -l

Time output:
------------

  * Command: pigz -cd -cd bgzipped.fastq.gz
  * Elapsed wall time: 0:44.13 = 44.13 seconds
  * Elapsed CPU time:
     - User: 39.82
     - Sys: 14.25
  * CPU usage: 122%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 1405471
     - Involuntarily (time slice expired): 3186
  * Maximum resident set size (RSS: memory) (kiB): 2384
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 304
     - # of outputs: 0
  * Exit status: 0

384498832


# Decompression time with pigz with zlib-ng.
module load pigz/2.7
module load zlib-ng/2.1.6

$ timeit pigz -cd bgzipped.fastq.gz | wc -l

Time output:
------------

  * Command: pigz -cd bgzipped.fastq.gz
  * Elapsed wall time: 0:30.60 = 30.60 seconds
  * Elapsed CPU time:
     - User: 24.79
     - Sys: 14.18
  * CPU usage: 127%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 1410511
     - Involuntarily (time slice expired): 2367
  * Maximum resident set size (RSS: memory) (kiB): 2452
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

384498832


# Decompression time with igzip of ISA-L.
module load ISA-L/2.30.0

$ timeit igzip -cd bgzipped.fastq.gz | wc -l

Time output:
------------

  * Command: igzip -cd bgzipped.fastq.gz
  * Elapsed wall time: 0:16.45 = 16.45 seconds
  * Elapsed CPU time:
     - User: 14.01
     - Sys: 2.39
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 25
     - Involuntarily (time slice expired): 111
  * Maximum resident set size (RSS: memory) (kiB): 3236
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

384498832

# Decompression time with pgzip.
$ timeit pgzip/pgzip -cd bgzipped.fastq.gz | wc -l

Time output:
------------

  * Command: pgzip/pgzip -cd bgzipped.fastq.gz
  * Elapsed wall time: 0:12.59 = 12.59 seconds
  * Elapsed CPU time:
     - User: 16.48
     - Sys: 18.51
  * CPU usage: 277%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 572318
     - Involuntarily (time slice expired): 137
  * Maximum resident set size (RSS: memory) (kiB): 4252
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

384498832

Decompressing with 2 threads seems to give the best performance (and lower CPU usage than the default 4):

$ timeit pgzip/pgzip -cd -p 1 bgzipped.fastq.gz | wc -l

Time output:
------------

  * Command: pgzip/pgzip -cd -p 1 bgzipped.fastq.gz
  * Elapsed wall time: 0:13.15 = 13.15 seconds
  * Elapsed CPU time:
     - User: 11.18
     - Sys: 1.94
  * CPU usage: 99%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 11
     - Involuntarily (time slice expired): 63
  * Maximum resident set size (RSS: memory) (kiB): 3092
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

384498832


$ timeit pgzip/pgzip -cd -p 2 bgzipped.fastq.gz | wc -l

Time output:
------------

  * Command: pgzip/pgzip -cd -p 2 bgzipped.fastq.gz
  * Elapsed wall time: 0:12.14 = 12.14 seconds
  * Elapsed CPU time:
     - User: 13.21
     - Sys: 8.40
  * CPU usage: 178%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 565213
     - Involuntarily (time slice expired): 63
  * Maximum resident set size (RSS: memory) (kiB): 3820
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

384498832

$ timeit pgzip/pgzip -cd -p 3 bgzipped.fastq.gz | wc -l

Time output:
------------

  * Command: pgzip/pgzip -cd -p 3 bgzipped.fastq.gz
  * Elapsed wall time: 0:12.58 = 12.58 seconds
  * Elapsed CPU time:
     - User: 16.30
     - Sys: 18.63
  * CPU usage: 277%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 573049
     - Involuntarily (time slice expired): 87
  * Maximum resident set size (RSS: memory) (kiB): 4288
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

384498832

$ timeit pgzip/pgzip -cd -p 4 bgzipped.fastq.gz | wc -l

Time output:
------------

  * Command: pgzip/pgzip -cd -p 4 bgzipped.fastq.gz
  * Elapsed wall time: 0:12.57 = 12.57 seconds
  * Elapsed CPU time:
     - User: 16.35
     - Sys: 18.62
  * CPU usage: 278%
  * Context switching:
     - Voluntarily (e.g.: waiting for I/O operation): 566209
     - Involuntarily (time slice expired): 80
  * Maximum resident set size (RSS: memory) (kiB): 4232
  * Number of times the process was swapped out of main memory: 0
  * Filesystem:
     - # of inputs: 0
     - # of outputs: 0
  * Exit status: 0

384498832
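
The CPU usage figures in these runs appear to be (user + sys) / wall time, truncated to a whole percent; that is an inference from the numbers, not something the timeit wrapper documents here. The sketch below redoes the arithmetic for the four -p runs, with values copied from the output above and an illustrative helper name.

```python
def cpu_usage_percent(user_s, sys_s, wall_s):
    """CPU usage as a whole percentage, truncated rather than rounded."""
    return int(100 * (user_s + sys_s) / wall_s)


# (threads, user seconds, sys seconds, wall seconds) from the pgzip -p runs
runs = [
    (1, 11.18, 1.94, 13.15),
    (2, 13.21, 8.40, 12.14),
    (3, 16.30, 18.63, 12.58),
    (4, 16.35, 18.62, 12.57),
]

for threads, user_s, sys_s, wall_s in runs:
    cpu_s = user_s + sys_s
    print(f"-p {threads}: wall {wall_s:.2f}s, CPU {cpu_s:.2f}s, "
          f"usage {cpu_usage_percent(user_s, sys_s, wall_s)}%")
```

Read this way, going from 2 to 4 threads adds about 13 CPU-seconds, mostly system time, without improving wall time.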

ghuls avatar Aug 12 '24 14:08 ghuls