Benchmark xz and zstd performance

Open heuermh opened this issue 2 years ago • 3 comments

FYI

I wanted to benchmark xz and zstd performance for dsh-bio, which uses BioJava FASTQ parsing circa 2003 and compression provided by Apache Commons Compress with default settings. dsh-bio fully parses and validates all FASTQ records, so I used similar settings for seqkit.

Shell script in this Gist.

Compression results:

| Command | Real time | Time (sec) | Disk usage |
|---|---|---|---|
| xz --compress --stdout -0 dataset_C.fq > dataset_C.0.fq.xz | 1m24.844s | 84 | 528M |
| xz --compress --stdout dataset_C.fq > dataset_C.default.fq.xz | 20m36.807s | 1236 | 416M |
| xz --compress --stdout -9 dataset_C.fq > dataset_C.9.fq.xz | 31m19.497s | 1879 | 384M |
| xz --compress --stdout --extreme dataset_C.fq > dataset_C.extreme.fq.xz | 26m40.244s | 1600 | 400M |
| zstd -1 -k dataset_C.fq -o dataset_C.1.fq.zst | 0m5.379s | 5 | 565M |
| zstd -k dataset_C.fq -o dataset_C.default.fq.zst | 0m6.863s | 6 | 541M |
| zstd -6 -k dataset_C.fq -o dataset_C.6.fq.zst | 0m28.487s | 28 | 512M |
| zstd -19 -k dataset_C.fq -o dataset_C.19.fq.zst | 20m7.238s | 1207 | 416M |
| dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.xz | 31m23.348s | 1883 | 400M |
| dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.gz | 3m14.372s | 194 | 512M |
| dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.bgz | 1m22.520s | 82 | 544M |
| dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.bzip2 | 0m15.845s | 15 | 2.1G |
| dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.zst | 0m23.295s | 23 | 528M |
| seqkit seq --validate-seq --line-width 0 dataset_C.fq --out-file dataset_C.seqkit.fq.xz | 2m26.924s | 146 | 512M |
| seqkit seq --validate-seq --line-width 0 dataset_C.fq --out-file dataset_C.seqkit.fq.zst | 0m8.624s | 8 | 528M |
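
The "Time (sec)" column appears to be the "Real time" value converted to whole seconds with the fraction dropped (for example, 1m24.844s becomes 84). A minimal sketch of that conversion follows; the function name `to_seconds` is hypothetical and is not taken from the Gist, and the truncation rule is an assumption inferred from the reported values.

```shell
# Convert a `time`-style "real" value such as 1m24.844s into whole seconds.
# Assumption: the fractional part is truncated, not rounded.
# The body runs in a subshell ( ... ) so its variables stay local.
to_seconds() (
    real="$1"
    minutes="${real%%m*}"     # text before the "m", e.g. 1
    rest="${real#*m}"         # text after the "m", e.g. 24.844s
    seconds="${rest%%.*}"     # whole seconds; drops ".844s"
    seconds="${seconds%s}"    # also handles values with no fraction, e.g. 5s
    echo $(( minutes * 60 + seconds ))
)

to_seconds 1m24.844s     # prints 84
to_seconds 20m36.807s    # prints 1236
```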

File info:

$ xz --list *.xz
Strms  Blocks   Compressed Uncompressed  Ratio  Check   Filename
    1       1    524.4 MiB  2,212.5 MiB  0.237  CRC64   dataset_C.0.fq.xz
    1       1    377.2 MiB  2,212.5 MiB  0.171  CRC64   dataset_C.9.fq.xz
    1       1    407.1 MiB  2,212.5 MiB  0.184  CRC64   dataset_C.default.fq.xz
    1       1    392.2 MiB  2,195.0 MiB  0.179  CRC64   dataset_C.dsh-bio.fq.xz
    1       1    396.9 MiB  2,212.5 MiB  0.179  CRC64   dataset_C.extreme.fq.xz
    1       1    500.9 MiB  2,195.0 MiB  0.228  CRC64   dataset_C.seqkit.fq.xz
-------------------------------------------------------------------------------
    6       6  2,598.7 MiB     12.9 GiB  0.196  CRC64   6 files

$ zstd -l *.zst
Frames  Skips  Compressed  Uncompressed  Ratio  Check  Filename
     1      0     561 MiB      2.16 GiB  3.943  XXH64  dataset_C.1.fq.zst
     1      0     415 MiB      2.16 GiB  5.335  XXH64  dataset_C.19.fq.zst
     1      0     502 MiB      2.16 GiB  4.406  XXH64  dataset_C.6.fq.zst
     1      0     539 MiB      2.16 GiB  4.108  XXH64  dataset_C.default.fq.zst
     1      0     521 MiB                        None  dataset_C.dsh-bio.fq.zst
     1      0     527 MiB                       XXH64  dataset_C.seqkit.fq.zst
-----------------------------------------------------------------
     6      0    2.99 GiB                              6 files
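
Note that the two listings use opposite conventions for "Ratio": in the `xz --list` output it is compressed size divided by uncompressed size (smaller is better), while in the `zstd -l` output it is uncompressed divided by compressed (larger is better). A small sketch for putting both on the same scale, assuming sizes are given in the same unit (MiB here); the function names `size_ratio` and `size_factor` are hypothetical helpers, not part of either tool.

```shell
# Print compressed/uncompressed, the convention seen in `xz --list`.
size_ratio() {
    awk -v c="$1" -v u="$2" 'BEGIN { printf "%.3f\n", c / u }'
}

# Print uncompressed/compressed, the convention seen in `zstd -l`.
size_factor() {
    awk -v c="$1" -v u="$2" 'BEGIN { printf "%.3f\n", u / c }'
}

# dataset_C.0.fq.xz: 524.4 MiB compressed, 2,212.5 MiB uncompressed
size_ratio 524.4 2212.5     # prints 0.237
size_factor 524.4 2212.5    # prints 4.219
```

Expressed this way, `xz -0` (about 4.219) can be compared directly with the 3.943 that zstd reports for `zstd -1`.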

Decompression results:

| Command | Real time | Time (sec) |
|---|---|---|
| dsh-bio compress-fastq -i dataset_C.dsh-bio.fq.xz | 0m54.403s | 54 |
| dsh-bio compress-fastq -i dataset_C.dsh-bio.fq.zst | 0m14.891s | 14 |
| dsh-bio compress-fastq -i dataset_C.seqkit.fq.xz | 0m55.885s | 55 |
| dsh-bio compress-fastq -i dataset_C.seqkit.fq.zst | 0m16.153s | 16 |
| seqkit seq --validate-seq --line-width 0 dataset_C.dsh-bio.fq.xz | 0m55.961s | 55 |
| seqkit seq --validate-seq --line-width 0 dataset_C.dsh-bio.fq.zst | 0m2.951s | 2 |
| seqkit seq --validate-seq --line-width 0 dataset_C.seqkit.fq.xz | 1m6.193s | 66 |
| seqkit seq --validate-seq --line-width 0 dataset_C.seqkit.fq.zst | 0m3.097s | 3 |

TL;DR

seqkit is fast; prefer zstd 😉

heuermh, Mar 31 '22 21:03

That looks good! The speed is thanks to these great packages:

  • https://github.com/ulikunitz/xz
  • https://github.com/klauspost/compress/tree/master/zstd

shenwei356, Apr 01 '22 06:04

@heuermh

Can you try xz --best -T0 and see how fast it is?

akhst7, Jul 21 '22 20:07

I just noticed that the pgzip package uses 5 as its default gzip compression level rather than 6 (the default of gzip itself); level 5 should be faster.

I've added a global flag (--compress-level) to set the compression level for gzip, zstd, and bzip2. #320

shenwei356, Mar 14 '23 14:03

Thank you! Closing this issue as complete.

heuermh, Apr 10 '24 17:04