seqkit
Benchmark xz and zstd performance
FYI, I wanted to benchmark `xz` and `zstd` performance for `dsh-bio`, which uses BioJava FASTQ parsing circa 2003 and compression provided by Apache Commons Compress with default settings. `dsh-bio` fully parses and validates all FASTQ records, so I used similar settings for `seqkit`.
Shell script in this Gist.
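For readers without access to the Gist, here is a minimal sketch of how one such measurement could be taken. This is not the script from the Gist: `bench` is a hypothetical helper, and it assumes the command under test writes its result to stdout. It records whole seconds only, whereas the tables below were produced with the shell's `time`.

```shell
#!/bin/sh
# Hypothetical helper (not the Gist script): run a command with its stdout
# redirected to a file, then print "<elapsed whole seconds> <output bytes>".
bench() (
  out=$1
  shift
  start=$(date +%s)
  "$@" > "$out" || exit 1
  end=$(date +%s)
  # The arithmetic expansion strips the padding some wc implementations add.
  bytes=$(( $(wc -c < "$out") ))
  echo "$(( end - start )) $bytes"
)

# Example, with gzip standing in for any compressor that can write to stdout:
# bench dataset_C.fq.gz gzip --stdout dataset_C.fq
```

Running each compressor through the same wrapper keeps the timing and size bookkeeping identical across tools.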
Compression results:
Command | Real time | Time (sec) | Disk usage |
---|---|---|---|
xz --compress --stdout -0 dataset_C.fq > dataset_C.0.fq.xz | 1m24.844s | 84 | 528M |
xz --compress --stdout dataset_C.fq > dataset_C.default.fq.xz | 20m36.807s | 1236 | 416M |
xz --compress --stdout -9 dataset_C.fq > dataset_C.9.fq.xz | 31m19.497s | 1879 | 384M |
xz --compress --stdout --extreme dataset_C.fq > dataset_C.extreme.fq.xz | 26m40.244s | 1600 | 400M |
zstd -1 -k dataset_C.fq -o dataset_C.1.fq.zst | 0m5.379s | 5 | 565M |
zstd -k dataset_C.fq -o dataset_C.default.fq.zst | 0m6.863s | 6 | 541M |
zstd -6 -k dataset_C.fq -o dataset_C.6.fq.zst | 0m28.487s | 28 | 512M |
zstd -19 -k dataset_C.fq -o dataset_C.19.fq.zst | 20m7.238s | 1207 | 416M |
dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.xz | 31m23.348s | 1883 | 400M |
dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.gz | 3m14.372s | 194 | 512M |
dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.bgz | 1m22.520s | 82 | 544M |
dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.bzip2 | 0m15.845s | 15 | 2.1G |
dsh-bio compress-fastq -i dataset_C.fq -o dataset_C.dsh-bio.fq.zst | 0m23.295s | 23 | 528M |
seqkit seq --validate-seq --line-width 0 dataset_C.fq --out-file dataset_C.seqkit.fq.xz | 2m26.924s | 146 | 512M |
seqkit seq --validate-seq --line-width 0 dataset_C.fq --out-file dataset_C.seqkit.fq.zst | 0m8.624s | 8 | 528M |
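The "Time (sec)" column is the "Real time" value with the fractional seconds dropped. A small sketch of that conversion, assuming the `XmY.ZZZs` format printed by the shell's `time` with no zero-padded fields (`to_seconds` is a hypothetical helper name):

```shell
#!/bin/sh
# Convert a `time` value such as 1m24.844s to whole seconds,
# truncating the fractional part as the table does.
# Assumes minutes and seconds are not zero-padded.
to_seconds() (
  t=${1%s}          # strip the trailing "s"
  min=${t%%m*}      # text before "m"
  sec=${t#*m}       # text after "m"
  sec=${sec%%.*}    # drop fractional seconds
  echo $(( $min * 60 + $sec ))
)

to_seconds 1m24.844s    # xz -0
to_seconds 20m36.807s   # xz default
```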
File info:
$ xz --list *.xz
Strms  Blocks   Compressed  Uncompressed  Ratio  Check  Filename
    1       1    524.4 MiB   2,212.5 MiB  0.237  CRC64  dataset_C.0.fq.xz
    1       1    377.2 MiB   2,212.5 MiB  0.171  CRC64  dataset_C.9.fq.xz
    1       1    407.1 MiB   2,212.5 MiB  0.184  CRC64  dataset_C.default.fq.xz
    1       1    392.2 MiB   2,195.0 MiB  0.179  CRC64  dataset_C.dsh-bio.fq.xz
    1       1    396.9 MiB   2,212.5 MiB  0.179  CRC64  dataset_C.extreme.fq.xz
    1       1    500.9 MiB   2,195.0 MiB  0.228  CRC64  dataset_C.seqkit.fq.xz
-------------------------------------------------------------------------------
    6       6  2,598.7 MiB      12.9 GiB  0.196  CRC64  6 files
$ zstd -l *.zst
Frames  Skips  Compressed  Uncompressed  Ratio  Check  Filename
     1      0     561 MiB      2.16 GiB  3.943  XXH64  dataset_C.1.fq.zst
     1      0     415 MiB      2.16 GiB  5.335  XXH64  dataset_C.19.fq.zst
     1      0     502 MiB      2.16 GiB  4.406  XXH64  dataset_C.6.fq.zst
     1      0     539 MiB      2.16 GiB  4.108  XXH64  dataset_C.default.fq.zst
     1      0     521 MiB                       None   dataset_C.dsh-bio.fq.zst
     1      0     527 MiB                       XXH64  dataset_C.seqkit.fq.zst
-----------------------------------------------------------------
     6      0    2.99 GiB                              6 files
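Note that the two listings report the ratio in opposite directions: in the `xz --list` output it is compressed / uncompressed (524.4 / 2,212.5 ≈ 0.237), while in the `zstd -l` output it is uncompressed / compressed (2.16 GiB / 561 MiB ≈ 3.943). To compare them, invert one of the two. A minimal sketch using `awk` (`invert_ratio` is a hypothetical helper name):

```shell
#!/bin/sh
# Convert a ratio between the xz convention (compressed/uncompressed)
# and the zstd convention (uncompressed/compressed) by taking 1/r.
invert_ratio() {
  awk -v r="$1" 'BEGIN { printf "%.3f\n", 1 / r }'
}

invert_ratio 0.237   # dataset_C.0.fq.xz in zstd's convention
invert_ratio 0.171   # dataset_C.9.fq.xz in zstd's convention
```

On that common scale, `xz -9` comes out at roughly 5.85, against 5.335 for `zstd -19`.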
Decompression results:
Command | Real time | Time (sec) |
---|---|---|
dsh-bio compress-fastq -i dataset_C.dsh-bio.fq.xz | 0m54.403s | 54 |
dsh-bio compress-fastq -i dataset_C.dsh-bio.fq.zst | 0m14.891s | 14 |
dsh-bio compress-fastq -i dataset_C.seqkit.fq.xz | 0m55.885s | 55 |
dsh-bio compress-fastq -i dataset_C.seqkit.fq.zst | 0m16.153s | 16 |
seqkit seq --validate-seq --line-width 0 dataset_C.dsh-bio.fq.xz | 0m55.961s | 55 |
seqkit seq --validate-seq --line-width 0 dataset_C.dsh-bio.fq.zst | 0m2.951s | 2 |
seqkit seq --validate-seq --line-width 0 dataset_C.seqkit.fq.xz | 1m6.193s | 66 |
seqkit seq --validate-seq --line-width 0 dataset_C.seqkit.fq.zst | 0m3.097s | 3 |
TL;DR: `seqkit` is fast, and prefer zstd 😉
That looks good! The speed is thanks to these great packages:
- https://github.com/ulikunitz/xz
- https://github.com/klauspost/compress/tree/master/zstd
@heuermh Can you try `xz --best -T0` and see how fast it is?
I just noticed that the pgzip package sets its default gzip compression level to 5 rather than 6 (the default of `gzip`), so it should be faster.
I've added a global flag (`--compress-level`) to set the compression level for gzip, zstd, and bzip2. #320
Thank you! Closing this issue as complete.