lzdatagen
LZ data generator
About
It is sometimes useful to generate data that resembles real data for testing or benchmarking purposes. For instance, it may be impractical to distribute large data sets with an application.
lzdatagen generates data suitable for dictionary compression techniques.
Usage
lzdatagen comes with an example application lzdgen that provides a command-line interface for generating data:
usage: lzdgen [options] OUTFILE
Generate compressible data for testing purposes.
options:
  -b, --bulk             use faster, less precise method
  -f, --force            overwrite output file
  -h, --help             print this help and exit
  -l, --literal-exp EXP  literal distribution exponent [3.0]
  -m, --match-exp EXP    match length distribution exponent [3.0]
  -o, --output OUTFILE   write output to OUTFILE
  -r, --ratio RATIO      compression ratio target [3.0]
  -S, --seed SEED        use 64-bit SEED to seed PRNG
  -s, --size SIZE        size with opt. k/m/g suffix [1m]
  -V, --version          print version and exit
  -v, --verbose          verbose mode
If OUTFILE is `-', write to standard output.
Examples
Generate 1 MiB of data that should compress to roughly a quarter of its size (ratio 4.0):
lzdgen -r 4.0 foo.bin
Generate 1 MiB of data that is compressible by entropy coding, but contains no LZ repetitions:
lzdgen -r 1.0 foo.bin
Generate 1 GiB of data, piped to zstd:
lzdgen -s 1g - | zstd -o foo.zstd
Details
Data is generated by inserting sequences of either random bytes or repetitions from a buffer of bytes, depending on the ratio parameter. This is based on the paper "SDGen: Mimicking Datasets for Content Generation in Storage Benchmarks" by Raúl Gracia-Tinedo et al.
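The idea can be illustrated with a minimal Python sketch. This is not lzdatagen's actual code: the function name generate, the buffer size, the run lengths, and the way the ratio is turned into a match probability are all assumptions made for illustration.

```python
import random

def generate(size, ratio, seed=0, buffer_size=4096, max_match=64):
    """Return `size` bytes mixing random literal runs with copies from a buffer.

    Hypothetical sketch; parameter names and defaults are illustrative only.
    """
    rng = random.Random(seed)
    # Assumed: a fixed buffer of random bytes that repetitions are copied from.
    buffer = bytes(rng.randrange(256) for _ in range(buffer_size))
    # Assumed mapping from ratio to match probability: if literals are
    # incompressible and matches cost almost nothing, then
    # size / literal_bytes is about `ratio`, so 1 - 1/ratio of the bytes
    # should come from matches.
    match_fraction = max(0.0, 1.0 - 1.0 / ratio)
    out = bytearray()
    while len(out) < size:
        # Same run length distribution for both kinds of sequence.
        run = min(rng.randint(4, max_match), size - len(out))
        if rng.random() < match_fraction:
            # Repetition: copy `run` bytes from a random buffer position.
            start = rng.randrange(buffer_size - run + 1)
            out += buffer[start:start + run]
        else:
            # Literals: `run` fresh random bytes.
            out += bytes(rng.randrange(256) for _ in range(run))
    return bytes(out)
```

With a ratio of 1.0 the sketch emits only literal runs, which mirrors the second example above. It draws literals uniformly, so unlike lzdatagen its literals are not compressible by entropy coding.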
Instead of sampling actual data, lzdatagen uses a simple power function to determine the distributions of literal values and match lengths. The exponents used can be set using the --literal-exp and --match-exp options.
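One common way to get a power-function distribution is to raise a uniform random number to the chosen exponent and scale the result. The Python sketch below shows that approach; the helper power_sample is hypothetical, and the exact formula lzdatagen uses may differ.

```python
import random

def power_sample(rng, n, exponent):
    """Return an index in [0, n), skewed toward 0 for exponents above 1.

    Hypothetical helper: u ** exponent pushes a uniform u toward 0,
    so low indices are drawn far more often than high ones.
    """
    u = rng.random()  # uniform in [0, 1)
    # min() guards against floating-point rounding up to n.
    return min(n - 1, int(n * u ** exponent))

rng = random.Random(1)
# Literal values: 256 possible bytes, exponent 3.0 as in the lzdgen default.
literals = [power_sample(rng, 256, 3.0) for _ in range(20000)]
# Match lengths could be drawn the same way, offset by a minimum length
# (the minimum of 3 and the range of 256 are assumptions).
match_lengths = [3 + power_sample(rng, 256, 3.0) for _ in range(20000)]
```

An exponent of 1.0 leaves the distribution uniform, while larger exponents concentrate more of the samples on small values.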
This simplification means it cannot generate data with a limited alphabet, like DNA sequences.
The ratio parameter is approximate. Skewed literal distributions may create matches, and the way matches are created from a buffer may affect the distribution of byte values.
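Because the ratio is only a target, it can be worth measuring what a compressor actually achieves on the generated data. The sketch below uses Python's zlib module as a stand-in compressor; the function name achieved_ratio is hypothetical, and other compressors will report different ratios for the same data.

```python
import zlib

def achieved_ratio(data, level=6):
    """Return uncompressed size divided by zlib-compressed size."""
    compressed = zlib.compress(data, level)
    return len(data) / len(compressed)
```

For a file written by lzdgen, such as foo.bin from the examples above, the ratio could be checked with achieved_ratio(open("foo.bin", "rb").read()).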
Please note that while data generated in this way may be useful for some kinds of testing and benchmarking, it is no substitute for unit tests that cover the limits of an algorithm.
lzdatagen uses a PCG random number generator. In verbose mode it will print the seed value to stderr. The --seed option can be used to generate reproducible data.
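To show why a fixed seed gives reproducible output, here is a Python sketch of PCG32 (the XSH-RR variant from the PCG reference implementation). Which PCG variant lzdatagen uses and how it applies the seed are not stated here, so treat the class below as an illustration rather than a description of its internals.

```python
MASK64 = (1 << 64) - 1
MULTIPLIER = 6364136223846793005  # LCG multiplier from the PCG reference code

class PCG32:
    """Sketch of PCG32 (XSH-RR): 64-bit state, 32-bit output."""

    def __init__(self, seed, stream=0):
        # Seeding procedure of the reference pcg32_srandom_r.
        self.state = 0
        self.inc = ((stream << 1) | 1) & MASK64  # increment must be odd
        self.next_u32()
        self.state = (self.state + seed) & MASK64
        self.next_u32()

    def next_u32(self):
        old = self.state
        # Advance the underlying linear congruential generator.
        self.state = (old * MULTIPLIER + self.inc) & MASK64
        # Output function: xorshift the old state, then rotate right.
        xorshifted = (((old >> 18) ^ old) >> 27) & 0xFFFFFFFF
        rot = old >> 59
        rotated = (xorshifted >> rot) | (xorshifted << ((32 - rot) & 31))
        return rotated & 0xFFFFFFFF

# Two generators with the same seed yield the same sequence.
a = PCG32(seed=42)
b = PCG32(seed=42)
same = [a.next_u32() for _ in range(8)] == [b.next_u32() for _ in range(8)]
```

Since the generator is fully determined by its seed and stream, every value derived from it is too, which is what makes seeded output repeatable.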
A few other projects in this area:
License
This project is licensed under the Apache License, Version 2.0.