gz-sort
gz-sort copied to clipboard
A utility for sorting really big files. http://kmkeen.com/gz-sort/
More details at http://kmkeen.com/gz-sort/
perform a merge sort over a multi-GB gz compressed file
Quickstart
git clone https://github.com/keenerd/gz-sort; cd gz-sort; make; ./gz-sort -h
Needs the zlib headers and probably only builds on GNU/Linux.
use: gz-sort [-u] [-S n] [-P n] source.gz dest.gz
options:
-h: help
-u: unique
-S n: size of presort, supports k/M/G suffix
a traditional in-memory sort (default n=1M)
-P n: use multiple threads (experimental, default disabled)
-T: pass through (debugging/benchmarks)
estimating run time, crudely:
time gzip -dc data.gz | gzip > /dev/null
unthreaded: seconds * entropy * (log2(uncompressed_size/S)+2)
(where 'entropy' is a fudge-factor between 1.5 for an
already sorted file and 3 for a shuffled file)
S and P are the corresponding settings
multithreaded: maybe unthreaded/sqrt(P) ?
estimated disk use:
2x source.gz
Minimum requirements to sort a terabyte:
- 4MB ram (yes, megabyte)
- free disk space equal to the twice the compressed source.gz
Known bugs to fix
Email me if you are using gz-sort and any of these omissions are causing you trouble. For that matter, email me if you find something not on this list too.
- Does not build on non-gnu systems.
- Sqrt(threads) is a terrible ratio.
- No support for uncompressed stdin streams.
- Breaks if a line is longer than the buffer size.
- Lacks all error handling.
- Ugly code with lots of ways to refactor.
- Output could use predictable flushes.
Performance tweaks to try
- Profile!
- Parallelize the final n-way merge. This will require adding IPC.
- Filter unique lines during the earlier passes.
- Try out zlib-ng, about half of cpu time is spent on (un)gzipping.
- Improve memory estimation, it lowballs and that hurts the presort.
- Byte-based seeking instead of line-counting.