keenerd/gz-sort: A utility for sorting really big files. http://km...

More details at http://kmkeen.com/gz-sort/

perform a merge sort over a multi-GB gz compressed file

Quickstart

git clone https://github.com/keenerd/gz-sort; cd gz-sort; make; ./gz-sort -h

Needs the zlib headers and probably only builds on GNU/Linux.

use: gz-sort [-u] [-S n] [-P n] source.gz dest.gz

options:
   -h: help
   -u: unique
   -S n: size of presort, supports k/M/G suffix
         a traditional in-memory sort (default n=1M)
   -P n: use multiple threads (experimental, default disabled)
   -T: pass through (debugging/benchmarks)

estimating run time, crudely:
    time gzip -dc data.gz | gzip > /dev/null
    unthreaded: seconds * entropy * (log2(uncompressed_size/S)+2)
    (where 'entropy' is a fudge-factor between 1.5 for an
    already sorted file and 3 for a shuffled file)
    S and P are the corresponding settings
    multithreaded: maybe unthreaded/sqrt(P) ?

estimated disk use:
    2x source.gz

Minimum requirements to sort a terabyte:

4MB ram (yes, megabyte)
free disk space equal to the twice the compressed source.gz

Known bugs to fix

Email me if you are using gz-sort and any of these omissions are causing you trouble. For that matter, email me if you find something not on this list too.

Does not build on non-gnu systems.
Sqrt(threads) is a terrible ratio.
No support for uncompressed stdin streams.
Breaks if a line is longer than the buffer size.
Lacks all error handling.
Ugly code with lots of ways to refactor.
Output could use predictable flushes.

Performance tweaks to try

Profile!
Parallelize the final n-way merge. This will require adding IPC.
Filter unique lines during the earlier passes.
Try out zlib-ng, about half of cpu time is spent on (un)gzipping.
Improve memory estimation, it lowballs and that hurts the presort.
Byte-based seeking instead of line-counting.

gz-sort
gz-sort copied to clipboard

Metadata

Quickstart

Minimum requirements to sort a terabyte:

Known bugs to fix

Performance tweaks to try

← Metadata

Owner

Metadata

gz-sort gz-sort copied to clipboard

Metadata

Quickstart

Minimum requirements to sort a terabyte:

Known bugs to fix

Performance tweaks to try

← Metadata

Owner

Metadata

gz-sort
gz-sort copied to clipboard