[feature request] Progress bars & easier decompression
A pretty typical Miller invocation for me looks like…
$ pv foobar.tsv.gz | pigz -d | mlr -t count -f foo then sort -nr count then head
396MiB 0:00:53 [7.41MiB/s] [================================>] 100%
foo count
bar 123
pv is a superuseful command to display a progress bar.
pigz -d decompresses in parallel faster than gzip.
pv foobar.tsv.gz | mlr --prepipe "pigz -d" -t count -f foo then sort -nr count then head works!
mlr --prepipe 'pv | pigz -d' --csv count -g barcode 1084526_1091679/outs/probe_info.csv.gz works in that it gives the correct answer, but the progress bar is not displayed. pv doesn't know the file size unless it's given the name of the file as a command-line argument, or unless its standard input is the file itself rather than a pipe.
So this does work, it gives the right answer and displays the progress bar:
pv <1084526_1091679/outs/probe_info.csv.gz | mlr --prepipe 'pigz -d' --csv count -g barcode
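The difference between the two invocations comes down to what is attached to standard input. A minimal sketch of that check follows; the function name stdin_kind is hypothetical, and it assumes a system that exposes /dev/fd/0 (Linux, or any platform when run under bash):

```shell
# Hypothetical helper: report what kind of object is attached to stdin.
stdin_kind() {
  if [ -f /dev/fd/0 ]; then
    echo file    # regular file: its size can be looked up, so a percentage is possible
  elif [ -p /dev/fd/0 ]; then
    echo pipe    # pipe: the total size is unknown
  else
    echo other   # terminal, socket, device, ...
  fi
}
```

Running `stdin_kind <foobar.tsv.gz` should report a file, while `cat foobar.tsv.gz | stdin_kind` should report a pipe, which is the case where pv cannot draw a percentage bar.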
So on to my feature request.
- When the input to mlr is coming from a file, could it open the prepipe process with stdin being that actual file, rather than a pipe?
- Would you consider adding a -P option that displays a progress bar, possibly by piping into pv, in which case it's just a shortcut for --prepipe pv?
- Would you consider adding a -z option to guess the appropriate decompression program to use, based on the input filename or alternatively on the first few bytes of the file (like libmagic and file)?
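To make the -z idea concrete, here is a rough sketch of the guessing logic as a shell function. It is not how Miller does or would implement it: the name guess_decompressor, the list of extensions, and the choice of tools are all assumptions for illustration. It tries the filename extension first and falls back to the well-known magic numbers of gzip, bzip2, xz, and zstd:

```shell
# Hypothetical helper: print a decompression command for a file,
# or "cat" if the file does not look compressed.
guess_decompressor() {
  file=$1
  # First try the filename extension.
  case "$file" in
    *.gz)  echo "gzip -dc";  return ;;
    *.bz2) echo "bzip2 -dc"; return ;;
    *.xz)  echo "xz -dc";    return ;;
    *.zst) echo "zstd -dc";  return ;;
  esac
  # Otherwise compare the first four bytes with known magic numbers.
  magic=$(head -c 4 "$file" 2>/dev/null | od -An -tx1 | tr -d ' \n')
  case "$magic" in
    1f8b*)    echo "gzip -dc" ;;   # gzip: 1f 8b
    425a68*)  echo "bzip2 -dc" ;;  # bzip2: "BZh"
    fd377a58) echo "xz -dc" ;;     # xz: fd 37 7a 58 ...
    28b52ffd) echo "zstd -dc" ;;   # zstd: 28 b5 2f fd
    *)        echo "cat" ;;
  esac
}
```

With that in place, something like `$(guess_decompressor foobar.tsv.gz) foobar.tsv.gz | mlr -t count -f foo then sort -nr count then head` would pick the decompressor automatically; pigz -dc could be swapped in for gzip -dc where it is installed.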
These two features would cut my example command down from…
pv foobar.tsv.gz | pigz -d | mlr -t count -f foo then sort -nr count then head
to
mlr -Ptz count -f foo then sort -nr count then head foobar.tsv.gz
Originally posted by @sjackman in https://github.com/johnkerl/miller/issues/77#issuecomment-742163372
pigz -d decompresses in parallel faster than gzip.
Slight misstatement here. pigz -d is faster than gzip, but it's not parallelized AFAIK. Parallelized decompression of standard non-block gzip is difficult. pigz to compress is parallelized.
Hi @sjackman ! Sorry for the long delay.
(1) There are --prepipe and --prepipex
(3) Using Miller 6 the auto-decompress by filename extension comes for free -- no -z required. These work if there is a filename as input; if input is from stdin you can use --gzin et al.
(2) I'd prefer to let pv do that awesome thing it does well without re-implementing that feature in Miller
Given those, as of Miller 6 we have
pv foobar.tsv.gz | mlr --gzin -t count -f foo then sort -nr count then head
which is a bit of an improvement perhaps ...
Amazing! Lots of good stuff in those Miller 6 release notes. That's great news about automatically decompressing compressed files!
I was thinking that mlr -P could execute pv if it were installed, rather than reinvent that wheel. I wouldn't call it a high-priority wishlist item though. Feel free to close this issue if you like.
I was thinking that mlr -P could execute pv if it were installed, rather than reinvent that wheel
ooh nice ... will think on ... :)
pigz -d decompresses in parallel faster than gzip.
Slight misstatement here. pigz -d is faster than gzip, but it's not parallelized AFAIK. Parallelized decompression of standard non-block gzip is difficult. pigz to compress is parallelized.
pigz decompression is slightly parallelized: one thread for IO and one thread for decompression. It is maybe 20% faster, which does not make a big difference. If you are interested in parallel decompression you should take a look at bgzip, from the bioinformatics world. It compresses the input data in chunks and generates an index with pointers to those chunks, so you can decompress in parallel and/or decompress only a section of the file without having to read all prior data. The drawback is that the file is slightly bigger than a standard gzip file (extra headers every 64KiB), and that the file must be compressed with bgzip and the index generated to take advantage of this. But the file is fully gzip compliant and can be processed with any other compliant tool (gzip, pigz...).
https://www.htslib.org/doc/tabix.html https://doi.org/10.1093/bioinformatics/btq671
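The reason a chunked file can stay gzip compliant is that the gzip format allows several independently compressed members to be concatenated into one stream. The sketch below shows only that general principle using plain gzip, not bgzip itself (which additionally records block sizes in the extra headers mentioned above); the file and variable names are made up for the example:

```shell
# Compress two chunks independently and append them into one file.
workdir=$(mktemp -d)
printf 'first\n'  | gzip -c >  "$workdir/chunks.gz"
printf 'second\n' | gzip -c >> "$workdir/chunks.gz"

# A standard gzip reader decompresses the members back to back.
result=$(gzip -dc "$workdir/chunks.gz")
printf '%s\n' "$result"
rm -rf "$workdir"
```

Because each member is self-contained, a reader that knows where a member starts can begin decompressing there, which is what an index of chunk offsets makes possible.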
I work in bioinformatics too! Tabix allows indexing a sorted bgzip-compressed tabular file, so that you can seek to an arbitrary record in the file.