miller
[feature request] Progress bars & easier decompression
A pretty typical Miller invocation for me looks like…
$ pv foobar.tsv.gz | pigz -d | mlr -t count -f foo then sort -nr count then head
396MiB 0:00:53 [7.41MiB/s] [================================>] 100%
foo count
bar 123
pv is a super useful command to display a progress bar.
pigz -d decompresses in parallel, faster than gzip.
pv foobar.tsv.gz | mlr --prepipe "pigz -d" -t count -f foo then sort -nr count then head
works!
mlr --prepipe 'pv | pigz -d' --csv count -g barcode 1084526_1091679/outs/probe_info.csv.gz
works in that it gives the correct answer, but the progress bar is not displayed, because pv doesn't know the file size unless it's given the name of the file as a command-line argument or its standard input is the file itself (not a pipe).
So this does work, it gives the right answer and displays the progress bar:
pv <1084526_1091679/outs/probe_info.csv.gz | mlr --prepipe 'pigz -d' --csv count -g barcode
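To illustrate the difference between those two invocations: a program can only show a percentage if it can learn the total size, and that depends on what is attached to its standard input. Here is a minimal sketch of that check in shell. The name stdin_kind is a made-up helper for illustration, not part of pv or Miller.

```shell
# Hypothetical helper: report what kind of object is attached to stdin.
stdin_kind() {
  if [ -f /dev/stdin ]; then
    # A regular file: its size can be looked up, so a percentage is possible.
    echo "regular file"
  elif [ -p /dev/stdin ]; then
    # A pipe: the total size is unknown up front.
    echo "pipe"
  else
    echo "other"
  fi
}
```

Running stdin_kind < somefile reports a regular file, while cat somefile | stdin_kind reports a pipe, which is presumably the situation pv finds itself in when it cannot draw the bar.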
So on to my feature request.
- When the input to mlr is coming from a file, could it open the prepipe process with stdin being that actual file, rather than a pipe?
- Would you consider adding a -P option that displays a progress bar, possibly by piping into pv, in which case it's just a shortcut for --prepipe pv?
- Would you consider adding a -z option to guess the appropriate decompression program to use based on the input filename, or alternatively based on the first few bytes of the file (like libmagic and file)?
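To make the -z idea concrete, here is a rough sketch of guessing the decompressor from the first few bytes of the file, done outside Miller in plain shell. The function name guess_decompressor and the choice of commands it prints are my own assumptions for illustration. The magic numbers are the standard ones for gzip, bzip2, zstd and xz.

```shell
# Hypothetical helper: print a decompression command for the given file,
# chosen from its leading magic bytes. Falls back to cat for plain input.
guess_decompressor() {
  # First four bytes as lowercase hex with no separators, e.g. 1f8b0800.
  magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
  case "$magic" in
    1f8b*)     echo "gzip -dc" ;;   # gzip (pigz -dc would also read this)
    425a68*)   echo "bzip2 -dc" ;;  # "BZh"
    28b52ffd*) echo "zstd -dc" ;;   # zstd frame magic
    fd377a58*) echo "xz -dc" ;;     # first four bytes of the xz magic
    *)         echo "cat" ;;        # assume uncompressed
  esac
}
```

The printed command could then be handed to --prepipe, along the lines of mlr --prepipe "$(guess_decompressor foobar.tsv.gz)" ... foobar.tsv.gz, though I have not tested that combination.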
These two features would cut my example command down from…
pv foobar.tsv.gz | pigz -d | mlr -t count -f foo then sort -nr count then head
to
mlr -Ptz count -f foo then sort -nr count then head foobar.tsv.gz
Originally posted by @sjackman in https://github.com/johnkerl/miller/issues/77#issuecomment-742163372
> pigz -d decompresses in parallel faster than gzip.

Slight misstatement here. pigz -d is faster than gzip, but it's not parallelized AFAIK. Parallelized decompression of standard non-block gzip is difficult. Compression with pigz is parallelized.
Hi @sjackman ! Sorry for the long delay.
(1) There is --prepipe and --prepipex
(3) Using Miller 6 the auto-decompress by filename extension comes for free -- no -z required. These work if there is a filename as input; if input is from stdin you can use --gzin et al.
(2) I'd prefer to let pv do that awesome thing it does well without re-implementing that feature in Miller
Given those, as of Miller 6 we have
pv foobar.tsv.gz | mlr --gzin -t count -f foo then sort -nr count then head
which is a bit of an improvement perhaps ...
Amazing! Lots of good stuff in those Miller 6 release notes. That's great news about automatically decompressing compressed files!
I was thinking that mlr -P could execute pv if it were installed, rather than reinvent that wheel. I wouldn't call it a high-priority wishlist item though. Feel free to close this issue if you like.
> I was thinking that mlr -P could execute pv if it were installed, rather than reinvent that wheel
ooh nice ... will think on ... :)
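In the meantime, the "use pv if it is installed" behaviour can be approximated with a small wrapper outside Miller. This is only a sketch. The name progress_cat is hypothetical, and it assumes nothing beyond pv accepting a filename argument, as in the examples above.

```shell
# Hypothetical helper: stream a file to stdout, with a progress bar on
# stderr when pv is installed, and silently via cat when it is not.
progress_cat() {
  if command -v pv >/dev/null 2>&1; then
    pv "$1"
  else
    cat "$1"
  fi
}
```

It would be used as progress_cat foobar.tsv.gz | mlr --gzin -t count -f foo then sort -nr count then head, so the pipeline still runs on machines without pv.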
> > pigz -d decompresses in parallel faster than gzip.
>
> Slight misstatement here. pigz -d is faster than gzip, but it's not parallelized AFAIK. Parallelized decompression of standard non-block gzip is difficult. pigz to compress is parallelized.
pigz decompression is slightly parallelized: one thread for IO and one thread for decompression. It is maybe 20% faster, which does not make a big difference. If you are interested in parallel decompression you should take a look at bgzip, from the bioinformatics world. It compresses the input data in chunks and generates an index with pointers to those chunks, so you can decompress in parallel and/or decompress only a section of the file without having to read all prior data. The drawback is that the file is slightly bigger than a standard gzip file (extra headers every 64 KiB) and that you need the file to be compressed with bgzip and the index generated to take advantage of this. But the file is fully gzip compliant and can be processed with any other compliant tool (gzip, pigz...).
https://www.htslib.org/doc/tabix.html https://doi.org/10.1093/bioinformatics/btq671
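The reason a chunked file can stay gzip compliant is that the gzip format allows several independently compressed members to be concatenated, and a compliant decompressor reads them back as one stream. The sketch below shows only that property with plain gzip. It is not the actual bgzip block format (which adds extra header fields and a fixed block size), and chunked_gzip is a made-up name.

```shell
# Hypothetical helper: compress a file as a series of independent gzip
# members, one per chunk of N lines, concatenated into a single .gz file.
# Usage: chunked_gzip INPUT LINES_PER_CHUNK OUTPUT
# Assumes INPUT is non-empty.
chunked_gzip() {
  workdir=$(mktemp -d)
  # split names the pieces chunk.aa, chunk.ab, ... so the glob below
  # visits them in their original order.
  split -l "$2" "$1" "$workdir/chunk."
  : > "$3"
  for chunk in "$workdir"/chunk.*; do
    gzip -c "$chunk" >> "$3"
  done
  rm -rf "$workdir"
}
```

Decompressing the result with gzip -dc gives back the original input, even though each chunk was compressed on its own and could in principle be decompressed on its own.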
I work in bioinformatics too! Tabix allows indexing a sorted bgzip-compressed tabular file, so that you can seek to an arbitrary record in the file.