
[feature request] Progress bars & easier decompression

Open johnkerl opened this issue 3 years ago • 6 comments

A pretty typical Miller invocation for me looks like…

$ pv foobar.tsv.gz | pigz -d | mlr -t count -f foo then sort -nr count then head
 396MiB 0:00:53 [7.41MiB/s] [================================>] 100%
foo count
bar 123

pv is a super-useful command to display a progress bar. pigz -d decompresses in parallel faster than gzip.

pv foobar.tsv.gz | mlr --prepipe "pigz -d" -t count -f foo then sort -nr count then head works!

mlr --prepipe 'pv | pigz -d' --csv count -g barcode 1084526_1091679/outs/probe_info.csv.gz works in that it gives the correct answer, but the progress bar is not displayed: pv doesn't know the file size unless it's given the file name as a command-line argument, or unless the file (not a pipe) is its standard input.

So this does work, it gives the right answer and displays the progress bar: pv <1084526_1091679/outs/probe_info.csv.gz | mlr --prepipe 'pigz -d' --csv count -g barcode
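
The distinction can be demonstrated without pv at all: a process can stat its standard input to learn the total size when stdin is a regular file, but not when stdin is a pipe. A minimal sketch of that idea (the file name demo.dat is made up here):

```shell
printf '0123456789' > demo.dat

# Redirected from the file: stdin IS the regular file, so its size is stat-able.
kind_redirect=$( { [ -f /dev/stdin ] && echo regular || echo pipe; } < demo.dat )

# Fed through a pipe: stdin is a pipe, so the total size is unknowable up front.
kind_pipe=$( cat demo.dat | { [ -f /dev/stdin ] && echo regular || echo pipe; } )

echo "$kind_redirect $kind_pipe"   # → regular pipe
rm -f demo.dat
```

This is why pv <file.gz | ... can show a percentage while ... | pv | ... can only show throughput.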

So on to my feature request.

  1. When the input to mlr is coming from a file, could it open the prepipe process with its stdin being that actual file, rather than a pipe?
  2. Would you consider adding a -P option that displays a progress bar, possibly by piping through pv, in which case it's just a shortcut for --prepipe pv?
  3. Would you consider adding a -z option that guesses the appropriate decompression program from the input filename, or alternatively from the first few bytes of the file? (like libmagic and `file`)
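
For item 3, the magic-bytes guess could be sketched in shell. This is only an illustration, not Miller behavior; the decompressor table and the file name sample.gz are assumptions (gzip files start with the bytes 1f 8b, bzip2 files with 42 5a, i.e. "BZ"):

```shell
printf 'hello\n' | gzip -c > sample.gz

# Read the first two bytes and render them as hex.
magic=$(head -c 2 sample.gz | od -An -tx1 | tr -d ' ')

# Pick a decompressor by magic number, falling back to plain cat.
case "$magic" in
  1f8b) decomp="gzip -dc"  ;;
  425a) decomp="bzip2 -dc" ;;
  *)    decomp="cat"       ;;
esac

result=$($decomp sample.gz)
echo "$result"   # → hello
rm -f sample.gz
```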

These features would cut my example command down from pv foobar.tsv.gz | pigz -d | mlr -t count -f foo then sort -nr count then head to mlr -Ptz count -f foo then sort -nr count then head foobar.tsv.gz

Originally posted by @sjackman in https://github.com/johnkerl/miller/issues/77#issuecomment-742163372

johnkerl avatar Dec 10 '20 01:12 johnkerl

pigz -d decompresses in parallel faster than gzip.

Slight misstatement here. pigz -d is faster than gzip, but it's not parallelized AFAIK. Parallelized decompression of standard non-block gzip is difficult. pigz to compress is parallelized.

sjackman avatar Dec 10 '20 01:12 sjackman

Hi @sjackman ! Sorry for the long delay.

  1. There are --prepipe and --prepipex.
  2. I'd prefer to let pv do that awesome thing it does well, without re-implementing that feature in Miller.
  3. With Miller 6, auto-decompression by filename extension comes for free -- no -z required. This works when there is a filename as input; if input is from stdin you can use --gzin et al.

Given those, as of Miller 6 we have

pv foobar.tsv.gz | mlr --gzin -t count -f foo then sort -nr count then head

which is a bit of an improvement perhaps ...

johnkerl avatar Oct 26 '21 04:10 johnkerl

Amazing! Lots of good stuff in those Miller 6 release notes. That's great news about automatically decompressing compressed files!

I was thinking that mlr -P could execute pv if it were installed, rather than reinvent that wheel. I wouldn't call it a high-priority wishlist item though. Feel free to close this issue if you like.

sjackman avatar Oct 28 '21 19:10 sjackman

I was thinking that mlr -P could execute pv if it were installed, rather than reinvent that wheel

ooh nice ... will think on ... :)

johnkerl avatar Oct 28 '21 19:10 johnkerl

pigz -d decompresses in parallel faster than gzip.

Slight misstatement here. pigz -d is faster than gzip, but it's not parallelized AFAIK. Parallelized decompression of standard non-block gzip is difficult. pigz to compress is parallelized.

pigz decompression is slightly parallelized: one thread for IO and one thread for decompression. It is maybe 20% faster, which does not make a big difference.

If you are interested in parallel decompression, you should take a look at bgzip, from the bioinformatics world. It compresses the input data in chunks and generates an index with pointers to those chunks, so you can decompress in parallel and/or decompress only a section of the file without having to read all the prior data.

The drawbacks are that the file is slightly bigger than a standard gzip file (extra headers every 64KiB), and that the file must have been compressed with bgzip, with the index generated, to take advantage of this. But the file is fully gzip compliant and can be processed with any other compliant tool (gzip, pigz...).
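
The "fully gzip compliant" property rests on the fact that a gzip stream may consist of multiple concatenated members, each compressed independently, and any compliant decompressor emits them back to back. A minimal sketch of that idea (blocks.gz is a made-up name, and this is not bgzip itself, which additionally writes BGZF extra headers and an index):

```shell
# Compress two chunks as independent gzip members and concatenate them.
printf 'chunk one\n' | gzip -c >  blocks.gz
printf 'chunk two\n' | gzip -c >> blocks.gz

# A plain gzip decompressor reads both members in order.
whole=$(gzip -dc blocks.gz)
echo "$whole"
rm -f blocks.gz
```

Because each member is self-contained, a tool that knows the member offsets (bgzip's index) can decompress chunks in parallel or seek straight to one of them.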

Poshi avatar Feb 18 '22 23:02 Poshi

https://www.htslib.org/doc/tabix.html https://doi.org/10.1093/bioinformatics/btq671

I work in bioinformatics too! Tabix allows indexing a sorted bgzip-compressed tabular file, so that you can seek to an arbitrary record in the file.

sjackman avatar Feb 21 '22 18:02 sjackman