htslib

bgzip could allow specifying several files

Open brainstorm opened this issue 9 years ago • 6 comments

Here are a few valid VCF files from a bioinformatics tiny-test-data repository:

$ find . -iname "*.vcf"
./sample1-bcbio-cancer.vcf
./sample2-bcbio-cancer.vcf
./spec-svs-v4.1.vcf
./spec-v4.3.vcf

Using xargs to bgzip all the files in the directory only compresses the first one:

$ find . -iname "*.vcf" | xargs bgzip
$ find . -iname "*.vcf.gz"
./sample1-bcbio-cancer.vcf.gz

It would be nice if globbing were supported by the tool itself, as it is by pretty much any other command-line tool, so that all files matching *.vcf get compressed, not only one at a time:

$ bgzip *.vcf
$ find . -iname "*.vcf.gz"
./sample1-bcbio-cancer.vcf.gz
./sample2-bcbio-cancer.vcf.gz

The only two alternatives that seem to work are either a plain bash for loop:

for i in *.vcf; do bgzip "$i"; done

Or using GNU parallel.

brainstorm avatar Jun 16 '16 09:06 brainstorm

Most commands don't support globbing. It's typically your shell that does it for you.

Eg try echo bgzip *.vcf and you'll see that *.vcf gets expanded and passed as multiple arguments. If you really wanted bgzip to handle globbing itself then you'd have to type bgzip "*.vcf".

What you're really asking for is for bgzip to accept multiple files instead of one file. I think this is actually a bug as the usage implies it already does this, but the code doesn't loop.
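To see the expansion without touching any real data, here is a small sketch. It assumes a scratch directory from mktemp and a made-up helper, count_args, that only reports how many arguments it was given; neither is part of bgzip or htslib.

```shell
# Scratch directory with three placeholder files (names are arbitrary).
tmpdir=$(mktemp -d)
touch "$tmpdir/a.vcf" "$tmpdir/b.vcf" "$tmpdir/c.vcf"

# Hypothetical helper: prints the number of arguments it received.
count_args() { echo "$#"; }

# Unquoted: the shell expands the pattern before the command runs.
unquoted=$(cd "$tmpdir" && count_args *.vcf)
# Quoted: the literal string *.vcf is passed through as one argument.
quoted=$(cd "$tmpdir" && count_args "*.vcf")

echo "unquoted: $unquoted, quoted: $quoted"
rm -rf "$tmpdir"
```

The unquoted call receives three arguments, while the quoted call receives the pattern itself as a single argument, which is what a tool would have to expand on its own if it did the globbing.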

Usage:   bgzip [OPTIONS] [FILE] ...

jkbonfield avatar Jun 16 '16 10:06 jkbonfield

The bgzip man page synopsis is

bgzip [-cdhB] [-b virtualOffset] [-s size] [file]

and the usage display is

Usage: bgzip [OPTIONS] [FILE] ...

which already disagree with each other.

I had thought bgzip only operated on one file at a time in agreement with gzip (and zip, but that's more akin to tar anyway), but it turns out that you can give several filenames to gzip and gunzip and they will compress or decompress each one. So it would be a reasonable enhancement to add this capability to bgzip. (Although it will be a pain due to the monolithic code of its main()!)
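For a bgzip that only handles one file per invocation, a small wrapper along these lines would give the gzip-like behaviour in the meantime. This is only a sketch: bgzip_each and the BGZIP_CMD override are names invented here for illustration, not anything htslib provides.

```shell
# Hypothetical wrapper: run the compressor once per file argument.
# BGZIP_CMD is an assumed override hook, so the loop can be tried
# with a stand-in command when bgzip is not installed.
bgzip_each() {
    rc=0
    for f in "$@"; do
        # Keep going after a failure, but remember that one happened.
        "${BGZIP_CMD:-bgzip}" "$f" || rc=1
    done
    return "$rc"
}
```

It would be called as bgzip_each *.vcf. Quoting "$f" keeps file names containing spaces intact, and the return value is non-zero if any single file failed.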

tabix interprets additional arguments as regions, so you would not be able to tabixify several files with tabix foo bar baz.

jmarshall avatar Jun 16 '16 10:06 jmarshall

There is also a workaround of

find … | xargs -n1 bgzip
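The difference -n1 makes can be checked with echo standing in for bgzip. The file names below are placeholders and nothing is compressed:

```shell
# Without -n1, xargs packs all names into one command line.
without_n1=$(printf '%s\n' a.vcf b.vcf c.vcf | xargs echo bgzip)

# With -n1, xargs runs the command once per name.
with_n1=$(printf '%s\n' a.vcf b.vcf c.vcf | xargs -n1 echo bgzip)

echo "$without_n1"
echo "$with_n1"
```

The first form prints a single command line containing all three names, which is the invocation where only the first file gets compressed. The second prints three separate command lines. For file names containing whitespace, find's -print0 together with xargs -0 (available in the GNU and BSD versions) is the usual remedy.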

jmarshall avatar Jun 20 '16 13:06 jmarshall

@brainstorm you could try using GNU Parallel:

parallel bgzip ::: *.vcf

This will run multiple instances of bgzip, each one with a different file. The number of parallel instances goes up to the number of cores in your machine. Read more about GNU Parallel, its man pages and tutorial, and how to run it on multiple machines.

hzpc-joostk avatar Jun 29 '17 07:06 hzpc-joostk

True that @hzpc-joostk, as I mentioned towards the end of the original post... the problem here is mostly a UX one: users expect bgzip to behave somewhat similarly to gzip/bzip2/etc... Basically following the POLA, bioinfo tools are quirky enough to use, let's not contribute further to it ;)

brainstorm avatar Jun 29 '17 08:06 brainstorm

I'm very sorry, I missed the comment at the end of your post...

POLA is a good thing! You're right, especially as bgzip also implements the -d and -c switches, just like other (de)compressors. 👍

hzpc-joostk avatar Jun 29 '17 08:06 hzpc-joostk

Implemented through #1642.

vasudeva8 avatar Oct 12 '23 14:10 vasudeva8