htslib
bgzip could allow specifying several files
Here are a few valid VCF files from a bioinformatics tiny-test-data repository:
$ find . -iname "*.vcf"
./sample1-bcbio-cancer.vcf
./sample2-bcbio-cancer.vcf
./spec-svs-v4.1.vcf
./spec-v4.3.vcf
Using xargs to bgzip all the files in the directory only compresses the first one:
$ find . -iname "*.vcf" | xargs bgzip
$ find . -iname "*.vcf.gz"
./sample1-bcbio-cancer.vcf.gz
It would be nice if the tool itself handled this, as most other command-line tools do, compressing all files that match *.vcf rather than only one at a time:
$ bgzip *.vcf
$ find . -iname "*.vcf.gz"
./sample1-bcbio-cancer.vcf.gz
./sample2-bcbio-cancer.vcf.gz
The only two alternatives that seem to work are either a plain bash for loop:
for i in *.vcf; do bgzip "$i"; done
Or using GNU parallel.
Most commands don't support globbing. It's typically your shell that does it for you.
E.g. try echo bgzip *.vcf and you'll see that *.vcf gets expanded and passed as multiple arguments. If you really wanted bgzip to handle globbing then you'd have to type bgzip "*.vcf".
What you're really asking for is for bgzip to accept multiple files instead of one file. I think this is actually a bug, as the usage implies it already does this, but the code doesn't loop.
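To make the point about shell expansion concrete, here is a minimal sketch that counts the arguments a command would receive for an unquoted and a quoted pattern. The scratch directory and the file names a.vcf and b.vcf are made up for illustration, and bgzip is not invoked at all.

```shell
# Scratch directory with two empty files (illustrative names only).
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
touch a.vcf b.vcf

# Unquoted: the shell expands the pattern before any command runs,
# so a command would receive two separate arguments.
set -- *.vcf
unquoted_count=$#

# Quoted: the pattern is passed through literally as one argument,
# and the command itself would have to do the globbing.
set -- "*.vcf"
quoted_count=$#
quoted_arg=$1

echo "unquoted: $unquoted_count argument(s)"
echo "quoted: $quoted_count argument(s): $quoted_arg"
```

So a tool only ever needs to loop over the arguments it is given; the pattern matching has already happened by the time it starts.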
Usage: bgzip [OPTIONS] [FILE] ...
The bgzip man page synopsis is
bgzip [-cdhB] [-b virtualOffset] [-s size] [file]
and the usage display is
Usage: bgzip [OPTIONS] [FILE] ...
which already disagree with each other.
I had thought bgzip only operated on one file at a time in agreement with gzip (and zip, but that's more akin to tar anyway), but it turns out that you can give several filenames to gzip and gunzip and they will compress or decompress each one. So it would be a reasonable enhancement to add this capability to bgzip. (Although it will be a pain due to the monolithic code of its main()!)
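The gzip behaviour described here is easy to check. The following sketch uses two throwaway text files (hypothetical names) in a scratch directory and passes both to a single gzip invocation.

```shell
# Scratch directory with two small files (illustrative names only).
tmpdir=$(mktemp -d)
cd "$tmpdir" || exit 1
printf 'first\n' > first.txt
printf 'second\n' > second.txt

# One invocation, two file arguments: gzip compresses each in turn
# and replaces it with a .gz file.
gzip first.txt second.txt

ls
```

Both first.txt.gz and second.txt.gz should be listed, which is the behaviour being proposed for bgzip.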
tabix interprets additional arguments as regions, so you would not be able to tabixify several files with tabix foo bar baz.
There is also a workaround of
find … | xargs -n1 bgzip
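To see why -n1 makes the difference, this sketch replaces bgzip with a tiny sh -c stand-in that only reports how many arguments each invocation receives. The three file names are invented and are fed in with printf rather than find.

```shell
# Three illustrative names, one per line, standing in for find output.
names=$(printf '%s\n' a.vcf b.vcf c.vcf)

# Default xargs: a single invocation that receives every name at once.
# A tool that only reads its first file argument would process one file.
batched=$(printf '%s\n' "$names" | xargs sh -c 'echo "$# args"' sh)

# With -n1: one invocation per name, each with exactly one argument.
per_file=$(printf '%s\n' "$names" | xargs -n1 sh -c 'echo "$# args"' sh)

echo "without -n1: $batched"
echo "with -n1:"
echo "$per_file"
```

For file names containing spaces or newlines, pairing find -print0 with xargs -0 -n1 is safer where both options are available.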
@brainstorm you could try using GNU Parallel:
parallel bgzip ::: *.vcf
This will run multiple instances of bgzip, each one with a different file. The number of parallel instances is up to the number of cores in your machine. Read more about GNU Parallel in its man pages and tutorial, including how to run it on multiple machines.
True that @hzpc-joostk, as I mentioned towards the end of the original post... the problem here is mostly a UX one: users expect bgzip to behave somewhat similarly to gzip/bzip2/etc... Basically following the POLA, bioinfo tools are quirky enough to use, let's not contribute further to it ;)
I'm very sorry, I missed your remark at the end of the original post...
POLA is a good thing! You're right, especially as bgzip also implements the -d and -c switches, just like other (de)compressors. 👍
Implemented through #1642.