
`bcftools merge` seemingly opens all files together, hitting open file limits

jchorl opened this issue 2 months ago · 3 comments

I tried to run bcftools merge on thousands of files. This runs up against the open file descriptor limit on my Linux machine.

I managed to reproduce it by generating thousands of VCFs, then running:

docker run -it --rm -v $(pwd):/work -w /work --ulimit nofile=2048:2048 ubuntu:24.04 bash

apt-get update
apt-get install -y bcftools

bcftools concat -a -O z -f file-list.txt -o /dev/null

The result:

root@b0b48a108195:/work# bcftools concat -a -O z -f file-list.txt -o /dev/null
Checking the headers and starting positions of 10000 files
[E::hts_idx_load3] Could not load local index file 'generated_vcfs/06_02044_chr6.vcf.bgz.csi' : Too many open files
Failed to open generated_vcfs/06_02044_chr6.vcf.bgz: could not load index
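
For completeness, here is a minimal sketch of how such a file set can be generated (the file names and the single dummy record per file are made up; it assumes bgzip from htslib is available alongside bcftools):

mkdir -p generated_vcfs
for i in $(seq 1 10000); do
    f=$(printf 'generated_vcfs/sample_%05d_chr6.vcf.bgz' "$i")
    # one dummy record per file; overlapping positions are fine, since concat -a allows overlaps
    printf '##fileformat=VCFv4.2\n##contig=<ID=chr6,length=170805979>\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\nchr6\t%d\t.\tA\tG\t.\tPASS\t.\n' "$i" | bgzip -c > "$f"
    bcftools index "$f"    # writes the .csi index next to each file
done
ls generated_vcfs/*.vcf.bgz > file-list.txt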

Intuitively, I would expect merge to handle many, many files. I know I can just do a recursive merge, but does it need to open all the files at the same time?

Thanks!!

jchorl · Oct 30 '25

You say bcftools merge but show bcftools concat.

When the files can overlap (concat -a), any file can have a record at any coordinate, so there is really no way around keeping all files open: the alternative of repeatedly opening and closing them would be expensive, since open/close are system calls. You can either increase your limit (ulimit -n) or concatenate recursively, for example as sketched below.
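
A rough sketch of the recursive approach done on the caller's side (the batch size of 500 and the file names are arbitrary):

split -l 500 file-list.txt batch_    # 500 files per batch, comfortably under the fd limit
for b in batch_*; do
    bcftools concat -a -O z -f "$b" -o "$b.vcf.gz"
    bcftools index "$b.vcf.gz"       # concat -a needs indexed inputs
done
ls batch_*.vcf.gz > intermediates.txt
bcftools concat -a -O z -f intermediates.txt -o merged.vcf.gz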

pd3 · Oct 31 '25

Hey @pd3, you're very right that I meant concat. Sorry!

Would there be appetite for adding a recursive concat to bcftools? It could take a --batch-size argument or similar.

Ordinarily I would be a proponent of letting the caller handle the recursive concat. The issue here is that concat is a batch operation, and it is not obvious that concatenating thousands of files requires opening them all at once (although your explanation makes sense). That makes it a bit of a footgun that the tool itself could defuse. Another way to put it: there is an unexpected scaling limit when using the tool as intended, and the tool could remove it.

Let me know if there is appetite to add this functionality, I'd be happy to take a stab at it.
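
To illustrate, the interface could look something like this (the --batch-size option is hypothetical, not something bcftools currently accepts):

bcftools concat -a -O z --batch-size 500 -f file-list.txt -o merged.vcf.gz

Internally, concat would then open at most 500 files at a time and concatenate intermediate results behind the scenes.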

Thanks!

jchorl · Oct 31 '25

Mmm, I’m not sure that’s a good idea. Let’s first consider where those thousands of files you want to concatenate are coming from — why are they overlapping in the first place? The program was originally designed for reassembling genome chunks; for example, when you split the genome into regions, perform variant calling on each, and then merge the chunks back together. This can be done without any overlaps, provided the correct --regions-overlap pos option was used, in which case concatenation is straightforward.

If there's only a small amount of overlap (that is, if the chunks are sequential and only adjacent ones can overlap), then the best approach is to determine their starting coordinates and keep only the relevant files open at a time. This is what the phased concatenation already does: it keeps just two files open simultaneously.
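
From the shell, the first half of that idea might look roughly like this (illustration only, not the internal implementation): record each file's first coordinate, then sort by it so that only neighbouring files would ever need to be open together.

while read -r f; do
    first=$(bcftools query -f '%CHROM\t%POS\n' "$f" | head -n 1)    # coordinate of the first record
    printf '%s\t%s\n' "$first" "$f"
done < file-list.txt | sort -k1,1 -k2,2n > files-by-start.txt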

I think it would be difficult to integrate this cleanly into the existing code. Probably a new branch in the concat() function would have to be added, similar to the current phased_concat branch.

pd3 · Nov 04 '25