`bcftools merge` seemingly opens all files together, hitting open file limits
I tried to run bcftools merge on thousands of files. This runs up against the open file descriptor (fd) limit on my Linux machine.
I managed to reproduce this by generating thousands of VCFs, then running:
docker run -it --rm -v $(pwd):/work -w /work --ulimit nofile=2048:2048 ubuntu:24.04 bash
apt-get update
apt-get install -y bcftools
bcftools concat -a -O z -f file-list.txt -o /dev/null
The result:
root@b0b48a108195:/work# bcftools concat -a -O z -f file-list.txt -o /dev/null
Checking the headers and starting positions of 10000 files
[E::hts_idx_load3] Could not load local index file 'generated_vcfs/06_02044_chr6.vcf.bgz.csi' : Too many open files
Failed to open generated_vcfs/06_02044_chr6.vcf.bgz: could not load index
Intuitively, I would expect merge to handle many, many files. I know I can just do a recursive merge, but does it need to open all the files at the same time?
Thanks!!
You say bcftools merge but show bcftools concat.
When the files can overlap (concat -a), all files can have records at the same coordinate, so there is really no way around keeping all of them open: repeatedly opening and closing them instead would be too expensive, since open/close are system calls. You can either increase your limits (ulimit -n) or merge recursively.
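For example, to raise the limit (the numbers below are arbitrary; pick something above the number of files in your list):

# check the current soft limit on open file descriptors
ulimit -n
# raise it for the current shell, up to the hard limit
ulimit -n 16384
# or, for a container, set both soft and hard limits at startup
docker run -it --rm -v $(pwd):/work -w /work --ulimit nofile=16384:16384 ubuntu:24.04 bash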
Hey @pd3, you're absolutely right that I meant concat. Sorry!
Would there be appetite to add recursive-concat into bcftools? We could use a --batch-size argument or similar.
Ordinarily I would be a proponent of letting the caller handle the recursive concat. The issue here is that concat is inherently a batch operation, and it is not obvious that concatenating thousands of files requires opening them all at once (although your explanation makes sense). That makes this a bit of a footgun that could be solved by the tool itself. Another way to put it: there is an unexpected scaling limit when using the tool as intended, and the tool could remove it.
Let me know if there is appetite for adding this functionality; I'd be happy to take a stab at it.
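For illustration, here is a rough bash sketch of the caller-side batching that such an option could internalize (the batch size, temp directory, and file names are made up; it assumes every input is bgzipped and indexed):

# hypothetical batch size, anything comfortably below the fd limit
BATCH=500
mkdir -p tmp_batches
split -l "$BATCH" file-list.txt tmp_batches/batch_
: > tmp_batches/merged-list.txt
i=0
for list in tmp_batches/batch_*; do
    out=tmp_batches/merged_${i}.vcf.bgz
    bcftools concat -a -O z -f "$list" -o "$out"
    bcftools index "$out"
    echo "$out" >> tmp_batches/merged-list.txt
    i=$((i+1))
done
# final pass over the much smaller list of intermediate files
bcftools concat -a -O z -f tmp_batches/merged-list.txt -o merged.vcf.bgz

A --batch-size option could do essentially the same thing internally with temporary files, keeping at most that many inputs open at once.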
Thanks!
Mmm, I’m not sure that’s a good idea. Let’s first consider where those thousands of files you want to concatenate are coming from — why are they overlapping in the first place? The program was originally designed for reassembling genome chunks; for example, when you split the genome into regions, perform variant calling on each, and then merge the chunks back together. This can be done without any overlaps, provided the correct --regions-overlap pos option was used, in which case concatenation is straightforward.
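To illustrate the semantics (a minimal sketch, assuming bcftools >= 1.16 where --regions-overlap is available; the region boundaries and file names are made up, and an indexed VCF is split with view here rather than running per-region calling):

bcftools index -f in.vcf.bgz    # -r below needs an index
# with --regions-overlap pos each record goes to exactly one chunk, decided by its POS
bcftools view -r chr6:1-10000000        --regions-overlap pos -O z -o chunk1.vcf.bgz in.vcf.bgz
bcftools view -r chr6:10000001-20000000 --regions-overlap pos -O z -o chunk2.vcf.bgz in.vcf.bgz
# the chunks do not overlap, so a plain concat (no -a) stitches them back together
bcftools concat -O z -o roundtrip.vcf.bgz chunk1.vcf.bgz chunk2.vcf.bgz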
If there’s only a small amount of overlap — that is, if the chunks are sequential and only adjacent ones can overlap — then the best approach is to determine their starting coordinates and keep only the relevant files open at a time. This is what the phased concatenation already does: it keeps just two files open simultaneously.
I think it would be difficult to integrate this cleanly into the existing code. Probably a new branch in the concat() function would have to be added, similar to the current phased_concat branch.