[FEATURE REQUEST] Window limits for removing duplicates/normalizing

Open cukelarter opened this issue 3 months ago • 1 comments

We have very large files that span many samples and are in the scale of several TB per chromosome. In order to perform some QC, we split into smaller files, run the QC, and then remerge into one file. The problem is that the script we used to split the files had some issues and sometimes introduced duplicates.

For example, suppose I want to split two regions from the master file, chr11:10001-20000 and chr11:20001-30000, and there is a variant with the following information: chr11:19999:A:ATTAG

It will appear in both resulting files. I believe the region filter considers not only the variant's position but also the length of the ALT sequence.

Regardless, we now have a lot of files and need to remove these duplicates. We have tried concat --rm-dup and also norm, but both either run into fatal memory issues or take a very long time to run. I wanted to see if there was a way to restrict how the function keeps track of duplicate records, for example by only looking within a specified window, in order to reduce memory overhead and hopefully shorten the run time. I am desperate and open to any alternative solutions as well!

cukelarter avatar Sep 30 '25 16:09 cukelarter

For splitting into regions for this purpose it is best to run with --regions-overlap pos or --targets-overlap pos, see http://samtools.github.io/bcftools/bcftools.html#common_options
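As a rough sketch of the splitting step with that option, the Python helper below only builds one bcftools view command per region. The function name split_commands, the chunk file naming, and the choice of BCF output are illustrative assumptions, not something from this thread; it also assumes an indexed input and a bcftools version recent enough to support --regions-overlap.

```python
def split_commands(bcf_path, regions, out_prefix="chunk"):
    """Build one `bcftools view` command per region (hypothetical helper).

    With `--regions-overlap pos`, a record is selected only when its POS
    falls inside the region, so a variant at chr11:19999 should not also
    land in the chr11:20001-30000 chunk.
    """
    commands = []
    for index, region in enumerate(regions):
        commands.append([
            "bcftools", "view",
            "--regions-overlap", "pos",
            "-r", region,
            "-Ob",                                # write binary BCF
            "-o", f"{out_prefix}.{index}.bcf",    # assumed naming scheme
            bcf_path,
        ])
    return commands
```

Each returned list can be passed to subprocess.run(cmd, check=True) once the paths are adapted to the real data.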

Otherwise, bcftools norm has the -w, --site-win INT option, and you would want to leave out the -f, --fasta-ref option, but I don't think that will make much difference. If your input or output is VCF rather than BCF, most of the CPU time will be taken by VCF<->BCF conversion (the program internally uses the binary BCF representation).

I can imagine for a simple operation like this a simple perl / python script might perform comparably well or even be faster.
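A minimal sketch of what such a script could look like in Python, under several assumptions: the input is a single uncompressed VCF text stream with one header, duplicates are exact matches on CHROM, POS, REF and ALT, and any duplicate appears within window bp of the highest position seen so far on that chromosome. The function name dedup_window and the default window size are made up for illustration.

```python
import sys
from collections import deque


def dedup_window(lines, window=10000):
    """Yield VCF lines, dropping exact (CHROM, POS, REF, ALT) repeats.

    Only keys within `window` bp of the highest position seen so far on
    the current chromosome are remembered, so memory stays bounded.
    Duplicates that are further apart than that are NOT detected.
    """
    seen = set()
    order = deque()          # (pos, key) in arrival order, used for eviction
    current_chrom = None
    max_pos = 0
    for line in lines:
        if line.startswith("#"):
            yield line       # header lines pass through unchanged
            continue
        chrom, pos, _id, ref, alt = line.split("\t", 5)[:5]
        pos = int(pos)
        if chrom != current_chrom:
            # New chromosome: forget everything remembered so far.
            seen.clear()
            order.clear()
            current_chrom = chrom
            max_pos = pos
        max_pos = max(max_pos, pos)
        # Evict keys that have fallen out of the window.
        while order and order[0][0] < max_pos - window:
            seen.discard(order.popleft()[1])
        key = (pos, ref, alt)
        if key in seen:
            continue         # duplicate within the window: drop it
        seen.add(key)
        order.append((pos, key))
        yield line


if __name__ == "__main__":
    sys.stdout.writelines(dedup_window(sys.stdin))
```

Because only keys inside the window are kept, memory use depends on the variant density within the window rather than on the size of the whole file. Reading and writing VCF text still pays the conversion cost mentioned above, so whether this is actually faster would need to be measured on the real data.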

pd3 avatar Oct 13 '25 09:10 pd3