bcftools

feature request: duplicate merging and selecting

Open. davmlaw opened this issue 1 year ago • 1 comment

My use case is lifting over gnomAD v4.1 from GRCh38 to T2T-CHM13v2.0. Sometimes multiple GRCh38 variants resolve to the same T2T coordinate, and I want to be able to process these duplicates (say, picking the highest or lowest AF) rather than just taking the first in the file.

Control how selecting duplicates works

It would be really useful to be able to choose which one to take. You could do this by defining how to sort the dupes and then taking the first. For instance, to take the one with the highest AF, then the highest AC: --rm-dup-sort=-AF,-AC or --rm-dup-sort=AF:desc,AC:desc
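To make the intended semantics concrete, here is a minimal Python sketch of "sort the dupes, take the first". It is not existing bcftools behaviour. The function names, the dict-based record layout, and the choice to rank records with a missing field last are all assumptions made for illustration. Records are assumed to arrive sorted so that duplicates are adjacent.

```python
from itertools import groupby


def parse_sort_spec(spec):
    """Parse a spec such as '-AF,-AC' into [(field, descending), ...]."""
    keys = []
    for token in spec.split(","):
        token = token.strip()
        descending = token.startswith("-")
        keys.append((token[1:] if descending else token, descending))
    return keys


def pick_duplicates(records, spec):
    """Yield one record per site, chosen by the sort spec.

    `records` is an iterable of dicts with 'chrom', 'pos', 'ref', 'alt'
    and an 'info' dict (a hypothetical layout, not a bcftools structure).
    Ties keep the record that appears first in the input.
    """
    keys = parse_sort_spec(spec)

    def site(rec):
        return (rec["chrom"], rec["pos"], rec["ref"], rec["alt"])

    def rank(rec):
        out = []
        for field, descending in keys:
            value = rec["info"].get(field)
            if value is None:
                out.append(float("inf"))  # assumption: missing values rank last
            else:
                value = float(value)
                out.append(-value if descending else value)
        return tuple(out)

    for _, group in groupby(records, key=site):
        yield min(group, key=rank)


records = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G", "info": {"AF": 0.1, "AC": 5}},
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G", "info": {"AF": 0.3, "AC": 2}},
    {"chrom": "chr1", "pos": 200, "ref": "C", "alt": "T", "info": {"AF": 0.2, "AC": 4}},
]
kept = list(pick_duplicates(records, "-AF,-AC"))
```

With the sample records above, the chr1:100 site keeps the AF=0.3 record and the unique chr1:200 record passes through unchanged.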

Merge functionality with duplicates

Merge has --info-rules, which works on the same variant across different files. It would be nice to be able to apply this to the same variant within a single file. For instance, norm --rm-dup --info-rules=BCFTOOLS_OLD_VARIANT:join would have allowed a workaround for this issue.
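A rough Python sketch of what applying per-tag rules to duplicates within one file could look like. This only illustrates the requested behaviour and is not how bcftools implements --info-rules. The rule names mirror a few of the merge methods (join, sum, min, max), while the function name, the record layout, and the sample values are assumptions. Tags without a rule keep the value from the first duplicate.

```python
from itertools import groupby

# Illustrative subset of rule names; each maps a list of values to one value.
RULES = {
    "join": lambda values: ",".join(str(v) for v in values),
    "sum": sum,
    "min": min,
    "max": max,
}


def merge_duplicates(records, info_rules):
    """Collapse adjacent duplicate records, combining INFO tags by rule.

    `records` is an iterable of dicts with 'chrom', 'pos', 'ref', 'alt'
    and an 'info' dict (a hypothetical layout); `info_rules` maps an INFO
    tag to a rule name, e.g. {"BCFTOOLS_OLD_VARIANT": "join"}.
    """

    def site(rec):
        return (rec["chrom"], rec["pos"], rec["ref"], rec["alt"])

    for _, group in groupby(records, key=site):
        group = list(group)
        # Start from a copy of the first record so the input is not mutated.
        merged = dict(group[0], info=dict(group[0]["info"]))
        if len(group) > 1:
            for tag, rule in info_rules.items():
                values = [r["info"][tag] for r in group if tag in r["info"]]
                if values:
                    merged["info"][tag] = RULES[rule](values)
        yield merged


records = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G",
     "info": {"BCFTOOLS_OLD_VARIANT": "chr1|150|A|G", "AC": 5}},
    {"chrom": "chr1", "pos": 100, "ref": "A", "alt": "G",
     "info": {"BCFTOOLS_OLD_VARIANT": "chr1|175|A|G", "AC": 2}},
    {"chrom": "chr1", "pos": 200, "ref": "C", "alt": "T",
     "info": {"BCFTOOLS_OLD_VARIANT": "chr1|260|C|T", "AC": 4}},
]
merged = list(merge_duplicates(records, {"BCFTOOLS_OLD_VARIANT": "join", "AC": "sum"}))
```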

Mark duplicates

Another way to solve this would be to mark duplicates rather than remove them, for instance with a DUPLICATE flag.

Then I could select them out into a separate file and:

  • Use existing merge to bring them back with --info-rules
  • Process this much smaller file in Python however I want, then merge the result back (much quicker than processing ~100G of compressed VCF in Python)
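For illustration, marking could behave roughly like the following Python sketch, which works on uncompressed, tab-delimited VCF text. The DUPLICATE flag is the name proposed above. The header description, the function names, and the assumption that duplicates are adjacent (sorted input) are mine, and nothing here reflects an existing bcftools option.

```python
# Assumed header line for the proposed flag; the description text is made up.
DUP_HEADER = (
    '##INFO=<ID=DUPLICATE,Number=0,Type=Flag,'
    'Description="Site appears more than once in this file">'
)


def site_key(line):
    """Return (CHROM, POS, REF, ALT) for a tab-delimited VCF record line."""
    fields = line.split("\t")
    return (fields[0], fields[1], fields[3], fields[4])


def add_flag(line):
    """Append the DUPLICATE flag to the INFO column (column 8)."""
    fields = line.split("\t")
    fields[7] = "DUPLICATE" if fields[7] == "." else fields[7] + ";DUPLICATE"
    return "\t".join(fields)


def flush(group):
    """Emit a buffered run of same-site records, flagged if more than one."""
    for line in group:
        yield add_flag(line) if len(group) > 1 else line


def mark_duplicates(lines):
    """Yield VCF lines with DUPLICATE added to every adjacent duplicate."""
    group, group_key = [], None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("#"):
            if line.startswith("#CHROM"):
                yield DUP_HEADER  # declare the flag just before the column header
            yield line
            continue
        key = site_key(line)
        if key != group_key:
            yield from flush(group)
            group, group_key = [], key
        group.append(line)
    yield from flush(group)


vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\t.\tAF=0.1",
    "chr1\t100\t.\tA\tG\t.\t.\tAF=0.3",
    "chr1\t200\t.\tC\tT\t.\t.\t.",
]
marked = list(mark_duplicates(vcf))
```

Because only one run of same-site records is buffered at a time, this streams through the file without loading it all into memory.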

davmlaw, Dec 18 '24 04:12

I am wondering whether the +mark-overlaps plugin would be a good place to add some new functionality to help with this.

pd3, Jan 07 '25 14:01