bcftools icon indicating copy to clipboard operation
bcftools copied to clipboard

Using COUNT with bcftools filter across all samples

Open mhoban opened this issue 11 months ago • 2 comments

Is there a way to use bcftools filter with COUNT to return only positions with a certain number of values for a given FORMAT entry across all samples?

For example, I'd like to filter out positions where the number of all AD values is exactly two. I can do this for one sample, like this:

bcftools filter -i 'COUNT(AD[0:*]) == 2' input.vcf.gz > filtered.vcf

Is there a way to do this for all samples? Or do I have to enter each position manually? I've tried various permutations of the indexing syntax but it doesn't quite seem to do what I need.

thanks!

mhoban avatar Feb 01 '25 23:02 mhoban

I think that should be possible using the reverse logic of -e, for example

bcftools filter -e 'COUNT(AD[0:*]) != 2'

pd3 avatar Feb 20 '25 14:02 pd3

This issue was closed so I'm not sure if my comment will be seen, but your suggestion doesn't work for me. Using the syntax you gave, it only filters against the first sample (sample 0).

I'm looking for a filter statement that will match the condition against ALL samples.

Consider the following, in which I run your suggested filter, query for REF, ALT, and AD values, then grep for cases where COUNT(AD) is greater than 2:

bcftools filter -e 'COUNT(AD[0:*]) != 2' test.vcf.gz | bcftools query -f '%REF,%ALT[\t%AD]' | ggrep -E '[0-9]+,[0-9]+,[0-9]+' | head -5

Which results in this (shown as a table to make it easier to read):

REF/ALT sample1_AD sample2_AD sample3_AD sample4_AD sample5_AD sample6_AD sample7_AD sample8_AD ample9_AD sample10_AD
G,A 2,0 4,1 9,0 1,0 1,0,0 9,0 4,0 4,2 6,0 1,0
G,A 4,0 2,0 2,0 1,0,0,0,0,0 7,6 4,0 2,0 4,0 4,0 10,4
A,G 4,0 5,0 4,0 4,0 8,0 4,0 10,4 10,0 10,0 6,0,4
C,T 4,0 6,0 4,0 1,0,0 6,6 7,0 2,0 3,2 2,2 2,2
T,C 2,0 0,2 0,1 6,0 1,0,0 8,1 1,0 8,0 1,2 0,1

You can see that for each of these rows, there's at least one sample that has more than two AD values (but it's never sample1). This makes sense, because the syntax of the filter expression you suggested only checks the first sample.

My ultimate goal is to retain sites where ALL samples have the same number of AD values as the total number of REF and ALT alleles, something like this (which doesn't work as written, but seems like it should given the expression syntax documentation):

bcftools filter -i 'COUNT(AD[*:*]) == (COUNT(REF) + COUNT(ALT))'

It's easy enough for me to generate something like this, which checks each sample individually:

bcftools filter -i '(COUNT(AD[0:*]) == (COUNT(REF) + COUNT(ALT))) & (COUNT(AD[1:*]) == (COUNT(REF) + COUNT(ALT))) & (COUNT(AD[2:*]) == (COUNT(REF) + COUNT(ALT))) & ...'

But I was hoping there was a quicker and more elegant way to do it.

mhoban avatar Feb 20 '25 23:02 mhoban