vawk
error with dbNSFP_GERP++ info fields
vawk '{print I$dbNSFP_GERP++_RS}' input vcf
prints only 0
example line from vcf to test:
11 47353646 . C T . . dbNSFP_GERP++_RS=4.33;dbNSFP_GERP++_RS_rankscore=0.52;dbNSFP_phyloP46way_primate=-0.41;
I think this is expected awk behavior, and I don't see an easy way to fix it other than renaming fields containing ++. ++ is the auto-increment operator, and there is a distinction between pre- and post-increment. In this case I think I$dbNSFP_GERP is initialized with a 0 value, and since this is a post-increment it returns the initial 0 value. The only thing I'm not sure about is how the uninitialized _RS is handled, as this would normally result in concatenation, though these are integer values and not strings.
Thanks @mdshw5 for the comment. The error persists even when trying I$dbNSFP_GERP++_RS to overcome awk's natural behaviour.
The reason is that the "+" character is not allowed in info or sample tags. I've been thinking about how to overcome this restriction. I don't want to just add "+" to the legal characters because that would prohibit queries like vawk '{ print I$AF+1}'. However, I've run into this issue a few times already, and agree that it is annoying.
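To illustrate the trade-off, here is a minimal Python sketch. It assumes tags are pulled out of the awk program with a regular expression; both patterns below are hypothetical and may not match what vawk actually uses. With the strict character class the GERP tag is cut off at the first "+", while a class that also accepts "+" swallows the arithmetic in I$AF+1.

```python
import re

# Hypothetical tag patterns (assumption): vawk's real pattern may differ.
STRICT = re.compile(r'I\$([A-Za-z0-9_]+)')      # "+" is not a legal tag character
WITH_PLUS = re.compile(r'I\$([A-Za-z0-9_+]+)')  # "+" added to the legal characters


def tags(pattern, program):
    """Return the INFO tag names a pattern would extract from an awk program."""
    return pattern.findall(program)


gerp_query = '{ print I$dbNSFP_GERP++_RS }'
af_query = '{ print I$AF+1 }'

strict_gerp = tags(STRICT, gerp_query)   # tag is truncated before "++_RS"
strict_af = tags(STRICT, af_query)       # "AF" is found, "+1" stays arithmetic
plus_gerp = tags(WITH_PLUS, gerp_query)  # full tag is found
plus_af = tags(WITH_PLUS, af_query)      # "+1" is absorbed into the tag name

print(strict_gerp, strict_af, plus_gerp, plus_af)
```

Under the strict pattern the leftover ++_RS is handed to awk unchanged, which fits the post-increment explanation given earlier in the thread.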
- I guess one option is to scrap the current syntax and replace it with something like: vawk '{ print I["AF"]+1 }'. Then accessing sample genotype fields would look like vawk '{ print S["MYSAMP"]["GT"] }'. This allows more flexibility, and is also awk-like, since it uses associative arrays. I wrote it the way I did to save keystrokes, but I now sort of regret not using this syntax originally.
- Another option that preserves backwards compatibility would be to optionally allow double quotes around vawk info fields: vawk '{ print I$"AF"+1 }', and for your case: vawk '{ print I$"dbNSFP_GERP++_RS" }'
Do you guys have a preference? I'm sort of leaning toward option 1.
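As a rough sketch of why either proposal sidesteps the problem: once the tag is delimited by double quotes, it can contain "+" without being confused with arithmetic that follows it. The two patterns below are hypothetical illustrations of the proposed syntaxes, not code from vawk.

```python
import re

# Hypothetical patterns for the two proposed syntaxes (not taken from vawk).
BRACKET = re.compile(r'I\["([^"]+)"\]')  # option 1: I["TAG"]
QUOTED = re.compile(r'I\$"([^"]+)"')     # option 2: I$"TAG"


def find_tags(program):
    """Collect INFO tag names written in either proposed syntax."""
    return BRACKET.findall(program) + QUOTED.findall(program)


opt1_af = find_tags('{ print I["AF"]+1 }')
opt1_gerp = find_tags('{ print I["dbNSFP_GERP++_RS"] }')
opt2_af = find_tags('{ print I$"AF"+1 }')
opt2_gerp = find_tags('{ print I$"dbNSFP_GERP++_RS" }')

print(opt1_af, opt1_gerp, opt2_af, opt2_gerp)
```

In both forms the closing quote marks the end of the tag, so the +1 after AF is left for awk to evaluate.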
I think the first option is preferable.
Thanks for your responses. I too prefer option one.
Feature request: can you please add an option for "Flag" info fields, which have no value, just the tag?
For example:
##INFO=<ID=clinvar_G5,Number=0,Type=Flag,Description=">5% minor allele frequency in 1+ populations.">
##INFO=<ID=dbSNP_U3,Number=0,Type=Flag,Description="In 3' UTR Location is in an untranslated region (UTR). FxnCode = 53.">
##INFO=<ID=deCODE_singleton,Number=0,Type=Flag,Description="Seen in a single sample.">
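One possible way to handle such flags, sketched in Python under the assumption that the INFO column is split on ";" and then on "=": an entry with no "=" is stored as True, so a query can test for its presence. The function name parse_info and the sample INFO string are made up for illustration and are not part of vawk.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict; flags (no '=') map to True."""
    fields = {}
    for entry in info.split(';'):
        if not entry:
            continue  # skip the empty piece left by a trailing ';'
        key, sep, value = entry.partition('=')
        # sep is '' when the entry has no '=', i.e. it is a Flag
        fields[key] = value if sep else True
    return fields


# Hypothetical INFO string mixing a key=value field with two flags.
info = 'dbNSFP_GERP++_RS=4.33;clinvar_G5;dbSNP_U3;'
parsed = parse_info(info)

print(parsed['dbNSFP_GERP++_RS'])    # value is kept as a string
print(parsed.get('clinvar_G5', False))        # flag present
print(parsed.get('deCODE_singleton', False))  # flag absent
```

Values are left as strings here; converting them according to the Type declared in the header would be a separate step.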