
error with dbNSFP_GERP++ info fields

Open gpcr opened this issue 9 years ago • 5 comments

vawk '{print I$dbNSFP_GERP++_RS}' input vcf

prints only 0

example line from vcf to test:

11 47353646 . C T . . dbNSFP_GERP++_RS=4.33;dbNSFP_GERP++_RS_rankscore=0.52;dbNSFP_phyloP46way_primate=-0.41;

gpcr avatar Jul 31 '15 04:07 gpcr

I think this is expected awk behavior, and I don't see an easy way to fix it other than renaming fields that contain ++. ++ is the auto-increment operator, and there is a distinction between pre- and post-increment. In this case I think I$dbNSFP_GERP is initialized with a 0 value, and since this is a post-increment, it returns the initial 0 value. The only thing I'm not sure about is how the uninitialized _RS is handled, as this would normally result in concatenation, though these are integer values and not strings.

mdshw5 avatar Jul 31 '15 13:07 mdshw5
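To make the increment behaviour described above concrete, here is a minimal sketch in plain awk (not vawk, whose internal rewriting of `I$...` is not shown in this thread). The variable names `x` and `unset_var` are made up for illustration. It shows that post-incrementing an uninitialized variable yields 0, that pre-incrementing yields 1, and that concatenating an uninitialized variable appends an empty string.

```shell
# Post-increment returns the old value; an uninitialized awk variable is 0 in numeric context.
post=$(awk 'BEGIN { print x++ }')

# Pre-increment returns the new value.
pre=$(awk 'BEGIN { print ++x }')

# An uninitialized variable is the empty string in string context,
# so concatenating it onto 0 leaves "0".
concat=$(awk 'BEGIN { print 0 unset_var }')

echo "post=$post pre=$pre concat=$concat"
```

Whether this is exactly what happens inside vawk for `I$dbNSFP_GERP++_RS` is an assumption; the sketch only illustrates the awk rules being referred to.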

Thanks @mdshw5 for the comment. The error persists even when trying I$dbNSFP_GERP++_RS to overcome awk's natural behaviour.

gpcr avatar Aug 03 '15 19:08 gpcr

The reason is that the "+" character is not allowed in info or sample tags. I've been thinking about how to overcome this restriction. I don't want to just add "+" to the legal characters because that would prohibit queries like vawk '{ print I$AF+1}'. However, I've run into this issue a few times already, and agree that it is annoying.

  1. I guess one option is to scrap the current syntax and replace it with something like: vawk '{ print I["AF"]+1 }'. Then accessing sample genotype fields would look like vawk '{ print S["MYSAMP"]["GT"] }'. This allows more flexibility, and is also awk-like, since it uses associative arrays. I wrote it the way I did to save keystrokes, but I now sort of regret not using this syntax originally.
  2. Another option that preserves backwards compatibility would be to optionally allow double quotes around vawk info fields: vawk '{ print I$"AF"+1 }', and for your case: vawk '{ print I$"dbNSFP_GERP++_RS" }'

Do you guys have a preference? I'm sort of leaning toward option 1.

cc2qe avatar Aug 03 '15 19:08 cc2qe
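As a rough illustration of why the associative-array style in option 1 sidesteps the problem, the sketch below uses plain awk on the example record from this issue. It is not vawk syntax and not how vawk is implemented; the array name `I` simply mirrors the proposal, and the INFO column is assumed to be field 8 of a tab-separated record.

```shell
# The example record from this issue, tab-separated as in a real VCF.
record=$(printf '11\t47353646\t.\tC\tT\t.\t.\t%s' \
  'dbNSFP_GERP++_RS=4.33;dbNSFP_GERP++_RS_rankscore=0.52;dbNSFP_phyloP46way_primate=-0.41;')

gerp=$(printf '%s\n' "$record" | awk -F'\t' '
{
    split("", I)                      # clear entries left over from a previous record
    n = split($8, entries, ";")       # INFO column -> KEY=VALUE entries
    for (i = 1; i <= n; i++) {
        eq = index(entries[i], "=")
        if (eq > 0)
            I[substr(entries[i], 1, eq - 1)] = substr(entries[i], eq + 1)
    }
    # The tag is an ordinary string key, so "++" is never parsed as an operator.
    print I["dbNSFP_GERP++_RS"]
}')

echo "$gerp"   # 4.33
```

Because the tag lives inside a quoted string, an expression such as `I["AF"]+1` would still be unambiguous arithmetic on the looked-up value.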

I think the first option is preferable.

mdshw5 avatar Aug 03 '15 19:08 mdshw5

Thanks for your responses. I too prefer option one.

Feature request: can you please add an option for "Flag"ged info fields, which have no value, just the tag?

For example: ##INFO=<ID=clinvar_G5,Number=0,Type=Flag,Description=">5% minor allele frequency in 1+ populations.">

##INFO=<ID=dbSNP_U3,Number=0,Type=Flag,Description="In 3' UTR Location is in an untranslated region (UTR). FxnCode = 53.">

##INFO=<ID=deCODE_singleton,Number=0,Type=Flag,Description="Seen in a single sample.">

gpcr avatar Aug 03 '15 19:08 gpcr
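For reference, a Flag entry appears in the INFO column as a bare tag with no "=". The sketch below, again in plain awk rather than vawk, lists the flags present in a record. The record is hypothetical: it reuses the coordinates from the example above and adds `clinvar_G5` purely for illustration.

```shell
# Hypothetical record: one Flag tag (clinvar_G5) next to one KEY=VALUE tag.
record=$(printf '11\t47353646\t.\tC\tT\t.\t.\t%s' 'clinvar_G5;dbNSFP_GERP++_RS=4.33')

flags=$(printf '%s\n' "$record" | awk -F'\t' '
{
    n = split($8, entries, ";")
    for (i = 1; i <= n; i++) {
        if (entries[i] == "") continue          # skip empty entries from a trailing ";"
        # A Flag entry has no "=": the presence of the bare tag is the value.
        if (index(entries[i], "=") == 0) print entries[i]
    }
}')

echo "$flags"   # clinvar_G5
```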