vawk
python wrapper is slower than straight awk
time zcat Omni25_genotypes_2141_samples.b37.v2.vcf.gz | vawk --header '{ print $1,$2,$3,$4,$5,$6,$7,$8,$9,S$NA12878 }' | bgzip -c > NA12878.omni.vcf.gz
# real 19m43.893s
# user 21m27.138s
# sys 0m59.355s
# aside: outside python it's 25% faster
time zcat Omni25_genotypes_2141_samples.b37.v2.vcf.gz | awk 'BEGIN {FS="\t"; OFS="\t"; } {if ($0~"^#") {if ($0!~"^##") { for (i=10;i<=NF;++i) SAMPLE[$i]=i; }; print} else {split($9,fmt,":"); SAMPLE_NA12878_ALL=$SAMPLE["NA12878"]; print $1,$2,$3,$4,$5,$6,$7,$8,$9,SAMPLE_NA12878_ALL }} END {}' - | bgzip -c > NA12878.omni2.vcf.gz
# real 15m28.737s
# user 17m1.151s
# sys 0m15.266s
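To put a number on the gap, the two "real" timings above can be converted to seconds and compared. This is only a quick sketch; `to_seconds` is a helper name made up for this example, and it assumes the bash `time` output format shown above.

```shell
# Hypothetical helper: convert a bash `time` value such as 19m43.893s to seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

vawk_s=$(to_seconds 19m43.893s)   # "real" time through the vawk wrapper
awk_s=$(to_seconds 15m28.737s)    # "real" time with plain awk

awk -v v="$vawk_s" -v a="$awk_s" 'BEGIN {
    printf "vawk: %.1f s, awk: %.1f s\n", v, a
    # Fraction of wall-clock time saved by skipping the wrapper
    printf "wall-clock time saved: %.1f%%\n", (v - a) / v * 100
    # Equivalent gain expressed as throughput (records per second)
    printf "throughput gain: %.1f%%\n", (v / a - 1) * 100
}'
```

On these numbers the plain awk run takes about 21.6% less wall-clock time, which is roughly 27.5% higher throughput, so the "25% faster" figure sits between the two ways of counting.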
Thanks for creating an excellent tool. Is it possible to add a feature that affects all samples? For example: '{ if (I$AF>0.5) print $1,$2,$3,I$AN, $ALL_SAMPLES_section }'
PS: if a VCF has hundreds of samples, it is difficult to type every sample tag (the format column and all samples) in the command.
For example, could S$*$GT represent the GT of all samples?
Doesn't just plain "print" do that?
At the moment, " { print } | cut -f 10- " nearly accomplishes this, but I will add that feature for cleaner queries.
On Monday, October 20, 2014, Matt Shirley wrote:
Doesn't just plain "print" do that?
Good point.
Matt Shirley http://mattshirley.com/
On Oct 20, 2014, at 2:27 PM, Colby Chiang wrote:
at the moment, " { print } | cut -f 10- " nearly accomplishes this, but I will add that feature for cleaner queries.
This can be achieved well with regular awk and/or by piping to a post-processing step. Still, combining the selection of INFO fields with sub-fields of every sample (GT, etc.) directly through vawk would be beneficial. Since you are specializing vawk for VCFs, I requested this feature.
@mdshw5 plain "print" definitely does not achieve that; it needs post-processing.
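For anyone reading along, the kind of post-processing meant here could look roughly like the plain awk sketch below. It is not how vawk implements anything; `all_sample_gt` is a hypothetical name, and the sketch assumes a tab-delimited VCF with FORMAT in column 9 and samples from column 10 onward.

```shell
# Hypothetical helper (not part of vawk): print CHROM, POS and the GT
# sub-field of every sample column from an uncompressed VCF on stdin.
all_sample_gt() {
    awk 'BEGIN { FS = "\t"; OFS = "\t" }
    /^##/ { next }                       # skip meta-information lines
    /^#CHROM/ {                          # column header: keep the sample names
        line = $1 OFS $2
        for (i = 10; i <= NF; i++) line = line OFS $i
        print line
        next
    }
    {
        # Find the position of GT within the FORMAT column
        n = split($9, fmt, ":")
        gt = 0
        for (k = 1; k <= n; k++) if (fmt[k] == "GT") gt = k
        line = $1 OFS $2
        for (i = 10; i <= NF; i++) {
            m = split($i, val, ":")
            # Emit "." when the record has no GT entry for this sample
            line = line OFS ((gt > 0 && gt <= m) ? val[gt] : ".")
        }
        print line
    }'
}
```

It would be used as `zcat input.vcf.gz | all_sample_gt` (the file name is a placeholder). Having to write and maintain something like this for every query is the inconvenience the request is about.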
The feature described above has been added in commit https://github.com/cc2qe/vawk/commit/a36785d86484848c44b073c833769e828aacfa8c
@cc2qe: thanks, working great.