python wrapper is slower than straight awk

Open cc2qe opened this issue 11 years ago • 7 comments

time zcat Omni25_genotypes_2141_samples.b37.v2.vcf.gz | vawk --header '{ print $1,$2,$3,$4,$5,$6,$7,$8,$9,S$NA12878 }' | bgzip -c > NA12878.omni.vcf.gz
# real    19m43.893s
# user    21m27.138s
# sys     0m59.355s

# aside: outside python it's 25% faster
time zcat Omni25_genotypes_2141_samples.b37.v2.vcf.gz | awk 'BEGIN {FS=" "; OFS="\t"; } {if ($0~"^#") {if ($0!~"^##") { for (i=10;i<=NF;++i) SAMPLE[$i]=i; }; print} else {split($9,fmt,":"); SAMPLE_NA12878_ALL=$SAMPLE["NA12878"];  print $1,$2,$3,$4,$5,$6,$7,$8,$9,SAMPLE_NA12878_ALL }} END {}' - | bgzip -c > NA12878.omni2.vcf.gz
# real    15m28.737s
# user    17m1.151s
# sys     0m15.266s
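
The "25% faster" figure can be checked against the two `real` times reported above. The following is only a rough sketch of that arithmetic; `to_seconds` is a helper name made up for this illustration and is not part of vawk.

```shell
# Convert a `time`-style duration such as 19m43.893s into seconds.
# (to_seconds is a hypothetical helper, used only for this check.)
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

vawk_s=$(to_seconds 19m43.893s)   # vawk wall-clock time reported above
awk_s=$(to_seconds 15m28.737s)    # plain awk wall-clock time reported above

# Report the reduction in wall-clock time and the speedup ratio.
awk -v a="$vawk_s" -v b="$awk_s" 'BEGIN {
    printf "time saved: %.1f%%\n", (a - b) / a * 100
    printf "speedup:    %.2fx\n", a / b
}'
```

With these numbers, plain awk takes about 21.6% less wall-clock time, a speedup of roughly 1.27x, so "25%" is a fair approximation either way.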

cc2qe avatar Aug 26 '14 16:08 cc2qe

Thanks for creating an excellent tool. Is it possible to add a feature that affects all samples? For example: '{ if (I$AF>0.5) print $1,$2,$3,I$AN, $ALL_SAMPLES_section }'

PS: If a VCF has hundreds of samples, it's difficult to enter every sample tag (the format column and all samples) in the command.
For example, could S$*$GT represent the GT of all samples?

gpcr avatar Oct 20 '14 16:10 gpcr

Doesn't just plain "print" do that?

mdshw5 avatar Oct 20 '14 18:10 mdshw5

at the moment, " { print } | cut -f 10- " nearly accomplishes this, but I will add that feature for cleaner queries.

cc2qe avatar Oct 20 '14 18:10 cc2qe

Good point.

mdshw5 avatar Oct 20 '14 19:10 mdshw5

This can certainly be achieved with regular awk and/or by piping to a post-processing step. But selecting INFO fields together with subfields of every sample's information (GT etc.) directly through vawk would be beneficial. Since you are specialising vawk for VCFs, I requested this feature.

@mdshw5 plain "print" definitely does not achieve that; it needs post-processing.
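
To make the post-processing concrete, here is a minimal sketch in plain awk that prints the GT subfield of every sample. It is not how vawk works internally and not the syntax added later in this thread. It assumes tab-delimited VCF text on stdin with the FORMAT column in field 9. The function name `all_gt` is made up for illustration.

```shell
# Print CHROM, POS and the GT subfield of every sample from VCF text on stdin.
# Assumes tab-delimited records and a FORMAT column (field 9) that lists GT.
# (all_gt is a hypothetical name, not a vawk feature.)
all_gt() {
    awk 'BEGIN { FS = OFS = "\t" }
    /^##/ { next }                              # skip meta-information lines
    /^#/  { out = $1 OFS $2                     # header: keep sample names
            for (i = 10; i <= NF; i++) out = out OFS $i
            print out; next }
    {
        n = split($9, fmt, ":")                 # locate GT within FORMAT
        gt = 0
        for (k = 1; k <= n; k++) if (fmt[k] == "GT") gt = k
        out = $1 OFS $2
        for (i = 10; i <= NF; i++) {
            split($i, val, ":")                 # subfields of this sample
            out = out OFS (gt ? val[gt] : ".")  # "." when FORMAT has no GT
        }
        print out
    }'
}
```

Usage would look like `zcat input.vcf.gz | all_gt`, where `input.vcf.gz` is a placeholder file name. Having to write this much awk for one subfield is the case for a built-in all-samples selector.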

gpcr avatar Oct 21 '14 14:10 gpcr

The feature described above has been added in commit https://github.com/cc2qe/vawk/commit/a36785d86484848c44b073c833769e828aacfa8c

cc2qe avatar Oct 23 '14 19:10 cc2qe

@cc2qe: thanks, working great.

gpcr avatar Nov 12 '14 20:11 gpcr