varianttools
Size of genotype database
Here I compare the size of the VCF input data with that of the generated genotype databases:
[GW] ll *.vcf.gz
-rw-rw-r-- 1 gaow gaow 977K Mar 22 11:27 YRI.exon.2010_03.genotypes.vcf.gz
-rw-rw-r-- 1 gaow gaow 593K Mar 22 11:27 CEU.exon.2010_03.genotypes.vcf.gz
[GW] ll *.h5
-rw-rw-r-- 1 gaow gaow 3.1M Mar 22 11:38 tmp_1_90_genotypes.h5
-rw-rw-r-- 1 gaow gaow 5.4M Mar 22 11:38 tmp_91_202_genotypes.h5
-rw-rw-r-- 1 gaow gaow 11M Mar 22 11:39 tmp_1_90_genotypes_multi_genes.h5
I think the genotype databases are unreasonably large relative to the input ... aren't they?
BTW this is the result of running this notebook:
https://github.com/gaow/ismb-2018/blob/dev/VAT-ISMB-2018.ipynb
BTW if I revert to the previous version, the database size is:
-rw-r--r-- 1 student student 9.8M Mar 22 18:05 demo_genotype.DB
which is roughly 3.1M + 5.4M? I am not sure what accounts for the multi_genes.h5 file.
Ideas to reduce the size of the genotype file include 1) a better compression method in HDF5 and 2) using numpy.float16 for genotype storage (better not to use int, in case there is imputed data).
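A rough sketch of what the two ideas might buy, under stated assumptions: the genotype matrix below is made up (its shape and call frequencies are not taken from the YRI/CEU data), and zlib's deflate is used as a stand-in for HDF5's gzip filter instead of writing an actual HDF5 file, so real numbers will differ. It also checks that float16 keeps hard calls and missing values intact and stores a fractional dosage with only a small rounding error.

```python
import zlib
import numpy as np

# Hypothetical genotype matrix: 1000 variants x 90 samples, mostly 0 with
# some 1 and 2 calls and NaN for missing. Frequencies are illustrative only.
rng = np.random.default_rng(0)
calls = rng.choice([0.0, 1.0, 2.0, np.nan], size=(1000, 90),
                   p=[0.85, 0.09, 0.04, 0.02])

as_f64 = calls.astype(np.float64)
as_f16 = calls.astype(np.float16)


def deflate_size(arr, level=9):
    """Bytes needed after deflate compression of the raw array buffer."""
    return len(zlib.compress(arr.tobytes(), level))


sizes = {
    "float64 raw": as_f64.nbytes,
    "float16 raw": as_f16.nbytes,
    "float64 deflate": deflate_size(as_f64),
    "float16 deflate": deflate_size(as_f16),
}
for label, nbytes in sizes.items():
    print(f"{label:>16}: {nbytes} bytes")

# Hard calls (0, 1, 2) and NaN survive the float16 round trip unchanged.
lossless = np.allclose(as_f16.astype(np.float64), as_f64, equal_nan=True)

# An imputed dosage such as 1.3 is rounded, but only slightly.
dosage_error = abs(float(np.float16(1.3)) - 1.3)
print("hard calls preserved:", lossless)
print("rounding error for dosage 1.3:", dosage_error)
```

An integer dtype would be just as compact for hard calls but could not hold a dosage like 1.3 at all, which is the reason for preferring float16 here.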