
Size of genotype database

Open gaow opened this issue 5 years ago • 2 comments

Here I compare the size of the VCF input data with the size of the generated genotype database:

[GW] ll *.vcf.gz
-rw-rw-r-- 1 gaow gaow 977K Mar 22 11:27 YRI.exon.2010_03.genotypes.vcf.gz
-rw-rw-r-- 1 gaow gaow 593K Mar 22 11:27 CEU.exon.2010_03.genotypes.vcf.gz
[GW] ll *.h5
-rw-rw-r-- 1 gaow gaow 3.1M Mar 22 11:38 tmp_1_90_genotypes.h5
-rw-rw-r-- 1 gaow gaow 5.4M Mar 22 11:38 tmp_91_202_genotypes.h5
-rw-rw-r-- 1 gaow gaow  11M Mar 22 11:39 tmp_1_90_genotypes_multi_genes.h5

I think the genotype data is unreasonably large ... isn't it?
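To put a number on it, here is a quick back-of-the-envelope calculation using only the sizes in the listing above. It assumes the `K` and `M` suffixes printed by `ll` are 1024-based, and that the two per-sample-range `.h5` files together cover the same data as the two VCF files (that mapping is my assumption, not something checked in the notebook).

```python
# Sizes copied from the `ll` output above, converted to KiB.
# Assumption: K and M are 1024-based units as printed by `ls -lh`.
vcf_kib = 977 + 593                      # YRI + CEU compressed VCF files
h5_split_kib = (3.1 + 5.4) * 1024        # tmp_1_90 + tmp_91_202 genotype files
h5_all_kib = h5_split_kib + 11 * 1024    # ... plus the multi_genes file

ratio_split = h5_split_kib / vcf_kib
ratio_all = h5_all_kib / vcf_kib

print(f"VCF input:            {vcf_kib} KiB")
print(f"HDF5 (two files):     {h5_split_kib:.0f} KiB, {ratio_split:.1f}x the input")
print(f"HDF5 (incl. multi):   {h5_all_kib:.0f} KiB, {ratio_all:.1f}x the input")
```

So the two main HDF5 files are already several times the size of the gzipped VCF input, and more again once the multi_genes file is counted.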

BTW this is the result of running this notebook:

https://github.com/gaow/ismb-2018/blob/dev/VAT-ISMB-2018.ipynb

gaow avatar Mar 22 '19 17:03 gaow

BTW if I revert to the previous version, the database size is:

-rw-r--r-- 1 student student 9.8M Mar 22 18:05 demo_genotype.DB

which is roughly 3.1M + 5.4M? Not sure what is going on with the multi_genes.h5 file.

gaow avatar Mar 22 '19 18:03 gaow

Ideas to reduce the size of the genotype file include 1) a better compression method in HDF5 and 2) using numpy.float16 for genotype storage (better not to use int, in case there is imputed data).
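A rough sketch of what both ideas might buy, on synthetic data rather than the real genotype files. The matrix shape, genotype frequencies, missing rate, and the helper `deflated_size` are all made up for illustration. `zlib` is used here as a stand-in for the HDF5 gzip (deflate) filter, so the numbers only indicate the trend and will not match actual HDF5 file sizes, which also depend on chunking.

```python
import zlib

import numpy as np

rng = np.random.default_rng(0)
n_variants, n_samples = 2000, 200  # hypothetical dimensions

# Synthetic hard calls (0/1/2), mostly reference, with ~2% missing stored as NaN.
geno = rng.choice([0.0, 1.0, 2.0], size=(n_variants, n_samples), p=[0.90, 0.08, 0.02])
geno[rng.random(geno.shape) < 0.02] = np.nan

as_f64 = geno.astype(np.float64)
as_f16 = geno.astype(np.float16)

# Hard calls and NaN survive the float16 round trip exactly.
roundtrip_ok = np.array_equal(as_f16.astype(np.float64), as_f64, equal_nan=True)

# Imputed dosages in [0, 2] are only approximated by float16.
dosage = np.array([0.013, 0.5, 1.237, 1.999])
dosage_err = np.abs(dosage.astype(np.float16).astype(np.float64) - dosage).max()


def deflated_size(arr, level):
    """Bytes after deflate at the given level (proxy for an HDF5 gzip filter)."""
    return len(zlib.compress(arr.tobytes(), level))


sizes = {
    "float64 raw": as_f64.nbytes,
    "float16 raw": as_f16.nbytes,
    "float64 deflate-9": deflated_size(as_f64, 9),
    "float16 deflate-1": deflated_size(as_f16, 1),
    "float16 deflate-9": deflated_size(as_f16, 9),
}
for name, size in sizes.items():
    print(f"{name:>18}: {size:>9} bytes")
print("hard calls exact in float16:", roundtrip_ok)
print("max dosage rounding error:", dosage_err)
```

With h5py, for example, the corresponding knobs would be `dtype="float16"` together with `compression="gzip"` and `compression_opts` on `create_dataset`; which binding and filter varianttools should actually use is left open here.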

gaow avatar Oct 28 '19 16:10 gaow