varianttools
Genotype annotations
A typical genotype entry looks like:
0/0:43,0:43:92:0,92,1267
The first part, 0/0, is the actual genotype; the others are genotype annotations. In our current implementation (vat 2.0 hereafter) we import GT by default and the other fields optionally. We do import everything, because we want to be able to create filters when performing quality control or calculating summary statistics.
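To make the split between genotype and annotations concrete, here is a minimal sketch that parses the entry above. The field names in `ASSUMED_FORMAT` are an assumption (a common GATK-style layout); the real names come from the FORMAT column of the VCF line, and `parse_genotype_entry` is a hypothetical helper, not part of vtools.

```python
# Assumption: the entry follows the FORMAT string GT:AD:DP:GQ:PL.
# In a real VCF the field names are read from the FORMAT column.
ASSUMED_FORMAT = "GT:AD:DP:GQ:PL"


def parse_genotype_entry(entry, fmt=ASSUMED_FORMAT):
    """Split one per-sample entry into a {field name: raw string} dict."""
    keys = fmt.split(":")
    values = entry.split(":")
    return dict(zip(keys, values))


fields = parse_genotype_entry("0/0:43,0:43:92:0,92,1267")

# The genotype itself, and everything else as genotype annotations.
genotype = fields["GT"]
annotations = {k: v for k, v in fields.items() if k != "GT"}
```

Values are kept as raw strings here; converting DP or GQ to integers would be a separate step.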
However, in many scenarios the genotype data have already been QC-ed. We may also start from un-QC-ed genotype data, yet after QC we no longer need the other genotype information. That is when we may want to create new projects that keep only the GT info.
Can we make each field in genotype data a separate data matrix? For example we have a project that looks like:
project.variants
project.GT
project.DP
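As a rough illustration of the layout above, the sketch below splits per-sample entries into one variants-by-samples matrix per field. It is only a toy: `split_into_field_matrices` is a hypothetical name, the input values are made up, and it assumes every entry carries every field listed in the format string.

```python
def split_into_field_matrices(fmt, rows):
    """Turn rows of 'a:b:...' entries into one matrix per FORMAT field.

    fmt  -- colon-separated field names, e.g. "GT:DP"
    rows -- one list per variant, one entry string per sample
    Assumes no entry has trailing fields dropped.
    """
    keys = fmt.split(":")
    matrices = {key: [] for key in keys}
    for row in rows:
        parsed = [entry.split(":") for entry in row]
        for i, key in enumerate(keys):
            matrices[key].append([p[i] for p in parsed])
    return matrices


# Toy data: 2 variants x 2 samples, fields GT and DP only.
rows = [
    ["0/0:43", "0/1:17"],
    ["1/1:8", "0/0:30"],
]
project = split_into_field_matrices("GT:DP", rows)
# project["GT"] and project["DP"] play the role of project.GT and project.DP
```

With this layout, dropping a field after QC amounts to discarding one matrix and leaving the GT matrix untouched.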
And our filtering would be
vtools samples <various geno_info based filtering> -t project.gmask
vtools select project.gmask project.GT ..
where gmask is a sparse matrix of zeros and ones. Zero means the entry is to be excluded, and one means it is to be included, when computing other statistics.
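The masking idea can be sketched in plain Python as follows. The mask is stored sparsely as the set of (variant, sample) coordinates whose value is one, so every other entry is an implicit zero. The depth threshold `min_dp`, the toy data, and both function names are assumptions made for illustration; none of this is existing vtools behaviour.

```python
def build_gmask(dp, min_dp):
    """Return the (variant, sample) coordinates whose mask value is one.

    Entries not in the returned set are implicit zeros, i.e. excluded.
    """
    return {
        (i, j)
        for i, row in enumerate(dp)
        for j, depth in enumerate(row)
        if depth >= min_dp
    }


def alt_allele_counts(gt, gmask):
    """Count non-reference alleles per variant, using only masked-in entries."""
    counts = []
    for i, row in enumerate(gt):
        total = 0
        for j, call in enumerate(row):
            if (i, j) in gmask:
                alleles = call.replace("|", "/").split("/")
                total += sum(1 for a in alleles if a not in ("0", "."))
        counts.append(total)
    return counts


# Toy data: 2 variants x 3 samples.
gt = [["0/0", "0/1", "1/1"], ["0/1", "0/0", "1/1"]]
dp = [[43, 5, 30], [12, 40, 3]]

gmask = build_gmask(dp, min_dp=10)   # hypothetical depth filter
counts = alt_allele_counts(gt, gmask)
```

Because the mask lives apart from the GT matrix, a different filter only requires rebuilding the mask, not re-importing genotypes.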
BTW, genotype information is important for QC, but people usually just do QC first and stick to the VCF file after QC. I have a standard gzipped VCF file that is 130GB including all genotype info, for WGS of 650 samples. But after QC and keeping only GT, the zipped file size becomes 2.7GB. I can imagine it may be even smaller if we use a sparse matrix for it.