sgn
Genotype upload verification tool
Expected Behavior
After a genotype upload, we need to verify that everything is correct in the database, either per nd_protocol or per uploaded nd_file.
The tool should produce summary statistics and run spot checks.
Possibly also a graphical viewer?
I looked at three tools that give statistics for VCF files.
- vcftools - only a very basic summary
- bcftools - a little better, but still not much detail
- rtg-tools - gives statistics for each accession; requires chromosome and position for each entry of the VCF file

Would it be possible to add a feature to the download page so that you can get output in VCF format? Then you could compare it to the input using rtg-tools or another method.
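The comparison suggested above could also be done without an external tool. A minimal sketch in Python, assuming the export is a tab-delimited VCF with a `GT` entry in FORMAT; the function names `read_genotypes` and `diff_genotypes` are hypothetical and not part of sgn:

```python
def read_genotypes(vcf_text):
    """Map (chrom, pos, sample) -> GT string for every call in a VCF."""
    samples = []
    calls = {}
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos, fmt = fields[0], fields[1], fields[8]
        gt_index = fmt.split(":").index("GT")
        for sample, entry in zip(samples, fields[9:]):
            gt = entry.split(":")[gt_index]
            # Assumption: phasing is ignored, so "0|1" and "0/1" compare equal.
            calls[(chrom, int(pos), sample)] = gt.replace("|", "/")
    return calls


def diff_genotypes(uploaded_text, exported_text):
    """Return (key, uploaded_gt, exported_gt) for every call that differs."""
    uploaded = read_genotypes(uploaded_text)
    exported = read_genotypes(exported_text)
    mismatches = []
    for key in sorted(uploaded.keys() | exported.keys()):
        a, b = uploaded.get(key), exported.get(key)
        if a != b:  # None on one side means the call is absent there
            mismatches.append((key, a, b))
    return mismatches
```

An empty result would mean every call in the uploaded file came back unchanged. Allele order is not normalized here, so `1/0` and `0/1` would be reported as a mismatch.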
@ClayBirkett the file is archived in the server's filesystem, and the genotype data is also in the database. vcftools could be run on the VCF from a request on the website, but that adds vcftools as a dependency of the website. What kind of summary stats would vcftools give that we could write ourselves in Perl by querying the genotype data from the database?
@ClayBirkett @nickmorales summary stats on a VCF can be as varied as the INFO fields provided in the input VCF. As a basic check, we may want to focus on:
- missing data per individual
- missing data per marker
- ability to filter on minor allele frequency
- ability to filter on bi-allelic markers (vs. multi-allelic)
As these filters/stats are pretty simple, and for code base "sustainability", it might be better to have this code natively rather than in dependencies (although vcftools and bcftools are C/Perl libraries). @ClayBirkett, plink is another great library for these activities.
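To show how little code the basic checks need, here is a rough sketch in Python (the same logic should port to Perl). It reads VCF text rather than querying the database, and the function name `summarize` and the returned keys are hypothetical. Two assumptions: a call with any missing allele (e.g. `./1`) counts as missing, and minor allele frequency is taken as the frequency of the second most common observed allele, which matches the usual definition for bi-allelic markers only.

```python
import re
from collections import Counter


def summarize(vcf_text):
    """Return (per-marker stats, per-sample missing rate) for a VCF."""
    samples = []
    missing_by_sample = Counter()
    markers = []
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]
            continue
        alts = [] if fields[4] == "." else fields[4].split(",")
        gt_index = fields[8].split(":").index("GT")
        allele_counts = Counter()
        n_missing = 0
        for sample, entry in zip(samples, fields[9:]):
            alleles = re.split(r"[/|]", entry.split(":")[gt_index])
            if "." in alleles:
                n_missing += 1
                missing_by_sample[sample] += 1
            else:
                allele_counts.update(alleles)
        total = sum(allele_counts.values())
        ranked = [count for _, count in allele_counts.most_common()]
        # Second most common allele; 0.0 if the marker is monomorphic or empty.
        maf = ranked[1] / total if len(ranked) > 1 else 0.0
        markers.append({
            "id": f"{fields[0]}:{fields[1]}",
            "missing_rate": n_missing / len(samples) if samples else 0.0,
            "maf": maf,
            "biallelic": len(alts) == 1,
        })
    n_markers = len(markers)
    sample_missing = {
        s: (missing_by_sample[s] / n_markers if n_markers else 0.0)
        for s in samples
    }
    return markers, sample_missing
```

Filtering then becomes a list comprehension, for example `[m for m in markers if m["biallelic"] and m["maf"] >= 0.05]`, where the 0.05 threshold is only an illustration.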
Other fields of interest, but a bit more advanced in terms of computation and which would probably require vcftools/bcftools to be added to the dependencies:
- allele depth
- Hardy-Weinberg filtering
- linkage disequilibrium pruning
vcftools and bcftools can do many other types of filtering, but some of them may simply crash if the VCF input format and version are not properly checked at the upload step.
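A format and version check at upload could look roughly like the sketch below. It relies on two rules from the VCF specification: the first line declares `##fileformat=VCFvX.Y`, and the `#CHROM` header line starts with eight fixed columns, followed by `FORMAT` and sample names when genotypes are present. The function name `check_vcf_header` and the set of accepted versions are assumptions, not something sgn defines.

```python
import re

REQUIRED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]
SUPPORTED_VERSIONS = {"4.1", "4.2", "4.3"}  # assumption: adjust to what the site accepts


def check_vcf_header(vcf_text):
    """Return a list of problems found in the header; empty means it passed."""
    errors = []
    lines = vcf_text.splitlines()
    first = lines[0].strip() if lines else ""
    match = re.fullmatch(r"##fileformat=VCFv(\d+\.\d+)", first)
    if match is None:
        errors.append("first line is not a ##fileformat=VCFvX.Y declaration")
    elif match.group(1) not in SUPPORTED_VERSIONS:
        errors.append(f"unsupported VCF version {match.group(1)}")

    header = next((l for l in lines if l.startswith("#CHROM")), None)
    if header is None:
        errors.append("missing #CHROM header line")
    else:
        columns = header.split("\t")
        if columns[:8] != REQUIRED_COLUMNS:
            errors.append("fixed columns are missing, misnamed, or not tab-delimited")
        elif len(columns) > 8 and columns[8] != "FORMAT":
            errors.append("ninth column must be FORMAT when genotypes are present")
        elif len(columns) == 9:
            errors.append("FORMAT column present but no sample columns")
        elif len(set(columns[9:])) != len(columns[9:]):
            errors.append("duplicate sample names")
    return errors
```

Rejecting the upload when the list is non-empty would keep malformed files away from any downstream filtering step.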