open-cravat
Numeric values in VCF file are not parsed properly
Uploading VCFs to OpenCRAVAT seems to result in incorrect parsing of the numeric values (they are likely parsed as strings), which hinders filtering.

The header of the VCF file is attached:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=RefCall,Description="Genotyping model thinks this site is reference.">
##FILTER=<ID=LowQual,Description="Confidence in this variant being real is below calling threshold.">
##FILTER=<ID=NoCall,Description="Site has depth=0 resulting in no call.">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position (for use with symbolic alleles)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Conditional genotype quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block.">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Read depth for each allele">
##FORMAT=<ID=VAF,Number=A,Type=Float,Description="Variant allele fractions.">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled genotype likelihoods rounded to the closest integer">
##FORMAT=<ID=MED_DP,Number=1,Type=Integer,Description="Median DP observed within the GVCF block rounded to the nearest integer.">
Upd: this seems to be an issue in .sqlite generation. Changing the column types within the sqlite as follows:
variant / vcfinfo__phred "text" -> "real"
variant / vcfinfo__alt_reads "text" -> "integer"
variant / vcfinfo__tot_reads "text" -> "integer"
variant / vcfinfo__af "text" -> "real"
and correcting the "type" values in the respective dictionaries in the "variant_header" table corrects the issue. The respective change should be implemented in the generating code.
Upd2: this recent pull request generally solves the issue https://github.com/KarchinLab/open-cravat-modules-karchinlab/pull/11
Hi bogdanovvp. Thanks a lot for the digging here, and the PR.
Unfortunately, the changes won't work for some jobs. For variants found in more than one sample, those columns are semicolon-delimited lists, so they have to be stored as strings. We are currently planning work on better sample/cohort filtering.
For example, consider a variant in two samples: s1, and s2. The base__sample_id column will be s1;s2, and vcfinfo__alt_reads will be something like 15;28.
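To make the two-sample case concrete, here is a minimal Python sketch of how such a pair of delimited strings could be split back into per-sample numbers. The helper name split_per_sample is hypothetical and is not part of OpenCRAVAT; it assumes both strings list the samples in the same order.

```python
def split_per_sample(sample_ids, values, cast=int):
    """Pair each sample ID with its value from two ';'-delimited strings.

    Hypothetical helper for illustration; assumes both strings
    list the samples in the same order.
    """
    ids = sample_ids.split(";")
    vals = values.split(";")
    if len(ids) != len(vals):
        raise ValueError("sample and value lists differ in length")
    # Empty entries are mapped to None instead of being cast
    return {s: (cast(v) if v != "" else None) for s, v in zip(ids, vals)}


per_sample = split_per_sample("s1;s2", "15;28")
print(per_sample)

# Comparing the raw text misorders numbers, which is why filtering
# on a text column gives surprising results:
print("9" > "15")  # True as strings, although 9 < 15 as integers
```

A column holding a value like 15;28 cannot be cast to a single integer, which is why a blanket type change only suits single-sample jobs.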
If you look into the sample table, the column values are better. base__alt_reads is integer, base__tot_reads is integer, and base__af is real. If it's possible for you to query the db directly, you could try that. Or, if you know there's only one sample, the change in your PR works great. But it won't work as a general fix.
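As a rough sketch of querying the db directly, the Python snippet below filters on the typed columns of the sample table using the standard sqlite3 module. The column names base__alt_reads, base__tot_reads and base__af come from the description above; the presence of base__sample_id in the sample table, the function name, and the demo rows are assumptions for illustration. A real result database will have more columns, so check its schema first.

```python
import sqlite3


def samples_above_af(conn, min_af, min_depth=0):
    """Return sample rows with allele fraction and depth above thresholds.

    Assumes a table named `sample` with the columns used below.
    """
    query = (
        "SELECT base__sample_id, base__alt_reads, base__tot_reads, base__af "
        "FROM sample "
        "WHERE base__af >= ? AND base__tot_reads >= ?"
    )
    return conn.execute(query, (min_af, min_depth)).fetchall()


# Demo on an in-memory database with made-up rows; for a real job,
# connect to the job's .sqlite file instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sample ("
    "base__sample_id TEXT, base__alt_reads INTEGER, "
    "base__tot_reads INTEGER, base__af REAL)"
)
conn.executemany(
    "INSERT INTO sample VALUES (?, ?, ?, ?)",
    [("s1", 15, 30, 0.5), ("s2", 28, 200, 0.14)],
)
print(samples_above_af(conn, 0.2))
```

Because these columns are stored as integer and real, the comparisons in the WHERE clause are numeric, with no string parsing needed.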
We're working on better filtering, and are gathering use-cases. If you're willing to discuss more, I'm interested to know what you're trying to use these columns for.
This is fixed for single-sample vcfs here https://github.com/KarchinLab/open-cravat/issues/149