open-cravat Numeric values in VCF file are not parsed properly

Uploading VCFs to OpenCRAVAT seems to result in numeric values being parsed incorrectly (likely as strings), which hinders filtering.

The header of the VCF file is attached:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=RefCall,Description="Genotyping model thinks this site is reference.">
##FILTER=<ID=LowQual,Description="Confidence in this variant being real is below calling threshold.">
##FILTER=<ID=NoCall,Description="Site has depth=0 resulting in no call.">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position (for use with symbolic alleles)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Conditional genotype quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block.">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Read depth for each allele">
##FORMAT=<ID=VAF,Number=A,Type=Float,Description="Variant allele fractions.">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled genotype likelihoods rounded to the closest integer">
##FORMAT=<ID=MED_DP,Number=1,Type=Integer,Description="Median DP observed within the GVCF block rounded to the nearest integer.">

May 02 '22 12:05 bogdanovvp

Upd: this seems to be an issue in .sqlite generation. Changing the following column types within the sqlite file:

variant / vcfinfo__phred "text" -> "real"
variant / vcfinfo__alt_reads "text" -> "integer"
variant / vcfinfo__tot_reads "text" -> "integer"
variant / vcfinfo__af "text" -> "real"

and correcting the "type" values in the respective dictionaries in the "variant_header" table fixes the issue. The same change should be implemented in the code that generates the database.
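To illustrate why a "text" column hinders numeric filtering, here is a minimal sketch using Python's built-in sqlite3 module. It builds a throwaway in-memory table that only borrows the vcfinfo__tot_reads column name from above; the helper name compare_filters and the sample values are made up for illustration, and this is not OpenCRAVAT's own filtering code. SQLite compares a TEXT column to a number as strings, so wrapping the column in CAST is one possible read-only workaround that leaves the schema alone.

```python
import sqlite3


def compare_filters(values, threshold):
    """Return (rows matched by text comparison, rows matched after CAST).

    Hypothetical helper: the table is an in-memory stand-in for the
    "variant" table, with vcfinfo__tot_reads declared as TEXT.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE variant (vcfinfo__tot_reads TEXT)")
    conn.executemany(
        "INSERT INTO variant VALUES (?)", [(v,) for v in values]
    )

    # TEXT column vs. number: SQLite converts the number to text and
    # compares the two as strings.
    as_text = [
        row[0]
        for row in conn.execute(
            "SELECT vcfinfo__tot_reads FROM variant "
            "WHERE vcfinfo__tot_reads > ? ORDER BY rowid",
            (threshold,),
        )
    ]

    # CAST gives the expression INTEGER affinity, so the comparison
    # is numeric even though the stored values are still text.
    as_int = [
        row[0]
        for row in conn.execute(
            "SELECT vcfinfo__tot_reads FROM variant "
            "WHERE CAST(vcfinfo__tot_reads AS INTEGER) > ? ORDER BY rowid",
            (threshold,),
        )
    ]
    conn.close()
    return as_text, as_int


as_text, as_int = compare_filters(["9", "100", "15"], 20)
print("text comparison:", as_text)
print("after CAST:     ", as_int)
```

With a threshold of 20, the string comparison keeps "9" and drops "100", which is the opposite of what a read-depth filter is meant to do.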

May 09 '22 23:05 bogdanovvp

Upd2: this recent pull request generally solves the issue https://github.com/KarchinLab/open-cravat-modules-karchinlab/pull/11

May 10 '22 06:05 bogdanovvp

Hi bogdanovvp. Thanks a lot for the digging here, and the PR.

Unfortunately, the changes won't work for some jobs. For variants found in more than one sample, those columns are semicolon-delimited lists, and have to be strings. We are currently planning work on better sample/cohort filtering.

For example, consider a variant in two samples, s1 and s2. The base__sample_id column will be s1;s2, and vcfinfo__alt_reads will be something like 15;28.
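As a rough sketch of what handling such a row on the client side could look like, the snippet below splits the two delimited strings into per-sample numbers. It assumes both columns list samples in the same order with no missing entries, which the thread does not confirm; the function name split_per_sample is hypothetical.

```python
def split_per_sample(sample_ids, values, cast=int):
    """Pair each sample ID with its value from two ';'-delimited strings.

    Assumes (not confirmed above) that both strings have the same number
    of entries and list the samples in the same order.
    """
    samples = sample_ids.split(";")
    parts = values.split(";")
    if len(samples) != len(parts):
        raise ValueError("sample IDs and values have different lengths")
    return {sample: cast(part) for sample, part in zip(samples, parts)}


alt_reads = split_per_sample("s1;s2", "15;28")
print(alt_reads)
```

Passing cast=float would do the same for a column such as vcfinfo__af.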

If you look into the sample table, the column values are better. base__alt_reads is integer, base__tot_reads is integer, and base__af is real. If it's possible for you to query the db directly, you could try that. Or, if you know there's only one sample, the change in your PR works great. But it won't work as a general fix.
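For anyone who wants to try the direct-query route, here is a minimal sketch with Python's sqlite3 module. The column names base__alt_reads, base__tot_reads and base__af come from the comment above; the presence of base__sample_id in the sample table, the thresholds, and the function name samples_passing are assumptions for illustration. The demo runs against an in-memory stand-in; for a real job you would connect to the result .sqlite file instead.

```python
import sqlite3


def samples_passing(conn, min_tot_reads, min_af):
    """Return sample rows meeting numeric thresholds.

    Assumes a "sample" table whose base__tot_reads is integer and
    base__af is real; base__sample_id is an assumed column.
    """
    query = (
        "SELECT base__sample_id, base__alt_reads, base__tot_reads, base__af "
        "FROM sample "
        "WHERE base__tot_reads >= ? AND base__af >= ? "
        "ORDER BY base__sample_id"
    )
    return conn.execute(query, (min_tot_reads, min_af)).fetchall()


# In-memory stand-in for the result database (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sample ("
    "base__sample_id TEXT, base__alt_reads INTEGER, "
    "base__tot_reads INTEGER, base__af REAL)"
)
conn.executemany(
    "INSERT INTO sample VALUES (?, ?, ?, ?)",
    [("s1", 15, 30, 0.5), ("s2", 28, 140, 0.2)],
)

rows = samples_passing(conn, min_tot_reads=20, min_af=0.25)
print(rows)
```

Because the columns are typed numerically here, the comparisons behave as numeric filters without any casting.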

We're working on better filtering, and are gathering use-cases. If you're willing to discuss more, I'm interested to know what you're trying to use these columns for.

Sep 12 '22 03:09 kmoad

This is fixed for single-sample vcfs here https://github.com/KarchinLab/open-cravat/issues/149

May 25 '23 17:05 kmoad
