Produced VCFs are claimed to be malformed by IGV

Open priesgo opened this issue 2 years ago • 2 comments

Loading a VCF produced by the pipeline into IGV fails with the following error message:

The provided VCF file is malformed at approximately line number 69: The VCF specification does not allow for whitespace in the INFO field. Offending field value was "DP=29;AF=0.103448;SB=0;DP4=13,13,1,2;INDEL;HRUN=5;ANN=C|frameshift_variant|HIGH|ORF1ab|gene-GU280_gp01|transcript|TRANSCRIPT_gene-GU280_gp01|protein_coding|1/1|c.10122delT|p.S3376fs|10122/21290|10122/21290|3374/7095||WARNING_TRANSCRIPT_MULTIPLE_STOP_CODONS;LOF=(ORF1ab|gene-GU280_gp01|1|1.00);CONS_HMM_SARS_COV_2=0.57215;CONS_HMM_SARBECOVIRUS=0.57215;CONS_HMM_VERTEBRATE_COV=0;PFAM_NAME=Peptidase_C30_CoV;PFAM_DESCRIPTION=Peptidase C30,coronavirus;vafator_af=0.103448;vafator_ac=3;vafator_dp=29",

The cause appears to be the PFAM_DESCRIPTION field, which contains whitespace. A fix would affect both the pipeline and the processor. The pipeline would need to generate valid VCF, for instance by replacing whitespace with underscores. The processor would then need to convert the underscores back into whitespace when loading the data into the database. One possible problem with this approach is that INFO fields may contain other underscores that we don't want to replace with whitespace.
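To make the underscore concern concrete, here is a minimal sketch of the round trip, using the two Pfam values from the error message above. The function names `escape_info_value` and `unescape_info_value` are hypothetical and do not exist in the pipeline or the processor.

```python
def escape_info_value(value: str) -> str:
    """Pipeline side (hypothetical helper): replace spaces with underscores."""
    return value.replace(" ", "_")


def unescape_info_value(value: str) -> str:
    """Processor side (hypothetical helper): turn underscores back into spaces."""
    return value.replace("_", " ")


# Values taken from the INFO field quoted in the error message.
pfam_description = "Peptidase C30,coronavirus"
pfam_name = "Peptidase_C30_CoV"

# The description survives the round trip because it has no underscores of its own.
escaped = escape_info_value(pfam_description)
round_tripped = unescape_info_value(escaped)

# A value that legitimately contains underscores is altered if the processor
# unescapes it as well, so the processor would need to know which fields were escaped.
damaged_name = unescape_info_value(pfam_name)

print(escaped)        # Peptidase_C30,coronavirus
print(round_tripped)  # Peptidase C30,coronavirus
print(damaged_name)   # Peptidase C30 CoV
```

The sketch suggests that the substitution is only safe if the processor applies it to an explicit list of fields rather than to every INFO value.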

priesgo, Oct 24 '23 08:10

At least three options:

  • Escape whitespace with something like underscores
  • Escape whitespace with HTML codes and expect that IGV and similar tools parse this properly
  • Integrate the Pfam annotations into the SnpEff reference for SARS-CoV-2 and see what SnpEff does with whitespace
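As an illustration of the second option, the sketch below uses percent-encoding, which is one possible reading of "HTML codes". It relies only on Python's standard `urllib.parse.quote` and `unquote`. The helper names `encode_info_value` and `decode_info_value` are hypothetical, and whether IGV displays or decodes `%20` in INFO values has not been verified here.

```python
from urllib.parse import quote, unquote


def encode_info_value(value: str) -> str:
    """Hypothetical pipeline-side helper: percent-encode everything except
    letters, digits and '_', '.', '-', '~' (safe="" also encodes '/')."""
    return quote(value, safe="")


def decode_info_value(value: str) -> str:
    """Hypothetical processor-side helper: reverse the percent-encoding."""
    return unquote(value)


# Values taken from the INFO field quoted in the error message.
pfam_description = "Peptidase C30,coronavirus"
pfam_name = "Peptidase_C30_CoV"

encoded = encode_info_value(pfam_description)
print(encoded)                     # Peptidase%20C30%2Ccoronavirus
print(decode_info_value(encoded))  # Peptidase C30,coronavirus

# Underscores are left untouched, so decoding is unambiguous for this value.
print(decode_info_value(encode_info_value(pfam_name)))  # Peptidase_C30_CoV
```

Unlike an underscore substitution, this encoding is reversible without knowing which fields were escaped, at the cost of less readable values in the VCF.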

priesgo, Oct 24 '23 08:10

Fourth option: remove the Pfam long description altogether if it is not used in the dashboard.
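A minimal sketch of what dropping the field could look like on a single INFO string, assuming it is done as plain string processing. The helper `drop_info_key` is hypothetical and is not part of the pipeline; the example INFO string is a shortened version of the one in the error message.

```python
def drop_info_key(info: str, key: str) -> str:
    """Hypothetical helper: remove one key (flag or key=value) from a VCF INFO string."""
    kept = [
        entry
        for entry in info.split(";")
        if entry != key and not entry.startswith(key + "=")
    ]
    # An empty INFO column is written as "." in VCF.
    return ";".join(kept) if kept else "."


info = (
    "DP=29;INDEL;PFAM_NAME=Peptidase_C30_CoV;"
    "PFAM_DESCRIPTION=Peptidase C30,coronavirus;vafator_dp=29"
)
cleaned = drop_info_key(info, "PFAM_DESCRIPTION")

print(cleaned)         # DP=29;INDEL;PFAM_NAME=Peptidase_C30_CoV;vafator_dp=29
print(" " in cleaned)  # False
```

If this route is taken, the corresponding `##INFO` header line for PFAM_DESCRIPTION would presumably need to be removed as well, so that the header matches the records.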

priesgo, Oct 24 '23 09:10