
Cannot round-trip explicitly set missing INFO values in VCF

Open jeromekelleher opened this issue 1 year ago • 5 comments

The all_fields.vcf file contains lots of examples where we explicitly state that an INFO key is missing, rather than omitting the key, e.g. II1=. and II2=.,. here. This was handled before #1190 because we were treating non-present INFO keys as PAD values and only these explicit "key=." values as missing.

I don't think it's a useful distinction, and distinguishing between these two types of missingness is likely to cause more problems downstream. In any case, I'm fairly clear that regarding missing keys as dimension padding isn't helpful.

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  s1      s2
1       1       .       G       A,C     .       PASS    IB0     .       .       .
1       2       .       A       G,G     .       PASS    II1=126 .       .       .
1       3       .       A       G,G     .       PASS    II1=.   .       .       .
1       4       .       T       A,C     .       PASS    II2=459,-140    .       .       .
1       5       .       T       A,C     .       PASS    II2=.,-140      .       .       .
1       6       .       T       A,C     .       PASS    II2=459,.       .       .       .
1       7       .       T       A,C     .       PASS    II2=.,. .       .       .

However, it seems that bcftools at least does make this distinction, and losslessly roundtrips this VCF through BCF.
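
To make the distinction concrete, here is a minimal sketch in plain Python that classifies how a key appears in a raw INFO column string. The function name `classify_info_key` and the category labels are hypothetical, chosen only for illustration; this is not how sgkit or bcftools represent these cases internally.

```python
def classify_info_key(info_field, key):
    """Classify how `key` appears in a raw VCF INFO column string.

    Hypothetical helper for illustration only. Returns one of:
    "absent", "flag", "explicit_missing", "partial_missing", "present".
    """
    if info_field == ".":
        # The whole INFO column is empty, so no key is present.
        return "absent"
    for entry in info_field.split(";"):
        name, sep, value = entry.partition("=")
        if name != key:
            continue
        if not sep:
            # A key with no "=" is a flag, e.g. IB0.
            return "flag"
        values = value.split(",")
        if all(v == "." for v in values):
            # e.g. II1=. or II2=.,.
            return "explicit_missing"
        if any(v == "." for v in values):
            # e.g. II2=.,-140
            return "partial_missing"
        return "present"
    return "absent"


for info in ["IB0", "II1=126", "II1=.", "II2=.,-140", "II2=.,."]:
    key = info.split("=")[0]
    print(info, "->", classify_info_key(info, key))
```

The round-trip question in this issue is whether "absent" and "explicit_missing" should stay distinguishable after conversion.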

My suggestion here is that we just edit the all_fields.vcf file to remove all-missing values. This seems like a pretty niche problem, and probably something we'd need to deal with explicitly at the spec level rather than here. It's not worth getting bogged down on, I think.
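
As a rough sketch of what that edit amounts to, the function below drops INFO entries whose values are all "." from a single VCF record line. It assumes tab-separated columns with INFO in the eighth column, and the name `drop_all_missing_info` is hypothetical. The actual fix would more likely be made in the generator script than by post-processing the file.

```python
def drop_all_missing_info(line):
    """Remove INFO entries whose values are all "." from one VCF line.

    Hypothetical helper for illustration only. Header lines and lines
    with fewer than 8 tab-separated columns are returned unchanged.
    """
    if line.startswith("#"):
        return line
    newline = "\n" if line.endswith("\n") else ""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return line
    kept = []
    for entry in fields[7].split(";"):
        name, sep, value = entry.partition("=")
        # Skip key=. and key=.,. but keep flags and partially missing values.
        if sep and all(v == "." for v in value.split(",")):
            continue
        kept.append(entry)
    # An INFO column with no entries left is written as ".".
    fields[7] = ";".join(kept) if kept else "."
    return "\t".join(fields) + newline


record = "1\t7\t.\tT\tA,C\t.\tPASS\tII2=.,.\t.\t.\t.\n"
print(drop_all_missing_info(record), end="")
```

Partially missing values such as II2=.,-140 are left alone, since only the all-missing case is the one that fails to round-trip.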

jeromekelleher, Feb 16 '24 11:02

Sounds good - all_fields.vcf is not a VCF from the wild, so it's OK to change it.

tomwhite, Feb 16 '24 11:02

Do you have the script for generating it?

jeromekelleher, Feb 16 '24 11:02

See https://github.com/pystatgen/sgkit/blob/main/sgkit/tests/io/vcf/test_vcf_generator.py

tomwhite, Feb 16 '24 11:02