BCF Character/String type MISSING/EOV encoding

Open andersleung opened this issue 4 years ago • 1 comments

In BCF, the spec gives no MISSING or EOV encoding for the Character/String type. htslib and GenomicsDB define MISSING and EOV for String/Character to be 0x07 and 0x00 respectively. However, htslib only seems to convert 0x07 to "." when converting BCF to VCF; it does not convert "." to 0x07 when writing VCF as BCF.

My question is how a VCF record with missing Characters and missing Strings is encoded in BCF. If the spec follows htslib, I think a missing Character should be defined to be encoded as a length-1 String whose only byte is 0x07, and a missing String, being an entirely missing vector of Character, would be [0x07,0x00,0x00,...] because of https://github.com/samtools/hts-specs/pull/617.
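
To make the proposal concrete, here is a minimal Python sketch of the encoding I have in mind. It only covers the payload bytes, not the BCF typed-value descriptor, and it assumes the 0x07/0x00 sentinels described above for htslib and GenomicsDB rather than anything the spec currently states. The helper names encode_string_field and decode_string_field are hypothetical and do not correspond to any htslib function.

```python
# Sentinel bytes as described in this issue for htslib and GenomicsDB.
# Assumption: the BCF spec itself does not define these for Character/String.
CHAR_MISSING = 0x07
CHAR_EOV = 0x00  # end-of-vector, used here as right padding


def encode_string_field(value: str, width: int) -> bytes:
    """Hypothetical helper: encode one VCF string value into `width` bytes."""
    if value == ".":
        # Proposed rule: a missing value is a single MISSING byte.
        payload = bytes([CHAR_MISSING])
    else:
        payload = value.encode("ascii")
    if len(payload) > width:
        raise ValueError("value does not fit in the vector width")
    # Pad the remainder of the vector with EOV bytes.
    return payload + bytes([CHAR_EOV]) * (width - len(payload))


def decode_string_field(raw: bytes) -> str:
    """Hypothetical inverse: strip EOV padding, map a lone 0x07 back to '.'."""
    payload = raw.rstrip(bytes([CHAR_EOV]))
    if payload == bytes([CHAR_MISSING]):
        return "."
    return payload.decode("ascii")
```

Under this sketch a missing Character is encode_string_field(".", 1), and a missing String in a vector padded to width 4 is encode_string_field(".", 4), i.e. 0x07 followed by three 0x00 bytes. Applying the same rule in both directions is exactly the symmetry that seems to be missing when writing VCF as BCF.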

As a separate issue, it's not well defined what the Character type in VCF means. In BCF, Character is one 7-bit ASCII byte, but in VCF, which is UTF-8 encoded, Character could be a byte, a Unicode codepoint, or a grapheme.
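
A small Python illustration of why the three readings differ. The helper char_counts is hypothetical, and NFC normalization is only a rough stand-in for grapheme counting (real grapheme segmentation needs the Unicode text segmentation rules, which the standard library does not provide).

```python
import unicodedata


def char_counts(text: str) -> dict:
    """Hypothetical helper: count a string under different notions of 'character'."""
    return {
        # Number of bytes in the UTF-8 encoding, as stored in a VCF file.
        "utf8_bytes": len(text.encode("utf-8")),
        # Number of Unicode codepoints.
        "codepoints": len(text),
        # Codepoints after NFC composition; an approximation of graphemes only.
        "nfc_codepoints": len(unicodedata.normalize("NFC", text)),
    }


# 'e' followed by U+0301 COMBINING ACUTE ACCENT: one visible character.
decomposed = "e\u0301"
counts = char_counts(decomposed)
```

For this input the byte count, the codepoint count, and the composed count all disagree, so a Number=1 Character field could hold it under one reading and not under the others.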

andersleung avatar Dec 22 '21 16:12 andersleung

I second this.

The specification of (partly) empty vectors is really imprecise. See also #593.

h-2 avatar Jan 25 '22 12:01 h-2