
Percent encoding is under-defined

Open d-cameron opened this issue 9 months ago • 4 comments

The intent of percent encoding was to ensure that reserved characters can still appear in strings in the fields where they are reserved. The spec lists a set of characters with special meaning but does not explicitly state which characters are reserved in which fields.

This has resulted in noodles performing percent encoding of colons in the INFO field even though it doesn't need to: https://github.com/zaeleus/noodles/issues/339

The specs should explicitly state which characters must be percent encoded in which fields and what implementations should do when reading files with encoded non-reserved characters. My preference is to clarify that:

  • Only reserved characters should be encoded
    • The list in section 1.2 should be turned into a table indicating which fields reserve which characters
  • Parsers should decode all percent-encoded characters whenever such an encoding decodes to a valid character (for compatibility with older VCFs that contain literal % signs, e.g. ...;FIELD=50%;...)
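
To make the last bullet concrete, here is a rough Python sketch of such a lenient decoder. It is only an illustration, not text from the spec or code from noodles: the name `lenient_percent_decode` is hypothetical, and it assumes "decodes to a valid character" means a `%` followed by two hex digits that name an ASCII code point. Anything else, including a bare `%`, is passed through unchanged.

```python
import re

# "%" followed by exactly two hex digits (either case is accepted on read).
_PERCENT_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def lenient_percent_decode(value: str) -> str:
    """Decode well-formed %XX escapes; leave every other '%' untouched.

    Assumption (not from the spec): only escapes below 0x80 are decoded.
    """

    def replace(match: re.Match) -> str:
        code = int(match.group(1), 16)
        if code < 0x80:
            return chr(code)
        # Not an ASCII code point: keep the original text as-is.
        return match.group(0)

    # re.sub scans left to right without rescanning its own output,
    # so "%253A" becomes "%3A" rather than ":".
    return _PERCENT_ESCAPE.sub(replace, value)
```

With this behaviour a legacy value such as `50%` survives a read unchanged, while `a%3Ab` is decoded to `a:b` regardless of which field it came from.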

d-cameron avatar May 28 '25 06:05 d-cameron

The current rule is clear that "for any other meaning [the following characters] must be represented with the capitalized percent encoding" (§ 1.2 (2024-10-09)), supporting that percent-encoding the colon (:) in an INFO field (value) is compliant. I don't oppose subsetting rules for different fields, but that is a change in behavior, not a clarification.

This proposal also contradicts a previous comment (https://github.com/samtools/hts-specs/issues/689#issuecomment-1352526708) that the specified characters in § 1.2 are supposed to be encoded and decoded in field values:

IIRC, the intent was for [...] (2) - encode/decode all 8 reserved values

zaeleus avatar May 28 '25 14:05 zaeleus

  • The list in section 1.2 should be turned into a table indicating which fields reserve which characters

Does this represent the expected explicit encoding rules for INFO and sample field values?

| character | percent encoding | INFO field value | sample field value |
|---|---|---|---|
| "non-printable characters" | ... | yes | yes |
| : (colon) | %3A | no | yes |
| ; (semicolon) | %3B | yes | no |
| = (equal sign) | %3D | yes | no |
| % (percent sign) | %25 | yes | yes |
| , (comma) | %2C | yes | yes |
| . (period) | %2E | yes | yes |
| CR (carriage return) | %0D | yes | yes |
| LF (line feed) | %0A | yes | yes |
| HT (horizontal tab) | %09 | yes | yes |

Edit 2025-11-11: Added . (period) to the table. This is used as a marker for missing values in both INFO and sample fields.

zaeleus avatar Aug 20 '25 18:08 zaeleus

Discussed during the monthly call. JB is in agreement with Daniel regarding only encoding where necessary. It was also agreed that the table above would be an improvement.

jkbonfield avatar Oct 14 '25 15:10 jkbonfield

I added . (period) to the table above. It is used as a marker for missing values in both INFO and sample fields. (But also note that the former is still not clarified in the spec. See #609.)

Edit: For string values, this encoding would only apply to the specific string "."
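
For illustration, the field-specific rules from the table, plus the period rule just described, might look like this on the writing side. This is a hedged Python sketch and not an existing API: `RESERVED` and `encode_field_value` are hypothetical names, and treating "non-printable characters" as the ASCII control characters (below 0x20, plus 0x7F) is my assumption, since the table leaves that row open.

```python
# Characters reserved per field, following the proposed table.
# CR, LF and HT are covered by the control-character check below.
RESERVED = {
    "info": set(";=%,"),
    "sample": set(":%,"),
}


def encode_field_value(value: str, field: str) -> str:
    """Percent-encode only what the given field ("info" or "sample") reserves."""
    # The period is only special when it is the whole value (missing marker).
    if value == ".":
        return "%2E"

    reserved = RESERVED[field]
    out = []
    for ch in value:
        code = ord(ch)
        # Assumption: "non-printable" means ASCII control characters.
        if ch in reserved or code < 0x20 or code == 0x7F:
            out.append(f"%{code:02X}")  # capitalized hex, as the spec requires
        else:
            out.append(ch)
    return "".join(out)
```

Under these rules `a:b` is written unchanged in an INFO value but becomes `a%3Ab` in a sample value, and `1.5` is left alone because only the exact string `.` is encoded.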

zaeleus avatar Nov 11 '25 22:11 zaeleus