hts-specs icon indicating copy to clipboard operation
hts-specs copied to clipboard

Allow for UTF-8 field values in header regular expression

Open jmarshall opened this issue 2 years ago • 11 comments

Use [:print:] in the header regex and note that for ASCII it is equivalent to [ -~] and that the aim is to forbid control characters. Fixes #719.

To be honest, I'm tempted to add the extra [] to the \cclass definition and waste a bit of space each time this appears rather than add the “For brevity” sentence.

An alternative to this PR might be to just leave the regex as [ -~] and add a footnote explaining that this is an oversimplification for fields that allow Unicode values.

jmarshall avatar May 12 '23 09:05 jmarshall

Changed PDFs as of 229e998b93fe168dc107095d81b9a31ccc75483d: SAMv1 (diff).

github-actions[bot] avatar May 12 '23 09:05 github-actions[bot]

Changed PDFs as of 3b4e874ec5e426886baa251aed2e6cf671dee521: SAMv1 (diff).

github-actions[bot] avatar May 16 '23 20:05 github-actions[bot]

I'm unsure on the extra brackets also. My inclination though is it's probably not worth the hassle of inventing our own syntax and just going with the official double bracket style.

I'm guessing the extra brackets however were to permit things like [:[:alnum:]] being [:A-Za-z0-9] without ambiguity.

jkbonfield avatar May 17 '23 10:05 jkbonfield

~~Note that UTS #18: Annex C suggests :print: character class compatibility for Unicode as \p{graph}\p{blank}--\p{cntrl}, which is likely not the appropriate definition here since it includes :blank:. It may be better to note a property matcher instead, e.g., \P{Other}.~~

zaeleus avatar Jun 01 '23 02:06 zaeleus

Which specific :blank: characters are you worried about inadvertently including? (The obvious worry is TAB, but presumably that is removed by \p{cntrl}.)

jmarshall avatar Jun 01 '23 03:06 jmarshall

Ah, yes, you're right. I had a set operation wrong when I tested with an example. :print: seems sufficient for this change.

zaeleus avatar Jun 01 '23 04:06 zaeleus

I see I was assigned this in the last meeting.

Personally my preference is [[:print:]] over [:print:] as it's a standard and ironically the extra couple of characters we add a few times is less text than the "for brevity" statement. Not a hard "must be so" line, but if in agreement I'd prefer that before merging. Otherwise I'm happy with it.

jkbonfield avatar Jun 27 '23 08:06 jkbonfield

Can I get clarity on who is progressing this please? I was assigned, but gave my feedback over a year ago and it's unchanged. As far as I'm concerned, the ball is back with @jmarshall , but if you wish me to just make an editorial decision then I will amend the [:print:] to [[:print:]] and merge.

jkbonfield avatar Jan 07 '25 15:01 jkbonfield

The ball was indeed with me. See the new preview for how this uglifies [[:rname:^*=]][[:rname:]]*

jmarshall avatar Jan 28 '25 21:01 jmarshall

Changed PDFs as of ecdb25d1dbc4b0aeca5e760235ddb3f9da92d08c: SAMv1 (diff).

github-actions[bot] avatar Jan 28 '25 21:01 github-actions[bot]

Changed PDFs as of 3692643ba7ff22ba50180b777ef898aa1141056b: SAMv1 (diff).

github-actions[bot] avatar Jan 28 '25 21:01 github-actions[bot]