JSON-Schema-Test-Suite
JSON-Schema-Test-Suite copied to clipboard
Add tests to ensure Unicode string comparison is by codepoint
§4.2.2 of the specification says, for string equality, that it is defined by:
[...] both are strings, and are the same codepoint-for-codepoint; [...]
We should add some tests to specifically ensure that a particular grapheme (e.g. ä) is not equal to the same grapheme represented by different code point(s).
Also, that two graphemes that look the same aren't identical (for example µ (MICRO SIGN, U+00B5) is not the same as μ (GREEK SMALL LETTER MU, U+2192).
both are strings, and are the same codepoint-for-codepoint
This is actually kind of unfortunate wording, as there can be two graphemes that are identical but differ by codepoint, because of the ordering of the combining characters (underlining, diacritics, accent marks). See http://www.unicode.org/reports/tr15/ on normalization forms - ideally we should specify that string comparisons are done on normalized strings, but I fear the overhead may be prohibitive to some implementations (especially when "most" strings are ascii anyway).
as there can be two graphemes that are identical but differ by codepoint, because of the ordering of the combining characters
I wrote the "codepoint-for-codepoint" text, in my basic understanding, there's multiple different ways to compare strings for equality, and preserving codepoints exactly is the only strategy that doesn't lose information. There's a potential that picking any other comparison algorithm could break compatibility with JSON.
Since JSON is Unicode, you can renormalize the whole document before validating it, and JSON Schema doesn't need to (normatively) mention this. You can also write your schema to use a particular normal form, and you tell applications "this schema expects [e.g.] NFC."
I am looking for a first issue - and happy to have a crack at this one.
Would it be acceptable to put together a few cases based on the const keyword from draft 6?
Something like:
[
{
"description": "mu equality",
"schema": {
"$schema": "https://json-schema.org/draft/2019-09/schema",
"const": "μ"
},
"tests": [
{
"description": "mu is valid",
"data": "μ",
"valid": true
},
{
"description": "micro is invalid",
"data": "µ",
"valid": false
},
{
"description": "mu code point is valid",
"data": "\u03bc",
"valid": true
},
{
"description": "micro code point is invalid",
"data": "\u00b5",
"valid": false
}
]
}
]
I am looking for a first issue - and happy to have a crack at this one.
Thanks! That'd be really helpful!
Yep that looks good I think (other than better to mention we're testing codepoint equivalence than to mention mu in the description of the test case, that's not particularly relevant).
If we want to be extra nice probably having a second example for combining marks like the original description, i.e. "a\u0308" != "\u00e4".
OK cool - thanks @Julian - point taken
I had a second thought after I sent the comment though
is "data" in the test schema meant to be a stand-in for JSON instance data?
basically what I'm getting at is that "\u03bc" is a perfectly valid string literal and that maybe we need to differentiate between cases where the characters that comprise a codepoint representation are being sent over the wire (or whatever) as opposed to the character identified by the codepoint
and I'm not sure whether the last two tests are correct or not
will have a think & a read and raise a PR in due course
comments appreciated in the meantime though
is "data" in the test schema meant to be a stand-in for JSON instance data?
Yes, data means "the instance you're testing". IIRC it predates us commonly using the word "instance" which is why it's named that way.
It's the JSON parser that someone uses's job to turn "\u03bc" into "μ" as part of deserializing (as opposed to JSON Schema), so really there isn't likely to be any difference there, but it's fine/good I think to have both representations.
Opened a PR
the tests are on draft6 since this is the earliest draft of the spec with the 'codepoint-for-codepoint' language.
I'm not proposing to add the tests for the suites for drafts 3 & 4 since those drafts just mention that strings should have the "same value" - and don't specify how those values are derived
but if these pass the smell test I'd be happy to copy the tests into the test suites for the other draft specs - I just didn't want to crowd the PR with too much duplicate text straight off the bat
It's the JSON parser that someone uses's job to turn "\u03bc" into "μ"
I think this is the key. Since a parser is already doing the conversion, any tests we write are checking the parser, not the validator.