JSON-Schema-Test-Suite icon indicating copy to clipboard operation
JSON-Schema-Test-Suite copied to clipboard

Add tests to ensure Unicode string comparison is by codepoint

Open Julian opened this issue 3 years ago • 8 comments
trafficstars

§4.2.2 of the specification says, for string equality, that it is defined by:

[...] both are strings, and are the same codepoint-for-codepoint; [...]

We should add some tests to specifically ensure that a particular grapheme (e.g. ä) is not equal to the same grapheme represented by different code point(s).

Julian avatar Jul 05 '22 12:07 Julian

Also, that two graphemes that look the same aren't identical (for example µ (MICRO SIGN, U+00B5) is not the same as μ (GREEK SMALL LETTER MU, U+2192).

both are strings, and are the same codepoint-for-codepoint

This is actually kind of unfortunate wording, as there can be two graphemes that are identical but differ by codepoint, because of the ordering of the combining characters (underlining, diacritics, accent marks). See http://www.unicode.org/reports/tr15/ on normalization forms - ideally we should specify that string comparisons are done on normalized strings, but I fear the overhead may be prohibitive to some implementations (especially when "most" strings are ascii anyway).

karenetheridge avatar Aug 29 '22 00:08 karenetheridge

as there can be two graphemes that are identical but differ by codepoint, because of the ordering of the combining characters

I wrote the "codepoint-for-codepoint" text, in my basic understanding, there's multiple different ways to compare strings for equality, and preserving codepoints exactly is the only strategy that doesn't lose information. There's a potential that picking any other comparison algorithm could break compatibility with JSON.

Since JSON is Unicode, you can renormalize the whole document before validating it, and JSON Schema doesn't need to (normatively) mention this. You can also write your schema to use a particular normal form, and you tell applications "this schema expects [e.g.] NFC."

awwright avatar Sep 28 '22 18:09 awwright

I am looking for a first issue - and happy to have a crack at this one.

Would it be acceptable to put together a few cases based on the const keyword from draft 6?

Something like:

[
    {
        "description": "mu equality",
        "schema": {
            "$schema": "https://json-schema.org/draft/2019-09/schema",
            "const": "μ"
        },
        "tests": [
            {
                "description": "mu is valid",
                "data": "μ",
                "valid": true
            },
            {
                "description": "micro is invalid",
                "data": "µ",
                "valid": false
            },
            {
                "description": "mu code point is valid",
                "data": "\u03bc",
                "valid": true
            },
            {
                "description": "micro code point is invalid",
                "data": "\u00b5",
                "valid": false
            }
        ]
     }
  ]

zaplapl avatar Mar 31 '24 14:03 zaplapl

I am looking for a first issue - and happy to have a crack at this one.

Thanks! That'd be really helpful!

Yep that looks good I think (other than better to mention we're testing codepoint equivalence than to mention mu in the description of the test case, that's not particularly relevant).

If we want to be extra nice probably having a second example for combining marks like the original description, i.e. "a\u0308" != "\u00e4".

Julian avatar Apr 01 '24 12:04 Julian

OK cool - thanks @Julian - point taken

I had a second thought after I sent the comment though

is "data" in the test schema meant to be a stand-in for JSON instance data?

basically what I'm getting at is that "\u03bc" is a perfectly valid string literal and that maybe we need to differentiate between cases where the characters that comprise a codepoint representation are being sent over the wire (or whatever) as opposed to the character identified by the codepoint

and I'm not sure whether the last two tests are correct or not

will have a think & a read and raise a PR in due course

comments appreciated in the meantime though

zaplapl avatar Apr 02 '24 15:04 zaplapl

is "data" in the test schema meant to be a stand-in for JSON instance data?

Yes, data means "the instance you're testing". IIRC it predates us commonly using the word "instance" which is why it's named that way.

It's the JSON parser that someone uses's job to turn "\u03bc" into "μ" as part of deserializing (as opposed to JSON Schema), so really there isn't likely to be any difference there, but it's fine/good I think to have both representations.

Julian avatar Apr 02 '24 15:04 Julian

Opened a PR

the tests are on draft6 since this is the earliest draft of the spec with the 'codepoint-for-codepoint' language.

I'm not proposing to add the tests for the suites for drafts 3 & 4 since those drafts just mention that strings should have the "same value" - and don't specify how those values are derived

but if these pass the smell test I'd be happy to copy the tests into the test suites for the other draft specs - I just didn't want to crowd the PR with too much duplicate text straight off the bat

zaplapl avatar Apr 14 '24 11:04 zaplapl

It's the JSON parser that someone uses's job to turn "\u03bc" into "μ"

I think this is the key. Since a parser is already doing the conversion, any tests we write are checking the parser, not the validator.

gregsdennis avatar Apr 14 '24 12:04 gregsdennis