fix(marshal)!: compare strings by codepoint

Open erights opened this issue 1 year ago • 4 comments

closes: #2113 refs: #2002

Description

  • JavaScript's relational comparison operations, as exposed in methods like Array.prototype.sort and the operators </<=/>=/>, compare strings by lexicographic UTF-16 code unit order, which exposes an internal representational detail not relevant to the string's meaning as a Unicode string. Previously, compareRank and associated functions compared strings using this JavaScript-native comparison. Now compareRank and associated functions compare strings by lexicographic Unicode code point order. This change only affects comparison of characters in the range U+E000 through U+FFFF vs. supplementary-plane characters starting at U+10000 [i.e., those whose code point does not fit in 16 bits], and therefore only collections of strings including characters from both of those ranges.
    • This release does not change the encodePassable encoding. But now, when we say it is order preserving, we need to be careful about which order we mean. encodePassable is rank-order preserving when the encoded strings are compared using compareRank.
    • The key order of strings defined by the @endo/patterns module is still defined to be the same as the rank ordering of those strings. So this release changes key order among strings to also be lexicographic comparison of Unicode code points. To accommodate this change, you may need to adapt applications that relied on key-order being the same as JS native order. This could include the use of any patterns expressing key inequality tests, like M.gte(string).
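
To make the difference concrete, here is a minimal sketch of a code point comparator next to the native sort. The name compareByCodePoint is hypothetical and this is not Endo's compareRank implementation; it only illustrates the two orderings described above on one pair of strings drawn from the two affected ranges.

```javascript
// Hypothetical helper, not Endo's implementation: compare two strings by
// lexicographic Unicode code point order. The string iterator yields one
// code point at a time (a surrogate pair comes out as a single element).
const compareByCodePoint = (left, right) => {
  const leftIter = left[Symbol.iterator]();
  const rightIter = right[Symbol.iterator]();
  for (;;) {
    const { value: l, done: lDone } = leftIter.next();
    const { value: r, done: rDone } = rightIter.next();
    if (lDone || rDone) {
      // A string that is a prefix of the other sorts first.
      if (lDone && rDone) return 0;
      return lDone ? -1 : 1;
    }
    const diff = l.codePointAt(0) - r.codePointAt(0);
    if (diff !== 0) return diff < 0 ? -1 : 1;
  }
};

const halfwidth = '\uFF61'; // U+FF61, a single UTF-16 code unit 0xFF61
const emoji = '\u{1F600}'; // U+1F600, the surrogate pair 0xD83D 0xDE00

// Native order compares code units: 0xD83D < 0xFF61, so the emoji sorts first.
const nativeOrder = [halfwidth, emoji].sort();
// Code point order compares 0xFF61 < 0x1F600, so the emoji sorts last.
const codePointOrder = [halfwidth, emoji].sort(compareByCodePoint);
```

Strings that stay below U+E000, or that stay entirely within one of the two ranges, sort identically under both comparators.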

Security Considerations

The fact that the string ordering is closer to the Unicode semantics of the strings probably minimizes some surprises in ways that help security. OTOH, this difference from JS native string ordering probably causes other surprises that hurt security. Altogether, we do not expect much effect.

Scaling Considerations

As a comparison written in JS, this will be slower than the JS native string comparison. On XS at least, we expect to have a native code point comparison function available eventually. Altogether, we do not expect much effect.

Documentation Considerations

Most developers will not care. But it needs to be explained somewhere carefully so that developers that do care can easily find out.

Testing Considerations

@gibson042, in a later PR, could you expand the property-based testing to generate test cases sensitive to this change?

Compatibility Considerations

  • These string ordering changes bring Endo into conformance with any string ordering components of the OCapN standard.
  • To accommodate these changes, you may need to adapt applications that relied on rank-order or key-order being the same as JS native order. You may need to re-sort any data that had previously been rank sorted using the prior compareRank function. You may need to revisit any use of patterns like M.gte(string) expressing inequalities over strings.
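
As a rough aid for deciding which stored collections might need re-sorting, the check below flags a collection only if it contains code units from both affected ranges. It is a conservative sketch: orderMayDiffer and the two regular expressions are hypothetical names, and it assumes all strings are well-formed (no lone surrogates). A result of true means the order may differ, not that it does.

```javascript
// Without the `u` flag these patterns match individual UTF-16 code units.
const HIGH_BMP_UNIT = /[\uE000-\uFFFF]/; // BMP characters U+E000 through U+FFFF
const SURROGATE_UNIT = /[\uD800-\uDFFF]/; // halves of pairs encoding U+10000 and up

// Hypothetical migration helper: returns true only when the collection mixes
// both ranges, which is the only case where native order and code point
// order can disagree (assuming well-formed strings).
const orderMayDiffer = strings => {
  let sawHighBmp = false;
  let sawSurrogate = false;
  for (const s of strings) {
    if (HIGH_BMP_UNIT.test(s)) sawHighBmp = true;
    if (SURROGATE_UNIT.test(s)) sawSurrogate = true;
    if (sawHighBmp && sawSurrogate) return true;
  }
  return false;
};

const asciiOnly = orderMayDiffer(['apple', 'banana']); // false
const highBmpOnly = orderMayDiffer(['apple', '\uFF61']); // false
const mixed = orderMayDiffer(['\uFF61', '\u{1F600}']); // true
```

Collections for which this returns false can be left as they are; the rest would still need an actual comparison pass to see whether any element moves.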

Upgrade Considerations

If we currently have any persistent data, especially on chain, sorted according to JS native order (by UTF-16 code unit), then we cannot accept this PR until we have a plan to re-sort that data, or to somehow continue to live with mis-sorted data. (Historical note: This is how Oracle came to permanently rely on UTF-16 code unit order, because of the impracticality of re-sorting all that data.)

  • [ ] Includes *BREAKING*: in the commit message with migration instructions for any breaking change.
  • [x] Updates NEWS.md for user-facing changes.

erights avatar Jan 25 '24 21:01 erights

Excellent.

For this change, I do not think we can avoid the breaking change marker. That might render my argument for leaving it out of pass-style moot.

Let me be sure I understand:

You're saying that this PR should keep the "!". Given that, we may as well keep the "!" on #2002 as well. Right?

erights avatar Jan 26 '24 01:01 erights

Yes

kriskowal avatar Jan 26 '24 02:01 kriskowal

Just noting here for curiosity, from the UTF-16 portion of https://icu-project.org/docs/papers/utf16_code_point_order.html:

This opens the door for a "fix-up" of code unit values that is faster than assembling 21-bit code point values.

OMG
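
For readers curious what such a fix-up might look like in JS, here is a sketch under stated assumptions. It compares code units directly and, only at the first differing position, remaps units at or above 0xD800 so that surrogates sort after U+E000 through U+FFFF. The names fixUp and compareByCodePointFixUp are hypothetical, this is my reading of the idea rather than a transcription of the paper or of Endo's code, and it assumes well-formed strings (no lone surrogates).

```javascript
// Remap a code unit so that code unit order matches code point order:
//   0x0000..0xD7FF stay put,
//   0xE000..0xFFFF move down to 0xD800..0xF7FF,
//   0xD800..0xDFFF (surrogates) move up to 0xF800..0xFFFF.
const fixUp = unit => {
  if (unit < 0xd800) return unit;
  return unit >= 0xe000 ? unit - 0x800 : unit + 0x2000;
};

// Hypothetical comparator: no 21-bit code points are ever assembled.
const compareByCodePointFixUp = (left, right) => {
  const limit = Math.min(left.length, right.length);
  for (let i = 0; i < limit; i += 1) {
    const l = left.charCodeAt(i);
    const r = right.charCodeAt(i);
    if (l !== r) {
      // Only the first differing code unit decides the result.
      return fixUp(l) < fixUp(r) ? -1 : 1;
    }
  }
  if (left.length === right.length) return 0;
  return left.length < right.length ? -1 : 1;
};

// U+FF61 sorts before U+1F600, unlike under the native `<` operator.
const fixUpResult = compareByCodePointFixUp('\uFF61', '\u{1F600}'); // -1
```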

erights avatar Jan 29 '24 05:01 erights