fix(marshal)!: compare strings by codepoint
closes: #2113
refs: #2002
Description
- JavaScript's relational comparison operations, as exposed in methods like `Array.prototype.sort` and operators `<`/`<=`/`>=`/`>`, compare strings by lexicographic UTF-16 code unit order, which exposes an internal representational detail not relevant to the string's meaning as a Unicode string. Previously, `compareRank` and associated functions compared strings using this JavaScript-native comparison. Now `compareRank` and associated functions compare strings by lexicographic Unicode code point order. This change only affects comparison of characters in the range U+E000 through U+FFFF vs. supplementary-plane characters starting at U+10000 [i.e., those whose code point does not fit in 16 bits], and therefore only collections of strings including characters from both of those ranges.
- This release does not change the `encodePassable` encoding. But now, when we say it is order preserving, we need to be careful about which order we mean. `encodePassable` is rank-order preserving when the encoded strings are compared using `compareRank`.
- The key order of strings defined by the `@endo/patterns` module is still defined to be the same as the rank ordering of those strings. So this release changes key order among strings to also be lexicographic comparison of Unicode code points. To accommodate this change, you may need to adapt applications that relied on key-order being the same as JS native order. This could include the use of any patterns expressing key inequality tests, like `M.gte(string)`.
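To make the difference concrete, here is a minimal sketch contrasting the two orders on one BMP character above the surrogate range and one supplementary-plane character. The helper `compareByCodePoint` is a hypothetical name written for this illustration; it is not the actual `compareRank` implementation and only shows the kind of comparison described above.

```javascript
// Hypothetical helper for illustration only; not Endo's `compareRank`.
// Compares two strings by lexicographic Unicode code point order.
const compareByCodePoint = (left, right) => {
  // String iterators step by code point, not by UTF-16 code unit.
  const leftIter = left[Symbol.iterator]();
  const rightIter = right[Symbol.iterator]();
  for (;;) {
    const { value: l, done: lDone } = leftIter.next();
    const { value: r, done: rDone } = rightIter.next();
    if (lDone || rDone) {
      // Equal if both ended together; otherwise the prefix sorts first.
      if (lDone && rDone) return 0;
      return lDone ? -1 : 1;
    }
    const lcp = l.codePointAt(0);
    const rcp = r.codePointAt(0);
    if (lcp !== rcp) return lcp < rcp ? -1 : 1;
  }
};

const bmpHigh = '\uFFFD'; // U+FFFD, a single code unit 0xFFFD
const supplementary = '\u{1F600}'; // U+1F600, code units 0xD83D 0xDE00

// Native order compares code units: 0xD83D < 0xFFFD, so U+1F600 sorts first.
const nativeOrder = [bmpHigh, supplementary].sort();

// Code point order compares 0xFFFD < 0x1F600, so U+FFFD sorts first.
const codePointOrder = [bmpHigh, supplementary].sort(compareByCodePoint);
```

Strings made only of characters below U+E000, or only of BMP characters, sort identically under both comparisons.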
Security Considerations
The fact that the string ordering is closer to the Unicode semantics of the strings probably minimizes some surprises in ways that help security. OTOH, this difference from JS native string ordering probably causes other surprises that hurt security. Altogether, we do not expect much effect.
Scaling Considerations
As a comparison written in JS, this will be slower than the JS native string comparison. On XS at least, we expect to have a native code point comparison function available eventually. Altogether, we do not expect much effect.
Documentation Considerations
Most developers will not care. But it needs to be explained somewhere carefully so that developers that do care can easily find out.
Testing Considerations
@gibson042, in a later PR, could you expand the property-based testing to generate test cases sensitive to this change?
Compatibility Considerations
- These string ordering changes bring Endo into conformance with any string ordering components of the OCapN standard.
- To accommodate these changes, you may need to adapt applications that relied on rank-order or key-order being the same as JS native order. You may need to re-sort any data that had previously been rank sorted using the prior `compareRank` function. You may need to revisit any use of patterns like `M.gte(string)` expressing inequalities over strings.
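As a rough aid for deciding whether previously sorted data could be affected, the following sketch performs a conservative check based on the ranges named in the description. The names `hasSurrogate`, `hasHighBmp`, and `mayNeedResort` are hypothetical and are not part of any Endo package. The check may report `true` for collections whose order does not actually change, but it should not miss a collection whose order does change.

```javascript
// Hypothetical helpers for illustration; not part of any Endo package.

// True if the string contains a surrogate code unit, i.e. (for well-formed
// strings) at least one supplementary-plane character at or above U+10000.
// Without the `u` flag, the regular expression matches individual code units.
const hasSurrogate = s => /[\uD800-\uDFFF]/.test(s);

// True if the string contains a BMP character in U+E000 through U+FFFF.
const hasHighBmp = s => /[\uE000-\uFFFF]/.test(s);

// The two orders can only disagree when a code unit in U+E000..U+FFFF is
// compared against a surrogate code unit, so both kinds must be present
// somewhere in the collection for the sort order to change.
const mayNeedResort = strings =>
  strings.some(hasSurrogate) && strings.some(hasHighBmp);
```

For example, `mayNeedResort(['abc', 'é'])` is `false`, while `mayNeedResort(['\uFFFD', '\u{1F600}'])` is `true`.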
Upgrade Considerations
If we currently have any persistent data, especially on chain, sorted according to JS native order (by UTF-16 code unit), then we cannot accept this PR until we have a plan to re-sort that data, or to somehow continue to live with mis-sorted data. (Historical note: This is how Oracle came to permanently rely on UTF-16 code unit order, because of the impracticality of re-sorting all that data.)
- [ ] Includes `*BREAKING*:` in the commit message with migration instructions for any breaking change.
- [x] Updates `NEWS.md` for user-facing changes.
Excellent.
For this change, I do not think we can avoid the breaking change marker. That might render my argument for leaving it out of `pass-style` moot.
Let me be sure I understand:
You're saying that this PR should keep the "!". Given that, we may as well keep the "!" on #2002 as well. Right?
Yes
Just noting here for curiosity. In the UTF-16 portion of https://icu-project.org/docs/papers/utf16_code_point_order.html:
> This opens the door for a "fix-up" of code unit values that is faster than assembling 21-bit code point values.
OMG
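For readers curious about the "fix-up" idea mentioned above, here is a sketch of one way it can be done, as I understand the technique: remap each code unit so that surrogates sort above U+E000 through U+FFFF, then compare the remapped units directly. The names `fixUp` and `compareByFixedUpCodeUnits` are hypothetical, and this is not what this PR implements. It assumes well-formed strings; a lone surrogate would sort above U+FFFF here, which differs from treating it as a code point in U+D800 through U+DFFF.

```javascript
// Hypothetical sketch of a code unit "fix-up"; assumes well-formed UTF-16.

// Remap a code unit so that numeric order matches code point order:
//   0x0000..0xD7FF stay in place,
//   0xE000..0xFFFF shift down to 0xD800..0xF7FF,
//   0xD800..0xDFFF (surrogates) shift up to 0xF800..0xFFFF.
const fixUp = unit => {
  if (unit < 0xd800) return unit;
  return unit >= 0xe000 ? unit - 0x800 : unit + 0x2000;
};

const compareByFixedUpCodeUnits = (left, right) => {
  const length = Math.min(left.length, right.length);
  for (let i = 0; i < length; i += 1) {
    const l = left.charCodeAt(i);
    const r = right.charCodeAt(i);
    if (l !== r) {
      // Only the first differing code unit needs the fix-up.
      return fixUp(l) < fixUp(r) ? -1 : 1;
    }
  }
  // One string is a prefix of the other (or they are equal).
  if (left.length === right.length) return 0;
  return left.length < right.length ? -1 : 1;
};
```

The appeal of this approach is that the loop stays on code units and never assembles a 21-bit code point from a surrogate pair.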