icu4x
Consistently deal with string encodings over FFI
Currently we use `&str` over FFI when we're accepting strings from the user (which is not that common, but crops up in segmenter, collator, etc.). Over FFI, this maps to `std::string_view` in C++ and a `String` in JS (which gets auto-converted to UTF-8, throwing on surrogate code points). `string_view` is not guaranteed to be UTF-8 in C++.
Eventually I would like Diplomat to allow for:
- A config option that forces `str` to get validated over C++ FFI before being passed in. (This can be done without changing APIs.)
- More string types: one for UTF-16, etc.
- Perhaps a way to differentiate between "assume-validated" strings, "unvalidated" strings, and "diplomat-will-validate-for-you" strings
But for now we should be consistent. The easiest way to do this safely is to always accept `&str` but to always validate it as valid UTF-8 immediately. The alternative is to accept `&str` and document that it's up to C++ callers to ensure it's UTF-8.
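A minimal sketch of the "always validate immediately" option, using only std; the function name and shape are illustrative, not actual ICU4X FFI symbols:

```rust
use std::str;

// Hypothetical FFI bridge: the C++ side hands us bytes that are only
// *claimed* to be UTF-8. We validate before constructing a &str, so the
// Rust side never sees ill-formed input.
fn bridge_segment(data: &[u8]) -> Result<usize, str::Utf8Error> {
    // from_utf8 checks well-formedness; invalid input becomes an early
    // error instead of a safety hazard.
    let s: &str = str::from_utf8(data)?;
    // ...the real &str-taking API would be called here; counting chars
    // stands in for it.
    Ok(s.chars().count())
}

fn main() {
    assert_eq!(bridge_segment(b"hello").unwrap(), 5);
    // 0xFF can never appear in well-formed UTF-8, so validation fails.
    assert!(bridge_segment(&[0x68, 0xFF]).is_err());
}
```

This is the "early error" behavior; the alternative discussed below is to route ill-formed input through a replacement-character path instead.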
Collator's happening in https://github.com/unicode-org/icu4x/pull/2498, but I think we'll need to handle this for bidi and segmenter. It ought to be a matter of simply casting to a byte slice, validating it, and returning an error if not.
cc @sffc @hsivonen
I'm not convinced that consistency at the cost of perf is the right way to go. I am in favor of consistency in terms of types, i.e. `std::string_view` for UTF-8-ish input, but I have doubts about it being good to favor consistency on the point of which memory-safe behavior (early error vs. as-if-replacement-characters) takes place in case of ill-formedness. Notably, an up-front well-formedness check pessimizes the collator more than letting the collator check a string for well-formedness as it goes, because the collator returns early when it finds a primary-level difference early in the string.
Is it schedule-wise infeasible for 1.0 to have a Diplomat directive that would allow a binding to choose between two options for `&str`-as-argument treatment for a given method?
- Check well-formedness on the Diplomat layer and error out early, otherwise route to the `&str` back end (for segmenter and bidi; I gather they don't have the back-end code to deal with potentially-ill-formed UTF-8)
- Don't check well-formedness and route to the back end's `&[u8]` flavor when the back end can deal (for normalizer and collator, which are both designed to deal with the three cases: guaranteed-well-formed UTF-8, potentially-ill-formed UTF-8, and potentially-ill-formed UTF-16)
Also, would the check go on the C++ layer only or on the C/FFI layer? I.e. would JS go through a conversion from UTF-16 to guaranteed-well-formed UTF-8 and still undergo a UTF-8 well-formedness check subsequently?
At the moment? It would have to go through the FFI layer. In the future we can make this configurable on Diplomat so that it automatically does this for C++.
(It is not a hard change to do and I am open to having that configurability for C++ now as well, however I personally have higher priority things to work on and this can be done post-1.0 without breaking things so I'd rather punt.)
Considering that the normalizer and the collator have been designed to be able to deal with all three of
- guaranteed-well-formed UTF-8
- potentially-ill-formed UTF-8
- potentially-ill-formed UTF-16
performance-wise optimally, I'm uncomfortable with not letting that show through to the intended beneficiaries. Even though there are actual perf problems in comparison with ICU4C even given perfect FFI, it would be sad for the ICU4X normalizer & collator to be perceived as slower than they have to be due to extra validation that doesn't logically have to be there.
(But then, my current priorities don't allow for committing to implementing the requisite Diplomat configurability at this time.)
I don't see how this issue precludes any of that.
For APIs that support different kinds of inputs we can have different methods. We can even add the unvalidated one later if we don't want to start with it.
We can also fix the problem with the validation cost later.
Neither of these things are breaking.
We do not have time to do the ideal FFI for 1.0, this issue is about what we need to do for 1.0. Fortunately this can be fixed shortly afterwards without breaking, provided we prep for it and that's what this is about.
Okay, so here's a plan, which I discussed a bit with @sffc
The general idea is that for 1.0 we will expose `std::string_view` APIs but not assume valid UTF-8 (except in one case around bidi), plus additional UTF-16/etc. APIs if desired. In 1.x we will have the Diplomat support to avoid any double-validation cost and expose more thoughtful per-language APIs.
The 1.x plans will need https://github.com/rust-diplomat/diplomat/issues/240 (plus more configurability around string types), https://github.com/rust-diplomat/diplomat/issues/233, and perhaps the renaming part of https://github.com/rust-diplomat/diplomat/issues/234. The 1.0 plans are doable today with some minor fixes to ICU4X APIs.
There are three categories of stringy APIs today:
If potentially invalid UTF-8 is supported as an API
This applies to collator, normalizer, and segmenter. Segmenter doesn't support potentially invalid UTF-8 now, but it can, by using https://docs.rs/utf8_iter/latest/utf8_iter/
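For context, the utf8_iter crate iterates over the `char`s of potentially ill-formed `&[u8]` without allocating, yielding U+FFFD for invalid sequences. A std-only approximation of that behavior (illustrative; not utf8_iter's actual implementation) can be built on `Utf8Error`'s position information:

```rust
// Std-only sketch of a "potentially invalid UTF-8" path: walk the bytes,
// emitting the valid chunks as chars and U+FFFD for each invalid sequence.
// Crates like utf8_iter provide this as a non-allocating iterator.
fn chars_lossy(mut bytes: &[u8]) -> Vec<char> {
    let mut out = Vec::new();
    loop {
        match std::str::from_utf8(bytes) {
            Ok(s) => {
                // Remainder is fully valid; we're done.
                out.extend(s.chars());
                return out;
            }
            Err(e) => {
                let (valid, rest) = bytes.split_at(e.valid_up_to());
                // Safe: from_utf8 vouched for the first valid_up_to() bytes.
                out.extend(std::str::from_utf8(valid).unwrap().chars());
                out.push(char::REPLACEMENT_CHARACTER);
                match e.error_len() {
                    Some(n) => bytes = &rest[n..], // skip the invalid bytes
                    None => return out,            // truncated sequence at end
                }
            }
        }
    }
}

fn main() {
    assert_eq!(chars_lossy(&[0x61, 0xFF, 0x62]), vec!['a', '\u{FFFD}', 'b']);
    assert_eq!(chars_lossy("héllo".as_bytes()), "héllo".chars().collect::<Vec<_>>());
}
```

The point of the design discussion above is that this replacement happens lazily, as the consumer reaches each byte, rather than as a separate up-front validation pass.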
Using https://github.com/unicode-org/icu4x/pull/2498's `compare` as an example, but this would also apply to the normalizer functions:
For 1.0:
- `foo()` takes `&str` (`std::string_view` in C++), uses the potentially-invalid-UTF-8 path
- `foo_utf16()` takes `&[u16]` (`std::span<uint16_t>`), assumes potentially invalid
For 1.x:
- `foo()` takes `&[u8]` (on the Rust side) but still maps to `std::string_view` and `String` (via some annotation) in C++ and JS, uses the potentially-invalid-UTF-8 path
  - We can use function renaming so that the JS function can use the valid-UTF-8 codepath instead (i.e., we'd have `foo_valid_utf8()` in JS but renamed to `foo()`, and `foo()` is disabled)
- `foo_valid_utf8()` takes `&str`, maps to `std::string_view` in C++. Is disabled or renamed for JS.
- `foo_utf16()` takes `&[u16]`, maps to nice u16 string types for C++ and JS, still assumes potentially invalid since we do not require surrogate pairs
- `foo_valid_utf16()` could be added in the future, though it's not something asked for yet and I don't think we need it
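On the Rust side, the 1.x method family sketched above might look like this (names are hypothetical placeholders from the list, not real ICU4X symbols, and the Diplomat annotations that control the FFI mapping are omitted):

```rust
// Sketch of the proposed method family. In 1.0, foo() takes &str; in 1.x
// it widens to &[u8] while still mapping to std::string_view in C++.
struct Thing;

impl Thing {
    /// Potentially-invalid-UTF-8 path (the 1.x default).
    fn foo(&self, input: &[u8]) -> usize {
        // Stand-in body: a lossy decode tolerates ill-formed input.
        String::from_utf8_lossy(input).chars().count()
    }

    /// Guaranteed-valid UTF-8; Rust callers and (renamed) JS get this.
    fn foo_valid_utf8(&self, input: &str) -> usize {
        input.chars().count()
    }

    /// Potentially-ill-formed UTF-16: unpaired surrogates are tolerated.
    fn foo_utf16(&self, input: &[u16]) -> usize {
        std::char::decode_utf16(input.iter().copied()).count()
    }
}

fn main() {
    let t = Thing;
    assert_eq!(t.foo(b"ab"), 2);
    assert_eq!(t.foo_valid_utf8("héllo"), 5);
    // 'a' plus a lone surrogate: decoded as two units, no up-front check.
    assert_eq!(t.foo_utf16(&[0x0061, 0xD800]), 2);
}
```

The bodies are stand-ins; the interesting part is the signature split, which lets each binding pick the cheapest correct entry point.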
APIs that need UTF8
ListFormatter is an example of this, as are FixedDecimal, Locale, etc. FsDataProvider should move over to bytes. We may for now exempt bidi from this and have it just documented as requiring valid UTF-8.
For 1.0:
Accept `&str`, but revalidate in Rust bridge code; JS eats a double validation cost. ListFormatter needs to allocate anyway so this isn't a huge deal.
If possible, use a bytes API instead (e.g. FixedDecimal, Locale, PluralOperands, etc.), exposed as `&str` but not assumed to be valid.
For 1.x:
Potentially two APIs: the default API accepts `&[u8]` but maps to idiomatic string types over FFI, paying the validation cost only if necessary (so not in JS). An additional API can exist for C++ that assumes valid UTF-8.
Stringy APIs around identifiers
These go to TinyStr; we should just ensure we use `TinyStr::from_bytes()` and switch over from `std::string_view` to `&[u8]`. After 1.x we can have Diplomat generate `&[u8]` APIs that map to idiomatic string types over FFI to prevent footguns.
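The reason byte-based entry points like `TinyStr::from_bytes()` work well here is that identifier subtags are ASCII, so validating `&[u8]` directly never needs a UTF-8 well-formedness pass. A std-only sketch of that idea (illustrative; not TinyStr's or Locale's actual logic, and the 2–3 letter rule is a simplification of BCP 47):

```rust
// Hypothetical byte-level check for a primary language subtag: 2-3 ASCII
// letters. Because the check is per-byte, arbitrary non-UTF-8 input is
// rejected naturally, with no separate validation step.
fn is_language_subtag(bytes: &[u8]) -> bool {
    (2..=3).contains(&bytes.len()) && bytes.iter().all(|b| b.is_ascii_alphabetic())
}

fn main() {
    assert!(is_language_subtag(b"en"));
    assert!(!is_language_subtag(b"e"));
    // Ill-formed or non-ASCII bytes fail the same per-byte check.
    assert!(!is_language_subtag(&[0xFF, 0xFE]));
}
```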
The UTF-8 part of the above makes sense. Thank you.
Why do you say `&[u16]` needs to be validated? What UTF-16-taking APIs do we have that don't already deal with unpaired surrogates? Both the collator and the normalizer deal with ill-formed UTF-16 by as-iffing unpaired surrogates to U+FFFD, so up-front validation is unwanted. Do callers that have guaranteed-well-formed UTF-16 even exist?
> Why do you say `&[u16]` needs to be validated? What UTF-16-taking APIs do we have that don't already deal with unpaired surrogates? Both the collator and the normalizer deal with ill-formed UTF-16 by as-iffing unpaired surrogates to U+FFFD, so up-front validation is unwanted. Do callers that have guaranteed-well-formed UTF-16 even exist?
Sorry, I should rephrase, when I say "needs to be validated", I mean "should not be assumed to be valid", which in our case means we don't validate (but we also don't write code assuming validity)
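The "don't assume valid, but don't pre-validate either" treatment of UTF-16 can be sketched with std's `decode_utf16`, which surfaces each unpaired surrogate as an error exactly when the iterator reaches it, so the caller can map it to U+FFFD with no up-front pass (a sketch of the behavior described, not ICU4X's internals):

```rust
// Lazily decode potentially-ill-formed UTF-16: valid pairs become their
// code point, and each unpaired surrogate becomes U+FFFD as it is reached.
fn utf16_chars_lossy(units: &[u16]) -> Vec<char> {
    std::char::decode_utf16(units.iter().copied())
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}

fn main() {
    // A lone 0xD800 surrogate is replaced in place.
    assert_eq!(
        utf16_chars_lossy(&[0x0061, 0xD800, 0x0062]),
        vec!['a', '\u{FFFD}', 'b']
    );
    // 0xD83D 0xDE00 is a well-formed pair (U+1F600).
    assert_eq!(utf16_chars_lossy(&[0xD83D, 0xDE00]), vec!['\u{1F600}']);
}
```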
- @younies - How the Bidi algorithm works is that no matter the encoding, it gives levels to chunks. So the algorithm seems easy to update.
- @Manishearth - The APIs would need to be updated, too. Everything uses `&str` but we need it to take `&[u8]`.
- @younies - That sounds like something doable.
- @Manishearth - Yeah, we'd just need to design that layer. It's an API design question. It's not technically challenging; it's challenging from a library design point of view.
- @younies - OK; we can file a bug and do it post-1.0.
- @robertbastian - I'm wondering what our motivation is to support invalid UTF-8. That seems like it is catering for bad programming practices.
- @Manishearth - Invalid UTF-8 exists; in C++ it's not the norm for strings to be guaranteed valid UTF-8. It's a Rust safety thing. The situation is different from UTF-16; languages like Java and JavaScript don't guarantee that the UTF-16 is valid. So this reflects norms over FFI into other languages. I recognize that as ICU4X we should try to push the needle, but we shouldn't sacrifice safety. We deal with untrusted string inputs. It's a matter of norms.
- @sffc - It's like GIGO behavior, not dissimilar to data provider. It's better to have GIGO than UB.
With #2534 and #2542, the 1.0 plan has been fully implemented. This is now a 1.x issue.