icu4x
Consistently deal with string encodings over FFI
Currently we use `&str` over FFI when we're accepting strings from the user (which is not that common, but crops up in segmenter, collator, etc.). Over FFI, this maps to `std::string_view` in C++ and a `String` in JS (which gets auto-converted to UTF-8, throwing on surrogate code points). `string_view` is not guaranteed to be UTF-8 in C++.
Eventually I would like Diplomat to allow for:
- A config option that forces `str` to get validated over C++ FFI before being passed in. (This can be done without changing APIs.)
- More string types: one for UTF-16, etc.
- Perhaps a way to differentiate between "assume-validated" strings, "unvalidated" strings, and "diplomat-will-validate-for-you" strings
But for now we should be consistent. The easiest way to do this safely is to always accept `&str` but to always validate it as valid UTF-8 immediately. The alternative is to accept `&str` and document that it's up to C++ callers to ensure it's UTF-8.
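A minimal sketch of the "always validate immediately" option, using only std; the function name and shape are illustrative, not actual ICU4X FFI symbols:

```rust
use std::str;

// Hypothetical FFI bridge: the C++ side hands us bytes that are only
// *claimed* to be UTF-8. We validate before constructing a &str, so the
// Rust side never sees ill-formed input.
fn bridge_segment(data: &[u8]) -> Result<usize, str::Utf8Error> {
    // from_utf8 checks well-formedness; invalid input becomes an early
    // error instead of a safety hazard.
    let s: &str = str::from_utf8(data)?;
    // ...the real &str-taking API would be called here; counting chars
    // stands in for it.
    Ok(s.chars().count())
}

fn main() {
    assert_eq!(bridge_segment(b"hello").unwrap(), 5);
    // 0xFF can never appear in well-formed UTF-8, so validation fails.
    assert!(bridge_segment(&[0x68, 0xFF]).is_err());
}
```

This is the "early error" behavior; the alternative discussed below is to route ill-formed input through a replacement-character path instead.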
Collator's happening in https://github.com/unicode-org/icu4x/pull/2498, but I think we'll need to handle this for bidi and segmenter. It ought to be a matter of simply casting to a byte slice, validating it, and returning an error if not.
cc @sffc @hsivonen
I'm not convinced that consistency at the cost of perf is the right way to go. I am in favor of consistency in terms of types, i.e. `std::string_view` for UTF-8-ish input, but I have doubts about it being good to favor consistency on the point of which memory-safe behavior (early error vs. as-if-replacement-characters) takes place in case of ill-formedness. Notably, an up-front well-formedness check pessimizes the collator more than letting the collator check a string for well-formedness as it goes, because the collator returns early when it finds a primary-level difference early in the string.
Is it schedule-wise infeasible for 1.0 to have a Diplomat directive that would allow a binding to choose between two options for `&str`-as-argument treatment for a given method?
- Check well-formedness on the Diplomat layer and error out early, otherwise route to the `&str` back end (for segmenter and bidi; I gather they don't have the back-end code to deal with potentially-ill-formed UTF-8)
- Don't check well-formedness and route to the back end's `&[u8]` flavor when the back end can deal (for normalizer and collator, which are both designed to deal with the three cases: guaranteed-well-formed UTF-8, potentially-ill-formed UTF-8, and potentially-ill-formed UTF-16)
Also, would the check go on the C++ layer only or on the C/FFI layer? I.e. would JS go through a conversion from UTF-16 to guaranteed-well-formed UTF-8 and still undergo a UTF-8 well-formedness check subsequently?
At the moment? It would have to go through the FFI layer. In the future we can make this configurable on Diplomat so that it automatically does this for C++.
(It is not a hard change to do and I am open to having that configurability for C++ now as well, however I personally have higher priority things to work on and this can be done post-1.0 without breaking things so I'd rather punt.)
Considering that the normalizer and the collator have been designed to be able to deal with all three of
- guaranteed-well-formed UTF-8
- potentially-ill-formed UTF-8
- potentially-ill-formed UTF-16
performance-wise optimally, I'm uncomfortable with not letting that show through to the intended beneficiaries. Even though there are actual perf problems in comparison with ICU4C even given perfect FFI, it would be sad for the ICU4X normalizer & collator to be perceived as slower than they have to be due to extra validation that doesn't logically have to be there.
(But then, my current priorities don't allow for committing to implementing the requisite Diplomat configurability at this time.)
I don't see how this issue precludes any of that.
For APIs that support different kinds of inputs we can have different methods. We can even add the unvalidated one later if we don't want to start with it.
We can also fix the problem with the validation cost later.
Neither of these things are breaking.
We do not have time to do the ideal FFI for 1.0, this issue is about what we need to do for 1.0. Fortunately this can be fixed shortly afterwards without breaking, provided we prep for it and that's what this is about.
Okay, so here's a plan, which I discussed a bit with @sffc
The general idea is that for 1.0 we will expose `std::string_view` APIs but not assume valid UTF-8 (except in one case around bidi), plus additional UTF-16/etc. APIs if desired. In 1.x we will have the Diplomat support to avoid any double-validation cost and expose more thoughtful per-language APIs.
The 1.x plans will need https://github.com/rust-diplomat/diplomat/issues/240 (plus more configurability around string types), https://github.com/rust-diplomat/diplomat/issues/233, and perhaps the renaming part of https://github.com/rust-diplomat/diplomat/issues/234. The 1.0 plans are doable today with some minor fixes to ICU4X APIs.
There are three categories of stringy APIs today:
If potentially invalid UTF-8 is supported as an API
This applies to collator, normalizer, and segmenter. Segmenter doesn't support potentially invalid UTF-8 now, but it can, by using https://docs.rs/utf8_iter/latest/utf8_iter/
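For context, the utf8_iter crate iterates over the `char`s of potentially ill-formed `&[u8]` without allocating, yielding U+FFFD for invalid sequences. A std-only approximation of that behavior (illustrative; not utf8_iter's actual implementation) can be built on `Utf8Error`'s position information:

```rust
// Std-only sketch of a "potentially invalid UTF-8" path: walk the bytes,
// emitting the valid chunks as chars and U+FFFD for each invalid sequence.
// Crates like utf8_iter provide this as a non-allocating iterator.
fn chars_lossy(mut bytes: &[u8]) -> Vec<char> {
    let mut out = Vec::new();
    loop {
        match std::str::from_utf8(bytes) {
            Ok(s) => {
                // Remainder is fully valid; we're done.
                out.extend(s.chars());
                return out;
            }
            Err(e) => {
                let (valid, rest) = bytes.split_at(e.valid_up_to());
                // Safe: from_utf8 vouched for the first valid_up_to() bytes.
                out.extend(std::str::from_utf8(valid).unwrap().chars());
                out.push(char::REPLACEMENT_CHARACTER);
                match e.error_len() {
                    Some(n) => bytes = &rest[n..], // skip the invalid bytes
                    None => return out,            // truncated sequence at end
                }
            }
        }
    }
}

fn main() {
    assert_eq!(chars_lossy(&[0x61, 0xFF, 0x62]), vec!['a', '\u{FFFD}', 'b']);
    assert_eq!(chars_lossy("héllo".as_bytes()), "héllo".chars().collect::<Vec<_>>());
}
```

The point of the design discussion above is that this replacement happens lazily, as the consumer reaches each byte, rather than as a separate up-front validation pass.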
Using https://github.com/unicode-org/icu4x/pull/2498's `compare` as an example, but this would also apply to the normalizer functions:
For 1.0:
- `foo()` takes `&str` (`std::string_view` in C++), uses the potentially-invalid-UTF-8 path
- `foo_utf16()` takes `&[u16]` (`std::span<uint16_t>`), assumes potentially invalid
For 1.x:
- `foo()` takes `&[u8]` (on the Rust side) but still maps to `std::string_view` and `String` (via some annotation) in C++ and JS, uses the potentially-invalid-UTF-8 path
  - We can use function renaming so that the JS function can use the valid-UTF-8 codepath instead (i.e., we'd have `foo_valid_utf8()` in JS but renamed to `foo()`, and `foo()` is disabled)
- `foo_valid_utf8()` takes `&str`, maps to `std::string_view` in C++. Is disabled or renamed for JS.
- `foo_utf16()` takes `&[u16]`, maps to nice u16 string types for C++ and JS, still assumes potentially invalid since we do not require surrogate pairs
- `foo_valid_utf16()` could be added in the future, though it's not something asked for yet and I don't think we need it
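On the Rust side, the 1.x method family sketched above might look like this (names are hypothetical placeholders from the list, not real ICU4X symbols, and the Diplomat annotations that control the FFI mapping are omitted):

```rust
// Sketch of the proposed method family. In 1.0, foo() takes &str; in 1.x
// it widens to &[u8] while still mapping to std::string_view in C++.
struct Thing;

impl Thing {
    /// Potentially-invalid-UTF-8 path (the 1.x default).
    fn foo(&self, input: &[u8]) -> usize {
        // Stand-in body: a lossy decode tolerates ill-formed input.
        String::from_utf8_lossy(input).chars().count()
    }

    /// Guaranteed-valid UTF-8; Rust callers and (renamed) JS get this.
    fn foo_valid_utf8(&self, input: &str) -> usize {
        input.chars().count()
    }

    /// Potentially-ill-formed UTF-16: unpaired surrogates are tolerated.
    fn foo_utf16(&self, input: &[u16]) -> usize {
        std::char::decode_utf16(input.iter().copied()).count()
    }
}

fn main() {
    let t = Thing;
    assert_eq!(t.foo(b"ab"), 2);
    assert_eq!(t.foo_valid_utf8("héllo"), 5);
    // 'a' plus a lone surrogate: decoded as two units, no up-front check.
    assert_eq!(t.foo_utf16(&[0x0061, 0xD800]), 2);
}
```

The bodies are stand-ins; the interesting part is the signature split, which lets each binding pick the cheapest correct entry point.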
APIs that need UTF8
ListFormatter is an example of this, as are FixedDecimal, Locale, etc. FsDataProvider should move over to bytes. We may for now exempt bidi from this and have it just documented as requiring valid UTF-8.
For 1.0:
Accept `&str`, but revalidate in Rust bridge code; JS eats a double validation cost. ListFormatter needs to allocate anyway so this isn't a huge deal.
If possible, use a bytes API instead (e.g. FixedDecimal, Locale, PluralOperands, etc.), exposed as `&str` but not assumed to be valid.
For 1.x:
Potentially two APIs: the default API accepts `&[u8]` but maps to idiomatic string types over FFI, paying the validation cost only if necessary (so not in JS). An additional API can exist for C++ that assumes valid UTF-8.
Stringy APIs around identifiers
These go to TinyStr; we should just ensure we use `TinyStr::from_bytes()` and switch over from `std::string_view` to `&[u8]`. After 1.x we can have Diplomat generate `&[u8]` APIs that map to idiomatic string types over FFI to prevent footguns.
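The reason byte-based entry points like `TinyStr::from_bytes()` work well here is that identifier subtags are ASCII, so validating `&[u8]` directly never needs a UTF-8 well-formedness pass. A std-only sketch of that idea (illustrative; not TinyStr's or Locale's actual logic, and the 2–3 letter rule is a simplification of BCP 47):

```rust
// Hypothetical byte-level check for a primary language subtag: 2-3 ASCII
// letters. Because the check is per-byte, arbitrary non-UTF-8 input is
// rejected naturally, with no separate validation step.
fn is_language_subtag(bytes: &[u8]) -> bool {
    (2..=3).contains(&bytes.len()) && bytes.iter().all(|b| b.is_ascii_alphabetic())
}

fn main() {
    assert!(is_language_subtag(b"en"));
    assert!(!is_language_subtag(b"e"));
    // Ill-formed or non-ASCII bytes fail the same per-byte check.
    assert!(!is_language_subtag(&[0xFF, 0xFE]));
}
```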
The UTF-8 part of the above makes sense. Thank you.
Why do you say `&[u16]` needs to be validated? What UTF-16-taking APIs do we have that don't already deal with unpaired surrogates? Both the collator and the normalizer deal with ill-formed UTF-16 by as-iffing unpaired surrogates to U+FFFD, so up-front validation is unwanted. Do callers that have guaranteed-well-formed UTF-16 even exist?
> Why do you say `&[u16]` needs to be validated? What UTF-16-taking APIs do we have that don't already deal with unpaired surrogates? Both the collator and the normalizer deal with ill-formed UTF-16 by as-iffing unpaired surrogates to U+FFFD, so up-front validation is unwanted. Do callers that have guaranteed-well-formed UTF-16 even exist?
Sorry, I should rephrase, when I say "needs to be validated", I mean "should not be assumed to be valid", which in our case means we don't validate (but we also don't write code assuming validity)
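The "don't assume valid, but don't pre-validate either" treatment of UTF-16 can be sketched with std's `decode_utf16`, which surfaces each unpaired surrogate as an error exactly when the iterator reaches it, so the caller can map it to U+FFFD with no up-front pass (a sketch of the behavior described, not ICU4X's internals):

```rust
// Lazily decode potentially-ill-formed UTF-16: valid pairs become their
// code point, and each unpaired surrogate becomes U+FFFD as it is reached.
fn utf16_chars_lossy(units: &[u16]) -> Vec<char> {
    std::char::decode_utf16(units.iter().copied())
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}

fn main() {
    // A lone 0xD800 surrogate is replaced in place.
    assert_eq!(
        utf16_chars_lossy(&[0x0061, 0xD800, 0x0062]),
        vec!['a', '\u{FFFD}', 'b']
    );
    // 0xD83D 0xDE00 is a well-formed pair (U+1F600).
    assert_eq!(utf16_chars_lossy(&[0xD83D, 0xDE00]), vec!['\u{1F600}']);
}
```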
- @younies - How the Bidi algorithm works is that no matter the encoding, it gives levels to chunks. So the algorithm seems easy to update.
- @Manishearth - The APIs would need to be updated, too. Everything uses `&str` but we need it to take `&[u8]`.
- @younies - That sounds like something doable.
- @Manishearth - Yeah, we'd just need to design that layer. It's an API design question. It's not technically challenging; it's challenging from a library design point of view.
- @younies - OK; we can file a bug and do it post-1.0.
- @robertbastian - I'm wondering what our motivation is to support invalid UTF-8. That seems like it is catering for bad programming practices.
- @Manishearth - Invalid UTF-8 exists; in C++ it's not the norm for strings to be guaranteed valid UTF-8. It's a Rust safety thing. The situation is different from UTF-16; languages like Java and JavaScript don't guarantee that the UTF-16 is valid. So this reflects norms over FFI into other languages. I recognize that as ICU4X we should try to push the needle, but we shouldn't sacrifice safety. We deal with untrusted string inputs. It's a matter of norms.
- @sffc - It's like GIGO behavior, not dissimilar to data provider. It's better to have GIGO than UB.
With #2534 and #2542, the 1.0 plan has been fully implemented. This is now a 1.x issue.