Parsers should support unvalidated UTF16

Open Manishearth opened this issue 10 months ago • 3 comments

Currently the parsers, like Instant::from_str(), require a validated UTF8 str.

JS engines typically have UTF16 or unvalidated UTF8.

It would be nice if these parsers were written to consume &[u8] (most of the parsing is ASCII only anyway), so we could at least operate on unvalidated UTF8 and have from_utf8_bytes() functions.
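To make the idea concrete, here is a minimal sketch of a parser that consumes &[u8] directly. It is not the temporal_rs or ixdtf implementation: `SimpleDate`, `ParseError`, `digits`, and `expect` are hypothetical names made up for this example, and the grammar is reduced to a bare `YYYY-MM-DD` with no range checks. Only `from_utf8_bytes()` takes its name from the proposal above. The point it illustrates is that an ASCII-only grammar never needs the input to be valid UTF8, because any non-ASCII or malformed byte just fails the per-byte check.

```rust
/// Hypothetical error type for this sketch; not part of temporal_rs.
#[derive(Debug, PartialEq, Eq)]
pub enum ParseError {
    UnexpectedEnd,
    InvalidByte(u8),
}

/// Hypothetical, heavily simplified date (no range validation).
#[derive(Debug, PartialEq, Eq)]
pub struct SimpleDate {
    pub year: u16,
    pub month: u8,
    pub day: u8,
}

/// Reads exactly `n` ASCII digits starting at `*pos` and advances `*pos`.
/// Callers in this sketch only pass n <= 4, so the value fits in a u16.
fn digits(src: &[u8], pos: &mut usize, n: usize) -> Result<u16, ParseError> {
    let mut value: u16 = 0;
    for _ in 0..n {
        let b = *src.get(*pos).ok_or(ParseError::UnexpectedEnd)?;
        if !b.is_ascii_digit() {
            // Covers invalid UTF8 too: such bytes are simply not digits.
            return Err(ParseError::InvalidByte(b));
        }
        value = value * 10 + u16::from(b - b'0');
        *pos += 1;
    }
    Ok(value)
}

/// Consumes the single byte `want` at `*pos`, or reports what was found.
fn expect(src: &[u8], pos: &mut usize, want: u8) -> Result<(), ParseError> {
    match src.get(*pos) {
        Some(&b) if b == want => {
            *pos += 1;
            Ok(())
        }
        Some(&b) => Err(ParseError::InvalidByte(b)),
        None => Err(ParseError::UnexpectedEnd),
    }
}

/// Parses `YYYY-MM-DD` from bytes that were never validated as UTF8.
pub fn from_utf8_bytes(src: &[u8]) -> Result<SimpleDate, ParseError> {
    let mut pos = 0;
    let year = digits(src, &mut pos, 4)?;
    expect(src, &mut pos, b'-')?;
    let month = digits(src, &mut pos, 2)? as u8; // at most 99, fits in u8
    expect(src, &mut pos, b'-')?;
    let day = digits(src, &mut pos, 2)? as u8; // at most 99, fits in u8
    match src.get(pos) {
        None => Ok(SimpleDate { year, month, day }),
        Some(&b) => Err(ParseError::InvalidByte(b)),
    }
}
```

With this shape, a `from_str()` wrapper could simply forward `s.as_bytes()`, so the validated and unvalidated entry points would share one code path.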

Ideally we also have UTF16 functions. That would need a tweak to the ixdtf parser.

Manishearth avatar May 01 '25 18:05 Manishearth

cc @nekevss

Manishearth avatar May 01 '25 18:05 Manishearth

One thing I may do in the capi is add from_str_utf8() and from_str_utf16(), which take DiplomatStr and DiplomatStr16 and convert internally. Then over time we can make optimizations to avoid the conversions/checking.
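A rough sketch of the "convert internally" approach follows. Plain `&[u8]` and `&[u16]` slices stand in for DiplomatStr and DiplomatStr16 here, which is an assumption of this example, and `parse_year_bytes` is a hypothetical stand-in for the real byte parser (it accepts exactly four ASCII digits). The UTF16 entry point narrows each code unit to a byte and bails out on anything outside ASCII, which costs one allocation per call; that allocation is the kind of overhead a later optimization could remove.

```rust
/// Hypothetical stand-in for the real parser: exactly four ASCII digits.
fn parse_year_bytes(src: &[u8]) -> Option<u16> {
    if src.len() != 4 {
        return None;
    }
    let mut year: u16 = 0;
    for &b in src {
        if !b.is_ascii_digit() {
            return None;
        }
        year = year * 10 + u16::from(b - b'0');
    }
    Some(year)
}

/// UTF8 entry point: bytes are passed through without validation.
pub fn from_str_utf8(src: &[u8]) -> Option<u16> {
    parse_year_bytes(src)
}

/// UTF16 entry point: narrow to ASCII bytes first, then reuse the byte parser.
pub fn from_str_utf16(src: &[u16]) -> Option<u16> {
    // Collecting into Option<Vec<u8>> stops at the first non-ASCII code unit.
    let bytes: Option<Vec<u8>> = src
        .iter()
        .map(|&unit| if unit < 0x80 { Some(unit as u8) } else { None })
        .collect();
    parse_year_bytes(&bytes?)
}
```

Rejecting non-ASCII code units up front is only safe because the grammar is ASCII-only; surrogate pairs never need to be decoded.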

Manishearth avatar May 01 '25 18:05 Manishearth

Parsing unvalidated utf8 has been implemented in #295

HalidOdat avatar May 10 '25 18:05 HalidOdat

With unicode-org/icu4x#6577 merged, implementing UTF16 support should mostly be unblocked.

nekevss avatar May 28 '25 16:05 nekevss

Added a mention of Latin-1 to the issue.

Manishearth avatar Jul 01 '25 15:07 Manishearth

So are you thinking that instead of from_utf8, we rename to from_latin1?

I was sort of leaning towards keeping from_utf8 since the validation for values is ASCII anyways and would cause less confusion from the native Rust side of things.

But I'm also open to other alternatives:

  • from_bytes
  • from_latin1
  • from_ascii_bytes
  • from_utf8 (same)

nekevss avatar Jul 01 '25 15:07 nekevss

> So are you thinking that instead of from_utf8, we rename to from_latin1?

No, we'd have both, like ICU4X.

But yes, since it's all ASCII anyway, the point might be moot.
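As an illustration of why the distinction may be moot, here is a small sketch in which the UTF8, Latin-1, and UTF16 entry points all delegate to one parser that is generic over the code unit type. `parse_digits` and its digits-only grammar are hypothetical and chosen for brevity; this is not how ICU4X or temporal_rs structure their parsers. Because only ASCII values are accepted, a Latin-1 byte such as 0xE9 and the UTF8 bytes 0xC3 0xA9 are rejected by the same check.

```rust
/// Parses one or more ASCII digits from a slice of any code unit type.
/// Returns None on empty input, a non-digit unit, or u32 overflow.
fn parse_digits<T: Copy + Into<u32>>(src: &[T]) -> Option<u32> {
    if src.is_empty() {
        return None;
    }
    let mut value: u32 = 0;
    for &unit in src {
        let c: u32 = unit.into();
        // 0x30..=0x39 is '0'..='9' in ASCII, Latin-1, UTF8, and UTF16 alike.
        if !(0x30..=0x39).contains(&c) {
            return None;
        }
        value = value.checked_mul(10)?.checked_add(c - 0x30)?;
    }
    Some(value)
}

/// Unvalidated UTF8 input.
pub fn from_utf8(src: &[u8]) -> Option<u32> {
    parse_digits(src)
}

/// Latin-1 input: identical to the UTF8 path for an ASCII-only grammar.
pub fn from_latin1(src: &[u8]) -> Option<u32> {
    parse_digits(src)
}

/// UTF16 input, parsed without an intermediate conversion.
pub fn from_utf16(src: &[u16]) -> Option<u32> {
    parse_digits(src)
}
```

Keeping separate names would still let callers state which encoding they hold, even if the bodies are identical.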

Manishearth avatar Jul 01 '25 16:07 Manishearth

With #365 merged, this implementation is no longer blocked by ixdtf.

nekevss avatar Jul 04 '25 02:07 nekevss