ACP: macro for UTF-16 literals

Open jmillikin opened this issue 1 year ago • 1 comments

Proposal

Problem statement

UTF-16 is a common text encoding on platforms that adopted Unicode before the invention of UTF-8. C compilers for these platforms have special syntax for UTF-16 string literals. It would be nice if Rust had similar functionality available via macro syntax.

Rust currently only supports UTF-8 literals, and converting them to UTF-16 requires either allocation or the use of a third-party macro crate. This is unergonomic and reduces Rust's competitiveness vs C/C++ in no_std contexts (e.g. DLLs).
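To make the allocation cost concrete, here is a minimal sketch of the run-time route using only `str::encode_utf16` from the standard library. The helper name `to_utf16` is made up for illustration; the point is that the result is a heap-allocated `Vec<u16>`, which cannot be stored in a `const` and is unavailable without an allocator.

```rust
/// Hypothetical helper: converts a UTF-8 string to UTF-16 at run time.
/// The result lives on the heap, so this cannot initialize a `const`.
fn to_utf16(s: &str) -> Vec<u16> {
    // `encode_utf16` yields one or two u16 code units per `char`.
    s.encode_utf16().collect()
}
```

Calling `to_utf16("Hello!")` produces the same six code units as the literal array in the solution sketch, but only after the program has started running.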

Motivating examples or use cases

UTF-16 (or a superset) is the native encoding of Windows and Java, so code working with Win32 or JNI often involves string constants that need to eventually be UTF-16. Some platform interop crates include helper macros for UTF-16 literals, for example windows-sys provides windows_sys::core::w!().

Solution sketch

Define a core::str::utf16!() macro that receives a string literal and produces a [u16; N] containing the UTF-16 representation of its input.

use core::str::utf16;

// equivalent
const HELLO: [u16; 6] = utf16!("Hello!");
const HELLO: [u16; 6] = [0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x21];

// equivalent
const HELLO_REF: &[u16] = &utf16!("Hello!");
const HELLO_REF: &[u16] = &[0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x21];
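For comparison, this is roughly what such a macro could look like when written outside the standard library with `const fn`. It is only a sketch: the names `cont`, `utf16_len`, `utf16_units`, and this `macro_rules!` version of `utf16!` are hypothetical and are not the proposed `core::str::utf16!` implementation. It relies on `&str` always being valid UTF-8, so it does no error handling.

```rust
/// Payload bits of a UTF-8 continuation byte.
const fn cont(b: u8) -> u32 {
    (b & 0x3F) as u32
}

/// Number of UTF-16 code units needed to encode `s`.
pub const fn utf16_len(s: &str) -> usize {
    let bytes = s.as_bytes();
    let (mut i, mut n) = (0, 0);
    while i < bytes.len() {
        let b = bytes[i];
        // The leading byte determines the UTF-8 sequence length.
        if b < 0x80 {
            i += 1;
            n += 1;
        } else if b < 0xE0 {
            i += 2;
            n += 1;
        } else if b < 0xF0 {
            i += 3;
            n += 1;
        } else {
            // Four-byte sequences need a surrogate pair.
            i += 4;
            n += 2;
        }
    }
    n
}

/// Encodes `s` as UTF-16; `N` must equal `utf16_len(s)`.
pub const fn utf16_units<const N: usize>(s: &str) -> [u16; N] {
    let bytes = s.as_bytes();
    let mut out = [0u16; N];
    let (mut i, mut o) = (0, 0);
    while i < bytes.len() {
        let b = bytes[i];
        // Decode one code point and note how many bytes it used.
        let (cp, len) = if b < 0x80 {
            (b as u32, 1)
        } else if b < 0xE0 {
            ((((b & 0x1F) as u32) << 6) | cont(bytes[i + 1]), 2)
        } else if b < 0xF0 {
            ((((b & 0x0F) as u32) << 12) | (cont(bytes[i + 1]) << 6) | cont(bytes[i + 2]), 3)
        } else {
            ((((b & 0x07) as u32) << 18)
                | (cont(bytes[i + 1]) << 12)
                | (cont(bytes[i + 2]) << 6)
                | cont(bytes[i + 3]), 4)
        };
        i += len;
        if cp < 0x10000 {
            out[o] = cp as u16;
            o += 1;
        } else {
            // Split into a high/low surrogate pair.
            let v = cp - 0x10000;
            out[o] = 0xD800 | ((v >> 10) as u16);
            out[o + 1] = 0xDC00 | ((v & 0x3FF) as u16);
            o += 2;
        }
    }
    out
}

/// Hypothetical stand-in for the proposed macro.
macro_rules! utf16 {
    ($s:expr) => {{
        const S: &str = $s;
        const N: usize = utf16_len(S);
        const OUT: [u16; N] = utf16_units::<N>(S);
        OUT
    }};
}

const HELLO: [u16; 6] = utf16!("Hello!");
```

The two-pass shape (count first, then encode) is there because the array length has to be a constant before the array itself can be built.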

I'm uncertain about whether a NUL-terminating variant is justified:

  • Several of the crates linked below include NUL-termination behavior, and Windows APIs that accept strings often expect a terminal NUL.
  • Adding a utf16_z!() or utf16_cstr!() macro seems a bit clunky.
  • Accepting utf16!(c"Hello!") has unclear semantics if the C string contains non-UTF-8 content.
  • Accepting utf16!("Hello!"z) doesn't seem like a clear win vs utf16!("Hello!\0").
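To illustrate the last point, the run-time sketch below shows that writing the terminator in the literal ("Hello!\0") yields the same code units as a dedicated NUL-appending helper. Both `to_utf16_nul` and `has_interior_nul` are hypothetical names used only for this example, and it uses std rather than core for brevity.

```rust
use std::iter;

/// Hypothetical helper: UTF-16 encode `s` and append a terminating NUL.
fn to_utf16_nul(s: &str) -> Vec<u16> {
    s.encode_utf16().chain(iter::once(0)).collect()
}

/// Returns true if a NUL appears anywhere before the final code unit,
/// which would truncate the string for APIs that stop at the first NUL.
fn has_interior_nul(units: &[u16]) -> bool {
    match units.split_last() {
        Some((_last, init)) => init.contains(&0),
        None => false,
    }
}
```

A dedicated variant would mainly earn its keep by rejecting interior NULs at compile time, which a plain "\0" suffix does not do.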

Alternatives

Continue using macros and/or const functions defined in third-party libraries.

Links and related work

A quick survey of crates that provide similar functionality in their public API (often their only purpose):

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

  • We think this problem seems worth solving, and the standard library might be the right place to solve it.
  • We think that this probably doesn't belong in the standard library.

Second, if there's a concrete solution:

  • We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
  • We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.

jmillikin avatar Apr 21 '24 13:04 jmillikin

We do have a couple of macros that are used internally. Note though that they are only intended for internal use.

ChrisDenton avatar Apr 21 '24 16:04 ChrisDenton

I feel like if we add a macro to create a string, we should also have one for CStr / c"lit" to be consistent. (There currently isn't any way to convert a regular Rust string literal to a CStr literal without proc macros)

tgross35 avatar May 21 '24 17:05 tgross35

I think the point of a macro is to work around the lack of a string literal type. It doesn't make sense to have both, even if there's a practical reason they might not be exactly equivalent.

ChrisDenton avatar May 21 '24 18:05 ChrisDenton

@jmillikin for posterity, was there a discussion for rejecting this?

tgross35 avatar Jun 09 '24 20:06 tgross35

@tgross35 They closed all their RFCs and ACPs from what I've seen, so it's not related to this specific ACP.

CryZe avatar Jun 09 '24 21:06 CryZe

I'm just garbage-collecting stale issues / PRs. This suggestion wasn't rejected, and if someone else has time to drive it forward then feel free to re-file.

jmillikin avatar Jun 09 '24 22:06 jmillikin

Why did you consider this stale? It's only been open for two months. Were you expecting an earlier response?

pitaj avatar Jun 09 '24 22:06 pitaj

It's a pretty small ACP, so I figured if there was any interest at all from the libs-team folks then they would have said so.

I think two months is a reasonable timeout.

jmillikin avatar Jun 09 '24 22:06 jmillikin

There are a lot of ACPs and the queue is long. There are years-old ones still getting accepted. Example https://github.com/rust-lang/libs-team/issues/163

pitaj avatar Jun 10 '24 00:06 pitaj

As of today this repository has 95 open ACPs. Resolving all of them in two months' time (i.e. 8–9 weeks) would mean processing 10–12 proposals per weekly meeting on average. This does not seem to be a reasonable workload at all; clearly there is a bottleneck here.

Maybe the ISSUE_TEMPLATE can at least spell out the current throughput, so contributors know at most how long they should expect to wait for a first response. :thinking:

kennytm avatar Jun 10 '24 13:06 kennytm

Personally, I think it's fine to leave this to https://docs.rs/windows-core/latest/windows_core/macro.w.html and friends. Getting it from a Windows crate for interop with Windows APIs and a JNI crate for interop with Java seems quite reasonable, especially if there's no type needed. If different people end up using different macros for [u16]s, that seems completely fine, since they can still talk to each other.

(And often these don't even want UTF-16, they want some kind of YOLO-16 because of historical reasons where it's entirely normal to have unpaired surrogates or whatnot. Which is all the more reason to not have it in the standard library, IMHO.)

scottmcm avatar Jun 10 '24 16:06 scottmcm

> (And often these don't even want UTF-16, they want some kind of YOLO-16 because of historical reasons where it's entirely normal to have unpaired surrogates or whatnot. Which is all the more reason to not have it in the standard library, IMHO.)

At least on Windows, it's entirely abnormal to have unpaired surrogates, to the point that many applications will, at best, be unable to e.g. open a file with such a corrupted file name. But if std did have a type (rather than just str -> [u16]) it'd be fine for it to have weaker guarantees than str.

ChrisDenton avatar Jun 10 '24 17:06 ChrisDenton
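
For readers unfamiliar with the distinction drawn in the last two comments: a [u16] is only well-formed UTF-16 if every surrogate is part of a correctly ordered high/low pair. A minimal sketch of that check using the standard library's `char::decode_utf16` follows; the function name `is_well_formed_utf16` is made up for this example.

```rust
/// Hypothetical helper: true if `units` contains no unpaired surrogates.
fn is_well_formed_utf16(units: &[u16]) -> bool {
    // `decode_utf16` yields Err for each unpaired surrogate it encounters.
    char::decode_utf16(units.iter().copied()).all(|r| r.is_ok())
}
```

Anything produced from a `&str` passes this check by construction, which is why a str -> [u16] macro can promise valid UTF-16 while a general-purpose wide-string type might not.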