icu icon indicating copy to clipboard operation
icu copied to clipboard

ICU-22843 Simplify UTF-16 string literals

Open roubert opened this issue 1 year ago • 10 comments

This is a follow-up to PR #3076. There isn't really any advantage in wrapping these in string views, is there? Using plain string literals for UTF-16 strings just like already done for UTF-8 & ASCII strings makes the code simpler.

Checklist
  • [x] Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22843
  • [x] Required: The PR title must be prefixed with a JIRA Issue number.
  • [x] Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • [x] Required: Each commit message must be prefixed with a JIRA Issue number.
  • [x] Issue accepted (done by Technical Committee after discussion)
  • [ ] Tests included, if applicable
  • [ ] API docs and/or User Guide docs changed or added, if applicable

roubert avatar Aug 14 '24 10:08 roubert

I believe that this changes from compile-time u_strlen() to runtime u_strlen().

Why would it do that and why would UTF-16 string literals be different from all other string literals?

roubert avatar Aug 14 '24 19:08 roubert

Markus,

I believe that this changes from compile-time u_strlen() to runtime u_strlen().

I don't think so.

operator""sv is constexpr, but not consteval (so no guarantee that it must be evaluated at compile time), so I think we are just switching from calling the constexpr operator""sv to calling the constexpr constructor of u16string_view.

eggrobin avatar Aug 14 '24 20:08 eggrobin

(I am assuming here that we are still going through u16string_view via the new overload; if somehow this is going through a UnicodeString constructor, that would be another can of worms. But I think char16_t const* would satisfy the enable_if, meaning we would go through the new constructor.)

eggrobin avatar Aug 14 '24 20:08 eggrobin

Let's check what the compiler actually does. I just wrote this trivial piece of test code:

#include <string_view>

int do_something(std::u16string_view s);

int test1() {
  return do_something(u"foo");
}

int test2() {
  static constexpr char16_t TEXT[] = u"foobar";
  return do_something(TEXT);
}

Then compiled it like this:

$ clang++-18 -std=c++17 -Wall -O3 -S literal.cc

Which resulted in the following code, which is what I had expected (but presumably not what @markusicu had expected):

	.text
	.file	"literal.cc"
	.globl	_Z5test1v                       # -- Begin function _Z5test1v
	.p2align	4, 0x90
	.type	_Z5test1v,@function
_Z5test1v:                              # @_Z5test1v
	.cfi_startproc
# %bb.0:
	leaq	.L.str(%rip), %rsi
	movl	$3, %edi
	jmp	_Z12do_somethingSt17basic_string_viewIDsSt11char_traitsIDsEE@PLT # TAILCALL
.Lfunc_end0:
	.size	_Z5test1v, .Lfunc_end0-_Z5test1v
	.cfi_endproc
                                        # -- End function
	.globl	_Z5test2v                       # -- Begin function _Z5test2v
	.p2align	4, 0x90
	.type	_Z5test2v,@function
_Z5test2v:                              # @_Z5test2v
	.cfi_startproc
# %bb.0:
	leaq	_ZZ5test2vE4TEXT(%rip), %rsi
	movl	$6, %edi
	jmp	_Z12do_somethingSt17basic_string_viewIDsSt11char_traitsIDsEE@PLT # TAILCALL
.Lfunc_end1:
	.size	_Z5test2v, .Lfunc_end1-_Z5test2v
	.cfi_endproc
                                        # -- End function
	.type	.L.str,@object                  # @.str
	.section	.rodata.str2.2,"aMS",@progbits,2
	.p2align	1, 0x0
.L.str:
	.short	102                             # 0x66
	.short	111                             # 0x6f
	.short	111                             # 0x6f
	.short	0                               # 0x0
	.size	.L.str, 8

	.type	_ZZ5test2vE4TEXT,@object        # @_ZZ5test2vE4TEXT
	.section	.rodata,"a",@progbits
	.p2align	1, 0x0
_ZZ5test2vE4TEXT:
	.short	102                             # 0x66
	.short	111                             # 0x6f
	.short	111                             # 0x6f
	.short	98                              # 0x62
	.short	97                              # 0x61
	.short	114                             # 0x72
	.short	0                               # 0x0
	.size	_ZZ5test2vE4TEXT, 14

	.ident	"Debian clang version 18.1.8 (9)"
	.section	".note.GNU-stack","",@progbits
	.addrsig
	.addrsig_sym __gxx_personality_v0
	.addrsig_sym _ZZ5test2vE4TEXT

roubert avatar Aug 14 '24 20:08 roubert

I just wrote this trivial piece of test code:

But icu::UnicodeString is a lot more complex and has a frankly bewildering amount of overloaded constructors and no easily copy-pasteable test case can be constructed for that, but after taking a closer look at what the compiler really does with that code the (for me) unexpected and unwelcome result is that it'll prefer the const char16_t* constructor over the std::u16string_view constructor when being passed a UTF-16 string literal and this will indeed result in the known length of the string literal being thrown away and counted dynamically anew when the constructor just gets a pointer.

So this PR should not be merged in its current state, wrapping these string literals in string views really do have an advantage, forcing the better constructor to be selected,.

Considering that neither I nor @eggrobin realized this, I'd say that this ought to be changed, no documentation in the world will be able to ensure that all ICU4C developers will remember to use this workaround all the time.

I'll look into how the overloaded constructors might be fixed so that the correct one gets picked automatically by default.

roubert avatar Aug 14 '24 21:08 roubert

I am assuming here that we are still going through u16string_view via the new overload;

That was my assumption too, that seemed to be the logical thing to do, I didn't realize that the constructor also had a const char16_t* overload (in hindsight it's obvious that it exists for legacy reasons, but now having overloads for both pointers and views of the same underlying type is unexpected and messy).

roubert avatar Aug 14 '24 21:08 roubert

If there is an explicit overload for a const char16_t *, as in the constructor, then the compiler will prefer that as a perfect match.

If there is no raw-pointer overload, but one for S-convertible-to-string-view, then that's the best match for a raw pointer. Example: operator==()

Either of these overloads is a better match for a raw pointer than a const UnicodeString & overload, because the former don't require a conversion for overload selection. The latter requires a conversion; of course, that one does not work implicitly if the UnicodeString-from-raw-pointer is declared as explicit, which we recommend.

As a result, the simplest way to make sure that one avoids a u_strlen() as much as possible seems to be to consistently use a u"string view literal"sv where the API accepts that.

We can also add more documentation.

It might be possible to remove some of the raw-pointer overloads, but they are long stable, and so removing them seems wrong. We would have to be very sure that removing them does not break any more but the most unusual call sites.

(My recent new overloads broke the use case of a { pointer, length } initializer list, as you saw.)

markusicu avatar Aug 14 '24 23:08 markusicu

the simplest way to make sure that one avoids a u_strlen() as much as possible seems to be to consistently use a u"string view literal"sv where the API accepts that.

We can also add more documentation.

As Fredrik wrote, this is very unusual and thus counterintuitive to people familiar with modern C++; documentation won’t fix that.

It might be possible to remove some of the raw-pointer overloads, but they are long stable, and so removing them seems wrong.

We would have to be very sure that removing them does not break any more but the most unusual call sites.

We could replace them by a thing that takes anything that is convertible to a char16_t pointer but not to a u16string_view, which should catch the things that currently go through that overload but wouldn’t match the new ones.

eggrobin avatar Aug 15 '24 08:08 eggrobin

@markusicu:

(My recent new overloads broke the use case of a { pointer, length } initializer list, as you saw.)

Is that still broken after the switch to templatized overloads? I thought that was because of ambiguity between the std::u16string_view and UnicodeString constructors, but since now we require some actual type that is convertible to std::u16string_view, that should not match the 2-parameter std::u16string_view constructor.

eggrobin avatar Aug 15 '24 08:08 eggrobin

As Fredrik wrote, this is very unusual and thus counterintuitive to people familiar with modern C++; documentation won’t fix that.

As we're now at least two who share this opinion, I've just sent you all PR #3106 for review with a proposal for how the constructors could be changed in a way that won't require any code changes for existing users of the old constructors but will allow the compiler to create string views in fixed time from string literals whenever one is passed to the constructor, making expressions like UnicodeString(u"abc") do what I had assumed that it did when I sent this PR here for review with the "isn't really any advantage" comment.

roubert avatar Aug 15 '24 22:08 roubert

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@roubert ready to un-draft & merge?

markusicu avatar Sep 09 '24 22:09 markusicu

@roubert ready to un-draft & merge?

The CI is still broken, waiting on PR #3144.

roubert avatar Sep 10 '24 12:09 roubert

With PR #3106 and PR #3144 now both successfully merged, work can now proceed here. The changes in this PR now really do what I had thought that they would do when I originally wrote this PR.

roubert avatar Sep 10 '24 16:09 roubert