icu icon indicating copy to clipboard operation
icu copied to clipboard

ICU-22843 Enable constructing UnicodeString from literal in fixed time

Open roubert opened this issue 1 year ago • 9 comments

When passing a string literal to any of the legacy constructors that take just a plain pointer to a UTF-16 string it becomes necessary to iterate through the string to find its length, even though this length was known to the compiler (which just has no way of passing it on to the constructor).

But when calling the new templated string view constructor instead it becomes possible for the compiler to use the known length of a string literal to directly create a string view of the correct size and pass this on to the constructor.

By replacing the legacy constructors with the new constructor this is made the default behaviour.

Because non-templated functions take priority it's necessary to replace the legacy constructors or else the new constructor would never be called for any data type that can implicitly decay into a pointer (and this includes string literals). For the time being, this is gated on U_HIDE_DRAFT_API which will select either the old or the new constructors and even though the old constructors technically get deleted the new constructor is made possible to call in identical ways so that no changes to any call sites are necessary.

There is however a snag there and that's an old promise that nullptr is valid input and identical to the default constructor (ie. the empty string) but this is explicitly not valid input for string views, so this requires a backward compatibility shim to catch any nullptr before any attempt is made to use it to create a string view (which would crash most implementations of the standard library).

Checklist
  • [x] Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22843
  • [x] Required: The PR title must be prefixed with a JIRA Issue number.
  • [x] Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • [x] Required: Each commit message must be prefixed with a JIRA Issue number.
  • [x] Issue accepted (done by Technical Committee after discussion)
  • [ ] Tests included, if applicable
  • [ ] API docs and/or User Guide docs changed or added, if applicable

roubert avatar Aug 15 '24 21:08 roubert

Please take a look at this.

I have intentionally not done anything about the documentation yet, thinking that it'll be a better use of time to first make sure that we all agree that this change should be made at all, and if so, that the implementation is sound.

As can be seen directly in this PR no existing call sites should need to be changed, the updated new templated constructor will accept all existing callers with legacy behaviour kept intact, so I think that the change could be deemed acceptable for the ICU API stability promises. But it must be mentioned that the old constructors really do get removed so in case anyone anywhere has code that does something like taking the address of one of the old constructors then that code will cease to work after this change. As the change is gated on U_HIDE_DRAFT_API there should be ample time to find and address any such unusual usage, if any actually exists at all.

roubert avatar Aug 15 '24 22:08 roubert

Are you planning to do the same treatment for the other functions that have overloads for both raw-pointer and template-S?

I haven't made any such plans for I haven't really looked at those functions at all (so I don't know if there's anything that would need to be done there).

roubert avatar Aug 19 '24 15:08 roubert

Notice: the branch changed across the force-push!

  • icu4c/source/common/unicode/unistr.h is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

PS: Please don't squash until after approval, except when you need to rebase.

markusicu avatar Aug 26 '24 03:08 markusicu

@eggrobin, @richgillam as you were the reviewers of PR #3076, I'd like to know your opinion on these proposed changes here.

roubert avatar Aug 26 '24 13:08 roubert

I'd like to know your opinion on these proposed changes here.

I have opined. Apologies for the delayed review, I was on vacation last week. Other than the points discussed above, this seems like an improvement (and reduces the number of constructors, except for the draft transition period).

eggrobin avatar Aug 26 '24 14:08 eggrobin

Reminder...

Are you planning to do the same treatment for the other functions that have overloads for both raw-pointer and template-S?

I haven't made any such plans for I haven't really looked at those functions at all (so I don't know if there's anything that would need to be done there).

markusicu avatar Aug 29 '24 14:08 markusicu

Reminder...

I just think that it's a good idea to 1) make sure that we all agree that this should be done at all and 2) agree on how exactly to do it, before spending any time on doing any further changes.

roubert avatar Aug 29 '24 16:08 roubert

Suggestion:

  • drop the additional handling of uint16_t in char16ptr.h
  • keep the addition of toU16StringViewNullable
  • I don't feel strongly about keeping or reverting the changes from checking U_SIZEOF_WCHAR_T==2 in the template logic vs. in the preprocessor
  • keep the replacement of (fn(ptr) + fn(const S&) --> fn(const S &)+nullable

Note: I looked at the other overloads that I added, and AFAICT none of them has a sibling overload that just takes a pointer, so this might be it for now.

markusicu avatar Aug 29 '24 21:08 markusicu

@richgillam heads-up...

If there's any reason for why any of this wouldn't work for Apple, now would be a very good time to learn about that.

roubert avatar Aug 30 '24 16:08 roubert

Are you planning to do the same treatment for the other functions that have overloads for both raw-pointer and template-S?

I've now searched through the entire file without finding any such functions.

roubert avatar Sep 03 '24 14:09 roubert

OK, it seems like we've reached some kind of silent consensus here, so I've now updated the documentation comments and made this PR ready to merge (if the proposed API change gets approved by the TC).

roubert avatar Sep 03 '24 14:09 roubert

Notice: the branch changed across the force-push!

  • icu4c/source/common/unicode/char16ptr.h is different
  • icu4c/source/common/unicode/unistr.h is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@richgillam heads-up...

If there's any reason for why any of this wouldn't work for Apple, now would be a very good time to learn about that.

See my comment above. Apple does still define UChar as uint16_t. I tried taking that out a while ago and got all kinds of bug reports. But it sounds like maybe Robin's objections are theoretical, at least in Apple's case, so maybe we leave what Fredrik is doing the way it is?

richgillam avatar Sep 10 '24 22:09 richgillam

See my comment above. Apple does still define UChar as uint16_t. I tried taking that out a while ago and got all kinds of bug reports. But it sounds like maybe Robin's objections are theoretical, at least in Apple's case, so maybe we leave what Fredrik is doing the way it is?

...And now I get to the bottom and am reminded that this was already merged. I lost track-- did we decide to leave the uint_16 stuff in there?

richgillam avatar Sep 10 '24 22:09 richgillam

I lost track-- did we decide to leave the uint_16 stuff in there?

It's in, but https://github.com/unicode-org/icu/pull/3144 makes it conditional, removing it starting with libc++ 18. Making a string_view over uint16_t generates warnings with libc++ 18 and no longer works at all with libc++ 19.

It should continue to work if you use a standard library that still supports std::basic_string_view<uint16_t> -- but the standard does not condone it.

markusicu avatar Sep 10 '24 23:09 markusicu

I lost track-- did we decide to leave the uint_16 stuff in there?

It's in, but #3144 makes it conditional, removing it starting with libc++ 18. Making a string_view over uint16_t generates warnings with libc++ 18 and no longer works at all with libc++ 19.

It should continue to work if you use a standard library that still supports std::basic_string_view<uint16_t> -- but the standard does not condone it.

Okay. Sounds like it's time for me to figure out how to get everybody to change to using char16_t...

richgillam avatar Sep 11 '24 00:09 richgillam