icu icon indicating copy to clipboard operation
icu copied to clipboard

ICU-22876 C++ UnicodeSet/USet easy item iteration

Open markusicu opened this issue 1 year ago • 2 comments

Modern C++ iteration over the code points, ranges, or strings in a C++ UnicodeSet or C USet.

Still draft-y. See NOTEs & TODOs.

  • UnicodeSet.ranges() returns a UnicodeSetRanges which has begin() & end() which return UnicodeSetRangeIterator which should work like an idiomatic C++ iterator.
  • Parallel for UnicodeSet.codePoints() & UnicodeSet.strings().
  • For iterating over all elements (code points & strings) in C++, by popular demand, I am adding begin() & end() directly to class UnicodeSet, similar to Java UnicodeSet which is an Iterable. The header-only wrapper still has separate range and iterator classes.
  • For strings, adding new C API to get strings without copying their contents.
  • Also, UnicodeSet.strings() wants to be called that in C++ just like in Java, and just like codePoints() & ranges(). That then required me to rename the field strings to strings_.
  • As discussed some time ago, this is written as a header-only API.
  • As such, I am defining a U_HEADER_NESTED_NAMESPACE and a U_HEADER_ONLY_NAMESPACE which put the new API into either icu::header when compiling user code or icu::internal when compiling ICU library code, so that even if we instantiate the API inside ICU it won't be the same as what a caller sees.
  • Operates on C API USet, so that C API users should be able to use it. C++ UnicodeSet.ranges() is a trivial convenience function.
  • TODO: Naming: For now, I am working with UnicodeSetXyz class prefixes. This might be a little confusing, since these classes actually operate on a USet, the C wrapper around a C++ UnicodeSet. I am thinking of renaming the type prefixes to just USetXyz. Opinions?
  • Open to name bike-shedding.

Sample real, working call sites:

    UnicodeSet set(UnicodeString(u"[abcçカ🚴]"sv), errorCode);
    for (auto [start, end] : set.ranges()) {
        printf("set.range U+%04lx..U+%04lx\n", (long)start, (long)end);
    }

    UnicodeSet set(UnicodeString(u"[abcçカ🚴{}{abc}{de}]"sv), errorCode);
    for (auto el : set) {
        std::string u8;
        printf("set.string length %ld \"%s\"\n", (long)el.length(), el.toUTF8String(u8).c_str());
    }

Output:

   utility {
      UnicodeSetTest {
         TestUnicodeSetRanges {
set.range U+0061..U+0063
set.range U+00e7..U+00e7
set.range U+30ab..U+30ab
set.range U+1f6b4..U+1f6b4
         } OK:   TestUnicodeSetRanges 
      } OK:   UnicodeSetTest 
   } OK:   utility 

   utility {
      UnicodeSetTest {
         TestUnicodeSetElements {
set.string length 1 "a"
set.string length 1 "b"
set.string length 1 "c"
set.string length 1 "ç"
set.string length 1 "カ"
set.string length 2 "🚴"
set.string length 0 ""
set.string length 3 "abc"
set.string length 2 "de"
         } OK:   TestUnicodeSetElements 
      } OK:   UnicodeSetTest 
   } OK:   utility 
Checklist
  • [x] Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22876
  • [x] Required: The PR title must be prefixed with a JIRA Issue number.
  • [x] Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • [x] Required: Each commit message must be prefixed with a JIRA Issue number.
  • [ ] Issue accepted (done by Technical Committee after discussion)
  • [ ] Tests included, if applicable
  • [ ] API docs and/or User Guide docs changed or added, if applicable

markusicu avatar Aug 28 '24 00:08 markusicu

Notice: the branch changed across the force-push!

  • icu4c/source/common/unicode/uset.h is different
  • icu4c/source/common/unicode/uversion.h is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Hi everyone:

  • I think I have a complete set of new APIs now for this purpose, including the popular "iterate over all elements" thing.
  • I have updated the PR description, please read.
  • As noted there, I am considering changing the prefixes of the new classes from UnicodeSetXyz to USetXyz, since they are really operating on a C USet. Ok? (One could counter-argue that in C++ we use UnicodeSet. Or that few people will actually type these class names.)
  • Please take a look at the sample code in usettest.cpp. (No real unit test code yet.) Looks nice and colloquial?
  • Please take a look at the API signatures.
  • Please take a look at the NOTEs and TODOs.
  • @eggrobin I would like to ask you in particular to look at the implementation code as well.

markusicu avatar Aug 30 '24 00:08 markusicu

@richgillam do you care whether the prefix of the new classes, which operate on USet but are in C++ and also created by UnicodeSet, is USetXyz or UnicodeSetXyz? For example,

  • USetCodePointIterator
  • UnicodeSetCodePointIterator

Right now, I am leaning towards the shorter version which seems slightly more accurate. I should really send the approval email today...

@aheninger opinion?

markusicu avatar Aug 30 '24 16:08 markusicu

I used Craig as a sounding board for the class prefix naming question. What tipped him over the top is that the new types are in uset.h, so he would expect them to start with "USet". I like that. Will rename.

markusicu avatar Aug 30 '24 19:08 markusicu

Renamed. This would also allow future additions of UnicodeSetXyz that don't go through the C API, if we ever wanted that.

markusicu avatar Aug 30 '24 20:08 markusicu

@richgillam do you care whether the prefix of the new classes, which operate on USet but are in C++ and also created by UnicodeSet, is USetXyz or UnicodeSetXyz? For example,

  • USetCodePointIterator
  • UnicodeSetCodePointIterator

Yes, I like USet here, for the reasons you outline.

richgillam avatar Aug 30 '24 21:08 richgillam

Wow, that's a lot of code to do something that seems like it should be fairly simple!

I agree! It's all because C++ didn't create a real iterator pattern and instead stuck with implementing iterators that look and quack like pointers.

markusicu avatar Aug 30 '24 22:08 markusicu

@eggrobin FYI commit ed5e8e30513b8d7d9071047e28681af8abb0ba57 addresses the ICU-TC API proposal review

markusicu avatar Sep 06 '24 21:09 markusicu

I think I am all done. It seems to all work, the API reflects what has been approved, with one minor exception noted in the PR description. API docs all there, although minimal for boilerplate C++ idioms. Tested via range-based for loops as well as explicit calls of each new function.

It might still make sense to review this one commit at a time. Once approved I will squash, then merge.

markusicu avatar Sep 09 '24 21:09 markusicu

Dear reviewers, it would be lovely to get this reviewed, feedback addressed, approved, and merged... fairly quickly, to unblock @roubert's follow-up work with the header-only collator predicates. I would be happy to jump into a video meeting to talk things through.

markusicu avatar Sep 10 '24 17:09 markusicu

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot