ICU-22876 C++ UnicodeSet/USet easy item iteration
Modern C++ iteration over the code points, ranges, or strings in a C++ UnicodeSet or C USet.
Still draft-y. See NOTEs & TODOs.
- UnicodeSet.ranges() returns a UnicodeSetRanges which has begin() & end() which return UnicodeSetRangeIterator which should work like an idiomatic C++ iterator.
- Parallel for UnicodeSet.codePoints() & UnicodeSet.strings().
- For iterating over all elements (code points & strings) in C++, by popular demand, I am adding begin() & end() directly to class UnicodeSet, similar to Java UnicodeSet which is an Iterable. The header-only wrapper still has separate range and iterator classes.
- For strings, adding new C API to get strings without copying their contents.
- Also, UnicodeSet.strings() wants to be called that in C++ just like in Java, and just like codePoints() & ranges(). That then required me to rename the field strings to strings_.
- As discussed some time ago, this is written as a header-only API.
- As such, I am defining a U_HEADER_NESTED_NAMESPACE and a U_HEADER_ONLY_NAMESPACE which put the new API into either icu::header when compiling user code or icu::internal when compiling ICU library code, so that even if we instantiate the API inside ICU it won't be the same as what a caller sees.
- Operates on C API USet, so that C API users should be able to use it. C++ UnicodeSet.ranges() is a trivial convenience function.
- TODO: Naming: For now, I am working with UnicodeSetXyz class prefixes. This might be a little confusing, since these classes actually operate on a USet, the C wrapper around a C++ UnicodeSet. I am thinking of renaming the type prefixes to just USetXyz. Opinions?
- Open to name bike-shedding.
Sample real, working call sites:
UnicodeSet set(UnicodeString(u"[abcçカ🚴]"sv), errorCode);
for (auto [start, end] : set.ranges()) {
printf("set.range U+%04lx..U+%04lx\n", (long)start, (long)end);
}
UnicodeSet set(UnicodeString(u"[abcçカ🚴{}{abc}{de}]"sv), errorCode);
for (auto el : set) {
std::string u8;
printf("set.string length %ld \"%s\"\n", (long)el.length(), el.toUTF8String(u8).c_str());
}
Output:
utility {
UnicodeSetTest {
TestUnicodeSetRanges {
set.range U+0061..U+0063
set.range U+00e7..U+00e7
set.range U+30ab..U+30ab
set.range U+1f6b4..U+1f6b4
} OK: TestUnicodeSetRanges
} OK: UnicodeSetTest
} OK: utility
utility {
UnicodeSetTest {
TestUnicodeSetElements {
set.string length 1 "a"
set.string length 1 "b"
set.string length 1 "c"
set.string length 1 "ç"
set.string length 1 "カ"
set.string length 2 "🚴"
set.string length 0 ""
set.string length 3 "abc"
set.string length 2 "de"
} OK: TestUnicodeSetElements
} OK: UnicodeSetTest
} OK: utility
Checklist
- [x] Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22876
- [x] Required: The PR title must be prefixed with a JIRA Issue number.
- [x] Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
- [x] Required: Each commit message must be prefixed with a JIRA Issue number.
- [ ] Issue accepted (done by Technical Committee after discussion)
- [ ] Tests included, if applicable
- [ ] API docs and/or User Guide docs changed or added, if applicable
Notice: the branch changed across the force-push!
- icu4c/source/common/unicode/uset.h is different
- icu4c/source/common/unicode/uversion.h is different
~ Your Friendly Jira-GitHub PR Checker Bot
Hi everyone:
- I think I have a complete set of new APIs now for this purpose, including the popular "iterate over all elements" thing.
- I have updated the PR description, please read.
- As noted there, I am considering changing the prefixes of the new classes from UnicodeSetXyz to USetXyz, since they are really operating on a C USet. Ok? (One could counter-argue that in C++ we use UnicodeSet. Or that few people will actually type these class names.)
- Please take a look at the sample code in usettest.cpp. (No real unit test code yet.) Looks nice and colloquial?
- Please take a look at the API signatures.
- Please take a look at the NOTEs and TODOs.
- @eggrobin I would like to ask you in particular to look at the implementation code as well.
@richgillam do you care whether the prefix of the new classes, which operate on USet but are in C++ and also created by UnicodeSet, is USetXyz or UnicodeSetXyz? For example,
- USetCodePointIterator
- UnicodeSetCodePointIterator
Right now, I am leaning towards the shorter version which seems slightly more accurate. I should really send the approval email today...
@aheninger opinion?
I used Craig as a sounding board for the class prefix naming question. What tipped him over the top is that the new types are in uset.h, so he would expect them to start with "USet". I like that. Will rename.
Renamed. This would also allow future additions of UnicodeSetXyz that don't go through the C API, if we ever wanted that.
@richgillam do you care whether the prefix of the new classes, which operate on USet but are in C++ and also created by UnicodeSet, is USetXyz or UnicodeSetXyz? For example,
- USetCodePointIterator
- UnicodeSetCodePointIterator
Yes, I like USet here, for the reasons you outline.
Wow, that's a lot of code to do something that seems like it should be fairly simple!
I agree! It's all because C++ didn't create a real iterator pattern and instead stuck with implementing iterators that look and quack like pointers.
@eggrobin FYI commit ed5e8e30513b8d7d9071047e28681af8abb0ba57 addresses the ICU-TC API proposal review
I think I am all done. It seems to all work, the API reflects what has been approved, with one minor exception noted in the PR description. API docs all there, although minimal for boilerplate C++ idioms. Tested via range-based for loops as well as explicit calls of each new function.
It might still make sense to review this one commit at a time. Once approved I will squash, then merge.
Dear reviewers, it would be lovely to get this reviewed, feedback addressed, approved, and merged... fairly quickly, to unblock @roubert's follow-up work with the header-only collator predicates. I would be happy to jump into a video meeting to talk things through.
Hooray! The files in the branch are the same across the force-push. 😃
~ Your Friendly Jira-GitHub PR Checker Bot