ICU-13219 add -u-dx support to BreakIterator
Checklist
- [X] Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-13219
- [X] Required: The PR title must be prefixed with a JIRA Issue number.
- [X] Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
- [X] Required: Each commit message must be prefixed with a JIRA Issue number.
- [X] Issue accepted (done by Technical Committee after discussion)
- [X] Tests included, if applicable
- [ ] API docs and/or User Guide docs changed or added, if applicable
@srl295 @eggrobin could you look at the unit test and see does that fit what you understand about DX ?
@srl295 @eggrobin could you look at the unit test and see does that fit what you understand about DX ?
Suggestion: change the title to -u-dx
I will have to look at the test cases a bit more but it seems like it could work.
Did you see my pr https://github.com/unicode-org/icu/pull/2676 which has a test case from a minority language?
Did you see my pr #2676 which has a test case from a minority language?
The tricky part will not be the behavior of break within a script, but in the boundary with another characters or between two script inside a -u-dx. For example, let's say we have -u-dx-thai-laoo and we have a run of text in thai and lao script and number without any spaces, would there any break in that run of text? or shoudl it beak in the boundary with number, or break in in the spot between the lao script and the thai script? or none at a all.
Notice: the branch changed across the force-push!
- icu4c/source/common/brkiter.cpp is different
- icu4c/source/common/rbbi_cache.cpp is different
- icu4c/source/common/rbbi.cpp is different
- icu4c/source/common/unicode/rbbi.h is different
- icu4j/main/core/src/main/java/com/ibm/icu/text/BreakIterator.java is now changed in the branch
- icu4j/main/core/src/main/java/com/ibm/icu/text/BreakIteratorFactory.java is now changed in the branch
- icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is now changed in the branch
- icu4j/main/core/src/test/java/com/ibm/icu/dev/test/rbbi/RBBITest.java is now changed in the branch
~ Your Friendly Jira-GitHub PR Checker Bot
Notice: the branch changed across the force-push!
- icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is different
~ Your Friendly Jira-GitHub PR Checker Bot
Notice: the branch changed across the force-push!
- icu4c/source/common/brkiter.cpp is different
- icu4c/source/common/unicode/brkiter.h is now changed in the branch
~ Your Friendly Jira-GitHub PR Checker Bot
Notice: the branch changed across the force-push!
- icu4c/source/common/rbbi_cache.cpp is different
- icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is different
~ Your Friendly Jira-GitHub PR Checker Bot
Did you see my pr #2676 which has a test case from a minority language?
I try your proposed diff of icu4c/source/test/testdata/rbbitst.txt and below is the error I got in my PR. Is your expectation "correct"?
=== Handling test: rbbi/RBBITest/TestExtended: ===
rbbi {
RBBITest {
TestExtended {
code alpha extend alphanum type word sent line name
------------------------------------------------ 0
e42 1 0 1 Lo XX LE SA THAI CHARACTER SARA O
e2d 1 0 1 Lo XX LE SA THAI CHARACTER O ANG
e4d 1 1 0 Mn Extend EX SA THAI CHARACTER NIKHAHIT
e19 1 0 1 Lo XX LE SA THAI CHARACTER NO NU
------------------------------------------------ 4
20 0 0 0 Zs WSegSpace SP SP SPACE
e2d 1 0 1 Lo XX LE SA THAI CHARACTER O ANG
e30 1 0 1 Lo XX LE SA THAI CHARACTER SARA A
e44 1 0 1 Lo XX LE SA THAI CHARACTER SARA AI MAIMALAI
e1b 1 0 1 Lo XX LE SA THAI CHARACTER PO PLA
20 0 0 0 Zs WSegSpace SP SP SPACE
e08 1 0 1 Lo XX LE SA THAI CHARACTER CHO CHAN
e39 1 1 0 Mn Extend EX SA THAI CHARACTER SARA UU
e48 0 1 0 Mn Extend EX SA THAI CHARACTER MAI EK
e27 1 0 1 Lo XX LE SA THAI CHARACTER WO WAEN
e32 1 0 1 Lo XX LE SA THAI CHARACTER SARA AA
e21 1 0 1 Lo XX LE SA THAI CHARACTER MO MA
20 0 0 0 Zs WSegSpace SP SP SPACE
e42 1 0 1 Lo XX LE SA THAI CHARACTER SARA O
e25 1 0 1 Lo XX LE SA THAI CHARACTER LO LING
e48 0 1 0 Mn Extend EX SA THAI CHARACTER MAI EK
e19 1 0 1 Lo XX LE SA THAI CHARACTER NO NU
Forward Iteration, break expected, but not found. Pos= 4 File line,col= 1538, 13
code alpha extend alphanum type word sent line name
------------------------------------------------ 0
e42 1 0 1 Lo XX LE SA THAI CHARACTER SARA O
e2d 1 0 1 Lo XX LE SA THAI CHARACTER O ANG
e4d 1 1 0 Mn Extend EX SA THAI CHARACTER NIKHAHIT
e19 1 0 1 Lo XX LE SA THAI CHARACTER NO NU
20 0 0 0 Zs WSegSpace SP SP SPACE
------------------------------------------------ 5
e2d 1 0 1 Lo XX LE SA THAI CHARACTER O ANG
e30 1 0 1 Lo XX LE SA THAI CHARACTER SARA A
e44 1 0 1 Lo XX LE SA THAI CHARACTER SARA AI MAIMALAI
e1b 1 0 1 Lo XX LE SA THAI CHARACTER PO PLA
20 0 0 0 Zs WSegSpace SP SP SPACE
e08 1 0 1 Lo XX LE SA THAI CHARACTER CHO CHAN
e39 1 1 0 Mn Extend EX SA THAI CHARACTER SARA UU
e48 0 1 0 Mn Extend EX SA THAI CHARACTER MAI EK
e27 1 0 1 Lo XX LE SA THAI CHARACTER WO WAEN
e32 1 0 1 Lo XX LE SA THAI CHARACTER SARA AA
e21 1 0 1 Lo XX LE SA THAI CHARACTER MO MA
20 0 0 0 Zs WSegSpace SP SP SPACE
e42 1 0 1 Lo XX LE SA THAI CHARACTER SARA O
e25 1 0 1 Lo XX LE SA THAI CHARACTER LO LING
e48 0 1 0 Mn Extend EX SA THAI CHARACTER MAI EK
e19 1 0 1 Lo XX LE SA THAI CHARACTER NO NU
Forward Iteration, break found, but not expected. Pos= 5 File line,col= 1538, 15
Reverse Itertion, break found, but not expected. Pos= 5 File line,col= 1538, 15
Reverse Iteration, break expected, but not found. Pos= 4 File line,col= 1538, 13
isBoundary(4) incorrect. File line,col= 1538, 13
Expected, Actual= true, false
isBoundary(5) incorrect. File line,col= 1538, 15
Expected, Actual= false, true
following(0) incorrect. File line,col= 1538, 8
Expected, Actual= 4, 5
following(1) incorrect. File line,col= 1538, 10
Expected, Actual= 4, 5
following(2) incorrect. File line,col= 1538, 11
Expected, Actual= 4, 5
following(3) incorrect. File line,col= 1538, 12
Expected, Actual= 4, 5
following(4) incorrect. File line,col= 1538, 13
Expected, Actual= 10, 5
preceding(10) incorrect. File line,col= 1538, 20
Expected, Actual= 4, 5
preceding(9) incorrect. File line,col= 1538, 19
Expected, Actual= 4, 5
preceding(8) incorrect. File line,col= 1538, 18
Expected, Actual= 4, 5
preceding(7) incorrect. File line,col= 1538, 17
Expected, Actual= 4, 5
preceding(6) incorrect. File line,col= 1538, 16
Expected, Actual= 4, 5
preceding(5) incorrect. File line,col= 1538, 15
Expected, Actual= 4, 0
code alpha extend alphanum type word sent line name
------------------------------------------------ 0
e42 1 0 1 Lo XX LE SA THAI CHARACTER SARA O
e2d 1 0 1 Lo XX LE SA THAI CHARACTER O ANG
e4d 1 1 0 Mn Extend EX SA THAI CHARACTER NIKHAHIT
e19 1 0 1 Lo XX LE SA THAI CHARACTER NO NU
20 0 0 0 Zs WSegSpace SP SP SPACE
e2d 1 0 1 Lo XX LE SA THAI CHARACTER O ANG
e30 1 0 1 Lo XX LE SA THAI CHARACTER SARA A
e44 1 0 1 Lo XX LE SA THAI CHARACTER SARA AI MAIMALAI
e1b 1 0 1 Lo XX LE SA THAI CHARACTER PO PLA
20 0 0 0 Zs WSegSpace SP SP SPACE
e08 1 0 1 Lo XX LE SA THAI CHARACTER CHO CHAN
e39 1 1 0 Mn Extend EX SA THAI CHARACTER SARA UU
------------------------------------------------ 12
e48 0 1 0 Mn Extend EX SA THAI CHARACTER MAI EK
e27 1 0 1 Lo XX LE SA THAI CHARACTER WO WAEN
e32 1 0 1 Lo XX LE SA THAI CHARACTER SARA AA
e21 1 0 1 Lo XX LE SA THAI CHARACTER MO MA
20 0 0 0 Zs WSegSpace SP SP SPACE
e42 1 0 1 Lo XX LE SA THAI CHARACTER SARA O
e25 1 0 1 Lo XX LE SA THAI CHARACTER LO LING
e48 0 1 0 Mn Extend EX SA THAI CHARACTER MAI EK
e19 1 0 1 Lo XX LE SA THAI CHARACTER NO NU
Forward Iteration, break expected, but not found. Pos= 12 File line,col= 1538, 13
code alpha extend alphanum type word sent line name
------------------------------------------------ 0
e42 1 0 1 Lo XX LE SA THAI CHARACTER SARA O
e2d 1 0 1 Lo XX LE SA THAI CHARACTER O ANG
e4d 1 1 0 Mn Extend EX SA THAI CHARACTER NIKHAHIT
e19 1 0 1 Lo XX LE SA THAI CHARACTER NO NU
20 0 0 0 Zs WSegSpace SP SP SPACE
------------------------------------------------ 13
e2d 1 0 1 Lo XX LE SA THAI CHARACTER O ANG
e30 1 0 1 Lo XX LE SA THAI CHARACTER SARA A
e44 1 0 1 Lo XX LE SA THAI CHARACTER SARA AI MAIMALAI
e1b 1 0 1 Lo XX LE SA THAI CHARACTER PO PLA
20 0 0 0 Zs WSegSpace SP SP SPACE
e08 1 0 1 Lo XX LE SA THAI CHARACTER CHO CHAN
e39 1 1 0 Mn Extend EX SA THAI CHARACTER SARA UU
e48 0 1 0 Mn Extend EX SA THAI CHARACTER MAI EK
e27 1 0 1 Lo XX LE SA THAI CHARACTER WO WAEN
e32 1 0 1 Lo XX LE SA THAI CHARACTER SARA AA
e21 1 0 1 Lo XX LE SA THAI CHARACTER MO MA
20 0 0 0 Zs WSegSpace SP SP SPACE
e42 1 0 1 Lo XX LE SA THAI CHARACTER SARA O
e25 1 0 1 Lo XX LE SA THAI CHARACTER LO LING
e48 0 1 0 Mn Extend EX SA THAI CHARACTER MAI EK
e19 1 0 1 Lo XX LE SA THAI CHARACTER NO NU
Forward Iteration, break found, but not expected. Pos= 13 File line,col= 1538, 15
Reverse Itertion, break found, but not expected. Pos= 13 File line,col= 1538, 15
Reverse Iteration, break expected, but not found. Pos= 12 File line,col= 1538, 13
isBoundary(12) incorrect. File line,col= 1538, 13
Expected, Actual= true, false
isBoundary(13) incorrect. File line,col= 1538, 15
Expected, Actual= false, true
following(0) incorrect. File line,col= 1538, 8
Expected, Actual= 12, 13
following(1) incorrect. File line,col= 1538, 8
Expected, Actual= 12, 13
following(2) incorrect. File line,col= 1538, 8
Expected, Actual= 12, 13
following(3) incorrect. File line,col= 1538, 10
Expected, Actual= 12, 13
following(4) incorrect. File line,col= 1538, 10
Expected, Actual= 12, 13
following(5) incorrect. File line,col= 1538, 10
Expected, Actual= 12, 13
following(6) incorrect. File line,col= 1538, 11
Expected, Actual= 12, 13
following(7) incorrect. File line,col= 1538, 11
Expected, Actual= 12, 13
following(8) incorrect. File line,col= 1538, 11
Expected, Actual= 12, 13
following(9) incorrect. File line,col= 1538, 12
Expected, Actual= 12, 13
following(10) incorrect. File line,col= 1538, 12
Expected, Actual= 12, 13
following(11) incorrect. File line,col= 1538, 12
Expected, Actual= 12, 13
following(12) incorrect. File line,col= 1538, 13
Expected, Actual= 26, 13
preceding(28) incorrect. File line,col= 1538, 20
Expected, Actual= 12, 13
preceding(27) incorrect. File line,col= 1538, 20
Expected, Actual= 12, 13
preceding(26) incorrect. File line,col= 1538, 20
Expected, Actual= 12, 13
preceding(25) incorrect. File line,col= 1538, 19
Expected, Actual= 12, 13
preceding(24) incorrect. File line,col= 1538, 18
Expected, Actual= 12, 13
preceding(23) incorrect. File line,col= 1538, 18
Expected, Actual= 12, 13
preceding(22) incorrect. File line,col= 1538, 18
Expected, Actual= 12, 13
preceding(21) incorrect. File line,col= 1538, 17
Expected, Actual= 12, 13
preceding(20) incorrect. File line,col= 1538, 17
Expected, Actual= 12, 13
preceding(19) incorrect. File line,col= 1538, 17
Expected, Actual= 12, 13
preceding(18) incorrect. File line,col= 1538, 16
Expected, Actual= 12, 13
preceding(17) incorrect. File line,col= 1538, 16
Expected, Actual= 12, 13
preceding(16) incorrect. File line,col= 1538, 16
Expected, Actual= 12, 13
preceding(15) incorrect. File line,col= 1538, 15
Expected, Actual= 12, 0
preceding(14) incorrect. File line,col= 1538, 15
Expected, Actual= 12, 0
preceding(13) incorrect. File line,col= 1538, 15
Expected, Actual= 12, 0
} ERRORS (52) in TestExtended (38ms)
} ERRORS (52) in RBBITest (38ms)
} ERRORS (52) in rbbi (38ms)
--------------------------------------
Errors in total: 52.
TestExtended
RBBITest
rbbi
--------------------------------------
You diff actually only add one test case
<data>•โอํน• อะไป •จู่วาม •โล่น•</data>
for line break but I think it is incorrect why should the line break happen before the space ? It should be
<data>•โอํน •อะไป •จู่วาม •โล่น•</data>
instead, right?
Looking https://github.com/unicode-org/icu/pull/2676/files#diff-b177067bbc1df57fc40ae7629a81e8df960899b9088555b010680a1c500943e2 Also, the line
<line>
# Should no longer break at the dictionary points - it's not Thai language
...
#<data>•โอํน• •อะไป• •จู่วาม• •โล่น• •เปี่ยร• •อะลู่วาง• •แมะ,• •ปาย• •อัน• •แบ็จ• •อะโจํน• •ซา• •เมาะ.• •อัน• •ฮะบืน• •ตะ• •เวี่ยะ• •ตะ• •งี่ยาน,• •อัน• •ฮะบืน• •อีว• •อะปายฮ.•</data>
should be
<line>
# Should no longer break at the dictionary points - it's not Thai language
...
<data>•โอํน •อะไป •จู่วาม •โล่น •เปี่ยร •อะลู่วาง •แมะ, •ปาย •อัน •แบ็จ •อะโจํน •ซา •เมาะ. •อัน •ฮะบืน •ตะ •เวี่ยะ •ตะ •งี่ยาน, •อัน •ฮะบืน •อีว •อะปายฮ.•</data>
there are no reason to have a line break before the space. Line break should only happen after the SPACE not before the SPACE, right?
Notice: the branch changed across the force-push!
- icu4c/source/test/testdata/rbbitst.txt is now changed in the branch
- icu4j/main/core/src/test/resources/com/ibm/icu/dev/test/rbbi/rbbitst.txt is now changed in the branch
~ Your Friendly Jira-GitHub PR Checker Bot
@srl295 I copy your test change over but change it. Please read my modified version in this PR and see do you agree with that. The change are
- for line break, there should have no line break before the SPACE
- for word break, the status should be 200 not 0
- for word break, we should break beefore . and , if we treat the Thai as AL.
- not using dx=zyyyy . That part of spec is very bad. I file bug https://unicode-org.atlassian.net/browse/CLDR-17247 for that. I do not think we should implement that behavior. It is clearly a spec bug from my point of view.
Notice: the branch changed across the force-push!
- icu4c/source/common/rbbi.cpp is different
- icu4c/source/test/testdata/rbbitst.txt is different
- icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is different
- icu4j/main/core/src/test/resources/com/ibm/icu/dev/test/rbbi/rbbitst.txt is different
~ Your Friendly Jira-GitHub PR Checker Bot
Notice: the branch changed across the force-push!
- icu4c/source/common/rbbi.cpp is different
- icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is different
~ Your Friendly Jira-GitHub PR Checker Bot
Notice: the branch changed across the force-push!
- icu4c/source/common/brkiter.cpp is different
- icu4c/source/common/rbbi.cpp is different
~ Your Friendly Jira-GitHub PR Checker Bot
Let me try to be clearer.
Suppose that
- The dictionary breakIterator will act on any characters marked A or B below, but will skip over C and D.
- The dx vales need to cause the characters B and C to be skipped, but has no effect on characters A and D.
AAABBBCCCDDD
What should happen is that the dictionary break iterator should act on the characters AAA, and otherwise the RBNF rules will act on BBBCCCDDD.
From what I see of your code change, at the first A character, the dictionary breakIterator accepts it, and dx doesn't exclude it. So the dictionary's break iterator gets called. That seems clear.
What is not clear to me is how lbe.findBreaks knows to stop at the first B, because the break iterator internally has no access to the dx exclusion set, and there isn't any other change in your PR that would indicate some way that the iterator's results past the first B would be ignored.
Let me try to be clearer.
Suppose that
- The dictionary breakIterator will act on any characters marked A or B below, but will skip over C and D.
- The dx vales need to cause the characters B and C to be skipped, but has no effect on characters A and D.
AAABBBCCCDDD
What should happen is that the dictionary break iterator should act on the characters AAA, and otherwise the RBNF rules will act on BBBCCCDDD.
From what I see of your code change, at the first A character, the dictionary breakIterator accepts it, and dx doesn't exclude it. So the dictionary's break iterator gets called. That seems clear.
What is not clear to me is how lbe.findBreaks knows to stop at the first B, because the break iterator internally has no access to the dx exclusion set, and there isn't any other change in your PR that would indicate some way that the iterator's results past the first B would be ignored.
I see. ok, you are right, that is not clear. I need to call excludedFromDictionaryBreak to adjust the startRange and endRange before passing to the findBreaks. the startRange and endRange pass to the findBreaks may need to be changed to a different values excluding these characters.
@FrankYFTang my apologies, I have not found spare time to review this recently. i will let @mhosken in case he's able to review.
it seems it's going in a good direction… certainly feel free to amend my test as as needed (it was meant as an example not as proscriptive) and close the other PR…
Np, your feedback is great for catching problems!
On Wed, Dec 6, 2023 at 1:16 PM Frank Yung-Fong Tang < @.***> wrote:
@.**** commented on this pull request.
In icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java https://github.com/unicode-org/icu/pull/2702#discussion_r1417986914:
String dxs) {
if (dxs == null) {return null;}if (dxs.length() % 5 != 4) {throw new IllegalArgumentException("Incorrect value for dx key: " + dxs);}// Change from "thai" to "[[:scx=thai:]]" or "thai-arab" to "[[:scx=thai:][:scx=arab:]]"StringBuilder builder = new StringBuilder("[");int items = 1 + (dxs.length() / 5);for (int i = 0; i < items; i++) {if (i > 0 && dxs.charAt(i*5-1) != '-') {throw new IllegalArgumentException("Incorrect value for dx key: " + dxs);}String script = dxs.substring(i*5, i*5+4);// Special handling of zyyyoh. Just saw what you landed in https://github.com/unicode-org/cldr/pull/3411/files that make sense. Sorry. Ignore my previous comments.
— Reply to this email directly, view it on GitHub https://github.com/unicode-org/icu/pull/2702#discussion_r1417986914, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACJLEMFIZ2UB57SAEBL2FWDYIDOEBAVCNFSM6AAAAAA7LXMYCKVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMYTONRYGYZTKMRTGM . You are receiving this because you were assigned.Message ID: @.***>
Please ignore my update. I am still working on this PR. It is not ready for review. After Mark point out some issue, I found my design was wrong and need a more intensify rework.
Notice: the branch changed across the force-push!
- icu4c/source/common/rbbi.cpp is different
- icu4c/source/test/intltest/rbbitst.cpp is different
- icu4c/source/test/intltest/rbbitst.h is different
- icu4j/main/core/src/main/java/com/ibm/icu/impl/breakiter/CjkBreakEngine.java is now changed in the branch
- icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is different
- icu4j/main/core/src/test/java/com/ibm/icu/dev/test/rbbi/RBBITest.java is different
~ Your Friendly Jira-GitHub PR Checker Bot
I'm getting user complaints again on this. Can we action this. Some fix for disabling dictionary breaking has been requested since, well I can't find out since I can't get to the old bug tracker, but it came into the latest tracker in 2019.
Perhaps the best is the enemy of the good here? The only people, that I know of, that are affected by this are those using minority languages, are inserting ZWSP for word breaks and are dealing with correctly tagged text. Do we have to refine this fix for the non use cases as well, before we can fix for the actual use case?
I'm sorry that my frustration is showing. But we seem to be more concerned about people who do the wrong thing than those who do the right thing (and tag correctly, by some definition). The really correct solution is that if the text is not tagged with the language of the dictionary, then no dictionary breaking should occur. I realise that that is just too much for most people and so we have special tagging. But can we please get something out for these users who are able to tag correctly?
Please shipit already.
@FrankYFTang is this going to be merged for 75?
no, the issue is more complicated than my PR did.
no, the issue is more complicated than my PR did.
Do you have more detail?
no, the issue is more complicated than my PR did.
Do you have more detail?
It require more detail analysis and testings for different cases than what I put into this PR. I missed some complicated combination.
no, the issue is more complicated than my PR did.
Do you have more detail?
It require more detail analysis and testings for different cases than what I put into this PR. I missed some complicated combination.
Frank, It's opt-in. COuld we consider moving forward with this even if it needs additional work? It shouldn't be a problem for users that don't set -u-dx
sorry, I need to pick up the work again.