icu ICU-13219 add -u-dx support to BreakIterator

Checklist

[X] Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-13219
[X] Required: The PR title must be prefixed with a JIRA Issue number.
[X] Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
[X] Required: Each commit message must be prefixed with a JIRA Issue number.
[X] Issue accepted (done by Technical Committee after discussion)
[X] Tests included, if applicable
[ ] API docs and/or User Guide docs changed or added, if applicable

Nov 15 '23 01:11 FrankYFTang

@srl295 @eggrobin could you look at the unit test and see does that fit what you understand about DX ?

Nov 15 '23 01:11 FrankYFTang

@srl295 @eggrobin could you look at the unit test and see does that fit what you understand about DX ?

Suggestion: change the title to -u-dx

I will have to look at the test cases a bit more but it seems like it could work.

Did you see my pr https://github.com/unicode-org/icu/pull/2676 which has a test case from a minority language?

Nov 15 '23 06:11 srl295

Did you see my pr #2676 which has a test case from a minority language?

The tricky part will not be the behavior of break within a script, but in the boundary with another characters or between two script inside a -u-dx. For example, let's say we have -u-dx-thai-laoo and we have a run of text in thai and lao script and number without any spaces, would there any break in that run of text? or shoudl it beak in the boundary with number, or break in in the spot between the lao script and the thai script? or none at a all.

Nov 15 '23 19:11 FrankYFTang

Notice: the branch changed across the force-push!

icu4c/source/common/brkiter.cpp is different
icu4c/source/common/rbbi_cache.cpp is different
icu4c/source/common/rbbi.cpp is different
icu4c/source/common/unicode/rbbi.h is different
icu4j/main/core/src/main/java/com/ibm/icu/text/BreakIterator.java is now changed in the branch
icu4j/main/core/src/main/java/com/ibm/icu/text/BreakIteratorFactory.java is now changed in the branch
icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is now changed in the branch
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/rbbi/RBBITest.java is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Nov 16 '23 01:11 jira-pull-request-webhook[bot]

Notice: the branch changed across the force-push!

icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Nov 16 '23 02:11 jira-pull-request-webhook[bot]

Notice: the branch changed across the force-push!

icu4c/source/common/brkiter.cpp is different
icu4c/source/common/unicode/brkiter.h is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Nov 16 '23 02:11 jira-pull-request-webhook[bot]

Notice: the branch changed across the force-push!

icu4c/source/common/rbbi_cache.cpp is different
icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Nov 16 '23 03:11 jira-pull-request-webhook[bot]

Did you see my pr #2676 which has a test case from a minority language?

I try your proposed diff of icu4c/source/test/testdata/rbbitst.txt and below is the error I got in my PR. Is your expectation "correct"?

=== Handling test: rbbi/RBBITest/TestExtended: ===
   rbbi {
      RBBITest {
         TestExtended {
code    alpha extend alphanum type word sent line name
------------------------------------------------ 0
    e42     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA O
    e2d     1      0        1   Lo   XX   LE   SA THAI CHARACTER O ANG
    e4d     1      1        0   Mn Extend   EX   SA THAI CHARACTER NIKHAHIT
    e19     1      0        1   Lo   XX   LE   SA THAI CHARACTER NO NU
------------------------------------------------ 4
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e2d     1      0        1   Lo   XX   LE   SA THAI CHARACTER O ANG
    e30     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA A
    e44     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA AI MAIMALAI
    e1b     1      0        1   Lo   XX   LE   SA THAI CHARACTER PO PLA
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e08     1      0        1   Lo   XX   LE   SA THAI CHARACTER CHO CHAN
    e39     1      1        0   Mn Extend   EX   SA THAI CHARACTER SARA UU
    e48     0      1        0   Mn Extend   EX   SA THAI CHARACTER MAI EK
    e27     1      0        1   Lo   XX   LE   SA THAI CHARACTER WO WAEN
    e32     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA AA
    e21     1      0        1   Lo   XX   LE   SA THAI CHARACTER MO MA
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e42     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA O
    e25     1      0        1   Lo   XX   LE   SA THAI CHARACTER LO LING
    e48     0      1        0   Mn Extend   EX   SA THAI CHARACTER MAI EK
    e19     1      0        1   Lo   XX   LE   SA THAI CHARACTER NO NU
         Forward Iteration, break expected, but not found.  Pos=   4  File line,col= 1538,  13
code    alpha extend alphanum type word sent line name
------------------------------------------------ 0
    e42     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA O
    e2d     1      0        1   Lo   XX   LE   SA THAI CHARACTER O ANG
    e4d     1      1        0   Mn Extend   EX   SA THAI CHARACTER NIKHAHIT
    e19     1      0        1   Lo   XX   LE   SA THAI CHARACTER NO NU
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
------------------------------------------------ 5
    e2d     1      0        1   Lo   XX   LE   SA THAI CHARACTER O ANG
    e30     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA A
    e44     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA AI MAIMALAI
    e1b     1      0        1   Lo   XX   LE   SA THAI CHARACTER PO PLA
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e08     1      0        1   Lo   XX   LE   SA THAI CHARACTER CHO CHAN
    e39     1      1        0   Mn Extend   EX   SA THAI CHARACTER SARA UU
    e48     0      1        0   Mn Extend   EX   SA THAI CHARACTER MAI EK
    e27     1      0        1   Lo   XX   LE   SA THAI CHARACTER WO WAEN
    e32     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA AA
    e21     1      0        1   Lo   XX   LE   SA THAI CHARACTER MO MA
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e42     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA O
    e25     1      0        1   Lo   XX   LE   SA THAI CHARACTER LO LING
    e48     0      1        0   Mn Extend   EX   SA THAI CHARACTER MAI EK
    e19     1      0        1   Lo   XX   LE   SA THAI CHARACTER NO NU
         Forward Iteration, break found, but not expected.  Pos=   5  File line,col= 1538,  15
         Reverse Itertion, break found, but not expected.  Pos=   5  File line,col= 1538,  15
         Reverse Iteration, break expected, but not found.  Pos=   4  File line,col= 1538,  13
         isBoundary(4) incorrect. File line,col= 1538,  13
                 Expected, Actual= true, false
         isBoundary(5) incorrect. File line,col= 1538,  15
                 Expected, Actual= false, true
         following(0) incorrect. File line,col= 1538,   8
                 Expected, Actual= 4, 5
         following(1) incorrect. File line,col= 1538,  10
                 Expected, Actual= 4, 5
         following(2) incorrect. File line,col= 1538,  11
                 Expected, Actual= 4, 5
         following(3) incorrect. File line,col= 1538,  12
                 Expected, Actual= 4, 5
         following(4) incorrect. File line,col= 1538,  13
                 Expected, Actual= 10, 5
         preceding(10) incorrect. File line,col= 1538,  20
                 Expected, Actual= 4, 5
         preceding(9) incorrect. File line,col= 1538,  19
                 Expected, Actual= 4, 5
         preceding(8) incorrect. File line,col= 1538,  18
                 Expected, Actual= 4, 5
         preceding(7) incorrect. File line,col= 1538,  17
                 Expected, Actual= 4, 5
         preceding(6) incorrect. File line,col= 1538,  16
                 Expected, Actual= 4, 5
         preceding(5) incorrect. File line,col= 1538,  15
                 Expected, Actual= 4, 0
code    alpha extend alphanum type word sent line name
------------------------------------------------ 0
    e42     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA O
    e2d     1      0        1   Lo   XX   LE   SA THAI CHARACTER O ANG
    e4d     1      1        0   Mn Extend   EX   SA THAI CHARACTER NIKHAHIT
    e19     1      0        1   Lo   XX   LE   SA THAI CHARACTER NO NU
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e2d     1      0        1   Lo   XX   LE   SA THAI CHARACTER O ANG
    e30     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA A
    e44     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA AI MAIMALAI
    e1b     1      0        1   Lo   XX   LE   SA THAI CHARACTER PO PLA
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e08     1      0        1   Lo   XX   LE   SA THAI CHARACTER CHO CHAN
    e39     1      1        0   Mn Extend   EX   SA THAI CHARACTER SARA UU
------------------------------------------------ 12
    e48     0      1        0   Mn Extend   EX   SA THAI CHARACTER MAI EK
    e27     1      0        1   Lo   XX   LE   SA THAI CHARACTER WO WAEN
    e32     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA AA
    e21     1      0        1   Lo   XX   LE   SA THAI CHARACTER MO MA
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e42     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA O
    e25     1      0        1   Lo   XX   LE   SA THAI CHARACTER LO LING
    e48     0      1        0   Mn Extend   EX   SA THAI CHARACTER MAI EK
    e19     1      0        1   Lo   XX   LE   SA THAI CHARACTER NO NU
         Forward Iteration, break expected, but not found.  Pos=  12  File line,col= 1538,  13
code    alpha extend alphanum type word sent line name
------------------------------------------------ 0
    e42     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA O
    e2d     1      0        1   Lo   XX   LE   SA THAI CHARACTER O ANG
    e4d     1      1        0   Mn Extend   EX   SA THAI CHARACTER NIKHAHIT
    e19     1      0        1   Lo   XX   LE   SA THAI CHARACTER NO NU
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
------------------------------------------------ 13
    e2d     1      0        1   Lo   XX   LE   SA THAI CHARACTER O ANG
    e30     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA A
    e44     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA AI MAIMALAI
    e1b     1      0        1   Lo   XX   LE   SA THAI CHARACTER PO PLA
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e08     1      0        1   Lo   XX   LE   SA THAI CHARACTER CHO CHAN
    e39     1      1        0   Mn Extend   EX   SA THAI CHARACTER SARA UU
    e48     0      1        0   Mn Extend   EX   SA THAI CHARACTER MAI EK
    e27     1      0        1   Lo   XX   LE   SA THAI CHARACTER WO WAEN
    e32     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA AA
    e21     1      0        1   Lo   XX   LE   SA THAI CHARACTER MO MA
     20     0      0        0   Zs WSegSpace   SP   SP SPACE
    e42     1      0        1   Lo   XX   LE   SA THAI CHARACTER SARA O
    e25     1      0        1   Lo   XX   LE   SA THAI CHARACTER LO LING
    e48     0      1        0   Mn Extend   EX   SA THAI CHARACTER MAI EK
    e19     1      0        1   Lo   XX   LE   SA THAI CHARACTER NO NU
         Forward Iteration, break found, but not expected.  Pos=  13  File line,col= 1538,  15
         Reverse Itertion, break found, but not expected.  Pos=  13  File line,col= 1538,  15
         Reverse Iteration, break expected, but not found.  Pos=  12  File line,col= 1538,  13
         isBoundary(12) incorrect. File line,col= 1538,  13
                 Expected, Actual= true, false
         isBoundary(13) incorrect. File line,col= 1538,  15
                 Expected, Actual= false, true
         following(0) incorrect. File line,col= 1538,   8
                 Expected, Actual= 12, 13
         following(1) incorrect. File line,col= 1538,   8
                 Expected, Actual= 12, 13
         following(2) incorrect. File line,col= 1538,   8
                 Expected, Actual= 12, 13
         following(3) incorrect. File line,col= 1538,  10
                 Expected, Actual= 12, 13
         following(4) incorrect. File line,col= 1538,  10
                 Expected, Actual= 12, 13
         following(5) incorrect. File line,col= 1538,  10
                 Expected, Actual= 12, 13
         following(6) incorrect. File line,col= 1538,  11
                 Expected, Actual= 12, 13
         following(7) incorrect. File line,col= 1538,  11
                 Expected, Actual= 12, 13
         following(8) incorrect. File line,col= 1538,  11
                 Expected, Actual= 12, 13
         following(9) incorrect. File line,col= 1538,  12
                 Expected, Actual= 12, 13
         following(10) incorrect. File line,col= 1538,  12
                 Expected, Actual= 12, 13
         following(11) incorrect. File line,col= 1538,  12
                 Expected, Actual= 12, 13
         following(12) incorrect. File line,col= 1538,  13
                 Expected, Actual= 26, 13
         preceding(28) incorrect. File line,col= 1538,  20
                 Expected, Actual= 12, 13
         preceding(27) incorrect. File line,col= 1538,  20
                 Expected, Actual= 12, 13
         preceding(26) incorrect. File line,col= 1538,  20
                 Expected, Actual= 12, 13
         preceding(25) incorrect. File line,col= 1538,  19
                 Expected, Actual= 12, 13
         preceding(24) incorrect. File line,col= 1538,  18
                 Expected, Actual= 12, 13
         preceding(23) incorrect. File line,col= 1538,  18
                 Expected, Actual= 12, 13
         preceding(22) incorrect. File line,col= 1538,  18
                 Expected, Actual= 12, 13
         preceding(21) incorrect. File line,col= 1538,  17
                 Expected, Actual= 12, 13
         preceding(20) incorrect. File line,col= 1538,  17
                 Expected, Actual= 12, 13
         preceding(19) incorrect. File line,col= 1538,  17
                 Expected, Actual= 12, 13
         preceding(18) incorrect. File line,col= 1538,  16
                 Expected, Actual= 12, 13
         preceding(17) incorrect. File line,col= 1538,  16
                 Expected, Actual= 12, 13
         preceding(16) incorrect. File line,col= 1538,  16
                 Expected, Actual= 12, 13
         preceding(15) incorrect. File line,col= 1538,  15
                 Expected, Actual= 12, 0
         preceding(14) incorrect. File line,col= 1538,  15
                 Expected, Actual= 12, 0
         preceding(13) incorrect. File line,col= 1538,  15
                 Expected, Actual= 12, 0
      
         } ERRORS (52) in TestExtended (38ms) 
      
   
      } ERRORS (52) in RBBITest (38ms) 
   

   } ERRORS (52) in rbbi (38ms) 


--------------------------------------
Errors in total: 52.
            TestExtended
         RBBITest
      rbbi
   
--------------------------------------

Nov 20 '23 19:11 FrankYFTang

You diff actually only add one test case

<data>•โอํน• อะไป •จู่วาม •โล่น•</data>

for line break but I think it is incorrect why should the line break happen before the space ? It should be

<data>•โอํน •อะไป •จู่วาม •โล่น•</data>

instead, right?

Nov 20 '23 19:11 FrankYFTang

Looking https://github.com/unicode-org/icu/pull/2676/files#diff-b177067bbc1df57fc40ae7629a81e8df960899b9088555b010680a1c500943e2 Also, the line

<line>
# Should no longer break at the dictionary points - it's not Thai language
...
#<data>•โอํน• •อะไป• •จู่วาม• •โล่น• •เปี่ยร• •อะลู่วาง• •แมะ,• •ปาย• •อัน• •แบ็จ• •อะโจํน• •ซา• •เมาะ.• •อัน• •ฮะบืน• •ตะ• •เวี่ยะ• •ตะ• •งี่ยาน,• •อัน• •ฮะบืน• •อีว• •อะปายฮ.•</data>

should be

<line>
# Should no longer break at the dictionary points - it's not Thai language
...
<data>•โอํน •อะไป •จู่วาม •โล่น •เปี่ยร •อะลู่วาง •แมะ, •ปาย •อัน •แบ็จ •อะโจํน •ซา •เมาะ. •อัน •ฮะบืน •ตะ •เวี่ยะ •ตะ •งี่ยาน, •อัน •ฮะบืน •อีว •อะปายฮ.•</data>

there are no reason to have a line break before the space. Line break should only happen after the SPACE not before the SPACE, right?

Nov 20 '23 19:11 FrankYFTang

Notice: the branch changed across the force-push!

icu4c/source/test/testdata/rbbitst.txt is now changed in the branch
icu4j/main/core/src/test/resources/com/ibm/icu/dev/test/rbbi/rbbitst.txt is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Nov 20 '23 21:11 jira-pull-request-webhook[bot]

@srl295 I copy your test change over but change it. Please read my modified version in this PR and see do you agree with that. The change are

for line break, there should have no line break before the SPACE
for word break, the status should be 200 not 0
for word break, we should break beefore . and , if we treat the Thai as AL.
not using dx=zyyyy . That part of spec is very bad. I file bug https://unicode-org.atlassian.net/browse/CLDR-17247 for that. I do not think we should implement that behavior. It is clearly a spec bug from my point of view.

Nov 20 '23 21:11 FrankYFTang

Notice: the branch changed across the force-push!

icu4c/source/common/rbbi.cpp is different
icu4c/source/test/testdata/rbbitst.txt is different
icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is different
icu4j/main/core/src/test/resources/com/ibm/icu/dev/test/rbbi/rbbitst.txt is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Nov 21 '23 19:11 jira-pull-request-webhook[bot]

Notice: the branch changed across the force-push!

icu4c/source/common/rbbi.cpp is different
icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Nov 21 '23 21:11 jira-pull-request-webhook[bot]

Notice: the branch changed across the force-push!

icu4c/source/common/brkiter.cpp is different
icu4c/source/common/rbbi.cpp is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Nov 21 '23 21:11 jira-pull-request-webhook[bot]

Let me try to be clearer.

Suppose that

The dictionary breakIterator will act on any characters marked A or B below, but will skip over C and D.
The dx vales need to cause the characters B and C to be skipped, but has no effect on characters A and D.

AAABBBCCCDDD

What should happen is that the dictionary break iterator should act on the characters AAA, and otherwise the RBNF rules will act on BBBCCCDDD.

From what I see of your code change, at the first A character, the dictionary breakIterator accepts it, and dx doesn't exclude it. So the dictionary's break iterator gets called. That seems clear.

What is not clear to me is how lbe.findBreaks knows to stop at the first B, because the break iterator internally has no access to the dx exclusion set, and there isn't any other change in your PR that would indicate some way that the iterator's results past the first B would be ignored.

Nov 23 '23 00:11 macchiati

Let me try to be clearer.

Suppose that

The dictionary breakIterator will act on any characters marked A or B below, but will skip over C and D.

The dx vales need to cause the characters B and C to be skipped, but has no effect on characters A and D.

AAABBBCCCDDD

What should happen is that the dictionary break iterator should act on the characters AAA, and otherwise the RBNF rules will act on BBBCCCDDD.

From what I see of your code change, at the first A character, the dictionary breakIterator accepts it, and dx doesn't exclude it. So the dictionary's break iterator gets called. That seems clear.

What is not clear to me is how lbe.findBreaks knows to stop at the first B, because the break iterator internally has no access to the dx exclusion set, and there isn't any other change in your PR that would indicate some way that the iterator's results past the first B would be ignored.

I see. ok, you are right, that is not clear. I need to call excludedFromDictionaryBreak to adjust the startRange and endRange before passing to the findBreaks. the startRange and endRange pass to the findBreaks may need to be changed to a different values excluding these characters.

Nov 23 '23 00:11 FrankYFTang

@FrankYFTang my apologies, I have not found spare time to review this recently. i will let @mhosken in case he's able to review.

it seems it's going in a good direction… certainly feel free to amend my test as as needed (it was meant as an example not as proscriptive) and close the other PR…

Nov 23 '23 01:11 srl295

Np, your feedback is great for catching problems!

On Wed, Dec 6, 2023 at 1:16 PM Frank Yung-Fong Tang < @.***> wrote:

@.**** commented on this pull request.

In icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java https://github.com/unicode-org/icu/pull/2702#discussion_r1417986914:
   String dxs) {
   if (dxs == null) {
       return null;
   }
   if (dxs.length() % 5 != 4) {
       throw new IllegalArgumentException("Incorrect value for dx key: " + dxs);
   }
   // Change from "thai" to "[[:scx=thai:]]" or "thai-arab" to "[[:scx=thai:][:scx=arab:]]"
   StringBuilder builder = new StringBuilder("[");
   int items = 1 + (dxs.length() / 5);
   for (int i = 0; i < items; i++) {
       if (i > 0 && dxs.charAt(i*5-1) != '-') {
           throw new IllegalArgumentException("Incorrect value for dx key: " + dxs);
       }
       String script = dxs.substring(i*5, i*5+4);
       // Special handling of zyyy
oh. Just saw what you landed in https://github.com/unicode-org/cldr/pull/3411/files that make sense. Sorry. Ignore my previous comments.

— Reply to this email directly, view it on GitHub https://github.com/unicode-org/icu/pull/2702#discussion_r1417986914, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACJLEMFIZ2UB57SAEBL2FWDYIDOEBAVCNFSM6AAAAAA7LXMYCKVHI2DSMVQWIX3LMV43YUDVNRWFEZLROVSXG5CSMV3GSZLXHMYTONRYGYZTKMRTGM . You are receiving this because you were assigned.Message ID: @.***>

Dec 06 '23 22:12 macchiati

Please ignore my update. I am still working on this PR. It is not ready for review. After Mark point out some issue, I found my design was wrong and need a more intensify rework.

Dec 12 '23 01:12 FrankYFTang

Notice: the branch changed across the force-push!

icu4c/source/common/rbbi.cpp is different
icu4c/source/test/intltest/rbbitst.cpp is different
icu4c/source/test/intltest/rbbitst.h is different
icu4j/main/core/src/main/java/com/ibm/icu/impl/breakiter/CjkBreakEngine.java is now changed in the branch
icu4j/main/core/src/main/java/com/ibm/icu/text/RuleBasedBreakIterator.java is different
icu4j/main/core/src/test/java/com/ibm/icu/dev/test/rbbi/RBBITest.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Dec 12 '23 01:12 jira-pull-request-webhook[bot]

I'm getting user complaints again on this. Can we action this. Some fix for disabling dictionary breaking has been requested since, well I can't find out since I can't get to the old bug tracker, but it came into the latest tracker in 2019.

Perhaps the best is the enemy of the good here? The only people, that I know of, that are affected by this are those using minority languages, are inserting ZWSP for word breaks and are dealing with correctly tagged text. Do we have to refine this fix for the non use cases as well, before we can fix for the actual use case?

I'm sorry that my frustration is showing. But we seem to be more concerned about people who do the wrong thing than those who do the right thing (and tag correctly, by some definition). The really correct solution is that if the text is not tagged with the language of the dictionary, then no dictionary breaking should occur. I realise that that is just too much for most people and so we have special tagging. But can we please get something out for these users who are able to tag correctly?

Please shipit already.

Apr 15 '24 10:04 mhosken

@FrankYFTang is this going to be merged for 75?

Apr 15 '24 14:04 srl295

no, the issue is more complicated than my PR did.

Apr 15 '24 22:04 FrankYFTang

no, the issue is more complicated than my PR did.

Do you have more detail?

Apr 16 '24 00:04 srl295

no, the issue is more complicated than my PR did.

Do you have more detail?

It require more detail analysis and testings for different cases than what I put into this PR. I missed some complicated combination.

Apr 16 '24 22:04 FrankYFTang

no, the issue is more complicated than my PR did.

Do you have more detail?

It require more detail analysis and testings for different cases than what I put into this PR. I missed some complicated combination.

Frank, It's opt-in. COuld we consider moving forward with this even if it needs additional work? It shouldn't be a problem for users that don't set -u-dx

Aug 09 '24 02:08 srl295

sorry, I need to pick up the work again.

Aug 12 '24 18:08 FrankYFTang