
CLDR-18187 Add complex segmentation to scriptMetadata.txt

Open sffc opened this issue 11 months ago • 13 comments

CLDR-18187

  • [x] This PR completes the ticket.

CC @eggrobin @makotokato @Manishearth

See ticket for details. The issue was discussed in multiple CLDR Design WG meetings, but this specific solution was not.

ALLOW_MANY_COMMITS=true

sffc avatar Jan 06 '25 19:01 sffc

100% on the value of the data. Is it time to move this to an XML document though perhaps in supplemental? Could still output it as a .txt for release.

srl295 avatar Jan 07 '25 20:01 srl295

OK, I added it to the Java file and the Google Sheets.

While doing this, I realized, is this data meaningfully different from the column "LB letters"?

CC @markusicu

sffc avatar Jan 08 '25 23:01 sffc

We could potentially add a third value to the enumeration in the LB Letters column to distinguish scripts like Thai, which need a dictionary for word and line segmentation, from Han, which needs a dictionary for only word segmentation.

sffc avatar Jan 09 '25 00:01 sffc

Idea: Consider changing LBLetters(Hani) to "No" but adding WBLetters and making that "Yes" for Hani.

markusicu avatar Jan 09 '25 00:01 markusicu

Good idea!

macchiati avatar Jan 09 '25 00:01 macchiati

I think Shane's idea is a bit simpler. The question is whether we know of any APIs that expose the value as a boolean; if so, they would need a code change when they read the data.

macchiati avatar Jan 09 '25 00:01 macchiati

I kind-of like a new column because (1) it doesn't break users of the old column and (2) it would potentially allow for scripts that need special rules for line break but not for word break (say, line break allowed on syllable boundaries).

sffc avatar Jan 09 '25 00:01 sffc

# 7 - LB letters:
#		YES if the major languages using the script allow linebreaks between letters (excluding hyphenation). 
#		Derived from LB property.

How is that derivation actually done? Depending on how you interpret "between letters", the values in this file look wrong (or at the very least inconsistent) for all but one of the scripts that use the Brahmic style of line breaking (see https://www.unicode.org/reports/tr14/#BreakOpportunities).

Bali; 33; 1B05; ID; 1; LIMITED_USE; NO; NO; YES; NO; NO; NO
Batk; 33; 1BC0; ID; 1; LIMITED_USE; NO; NO; YES; NO; NO; NO
Brah; 33; 11005; IN; 1; EXCLUSION; NO; NO; YES; NO; NO; NO
Cham; 33; AA00; VN; 1; LIMITED_USE; NO; NO; YES; NO; NO; NO
Diak; 33; 1190C; MV; 1; EXCLUSION; NO; NO; YES; YES; NO; NO
Gran; 33; 11315; IN; 1; EXCLUSION; NO; NO; NO; NO; NO; NO
Gukh; 33; 1611C; NP; 1; EXCLUSION; NO; NO; YES; NO; NO; NO
Java; 33; A984; ID; 1; LIMITED_USE; NO; NO; YES; NO; NO; NO
Kawi; 33; 11F1B; ID; 1; EXCLUSION; NO; YES; YES; NO; NO; NO # That one looks correct.
Maka; 33; 11EE5; ID; 1; EXCLUSION; NO; NO; MIN; NO; NO; NO
Tutg; 33; 11392; IN; 1; EXCLUSION; NO; NO; YES; NO; NO; NO
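
To make the inconsistency easier to see, here is a minimal Python sketch that reads the rows quoted above and lists the scripts whose "LB letters" field is YES. It assumes the fields are numbered from zero with the script code as field 0, so "7 - LB letters" is the eighth semicolon-separated value; that reading matches Kawi being the row that "looks correct". The names `SAMPLE`, `LB_LETTERS`, and `parse_rows` are made up for this illustration and are not part of any CLDR tooling.

```python
# Rows copied from the comment above; trailing "#" comments are allowed.
SAMPLE = """\
Bali; 33; 1B05; ID; 1; LIMITED_USE; NO; NO; YES; NO; NO; NO
Batk; 33; 1BC0; ID; 1; LIMITED_USE; NO; NO; YES; NO; NO; NO
Brah; 33; 11005; IN; 1; EXCLUSION; NO; NO; YES; NO; NO; NO
Cham; 33; AA00; VN; 1; LIMITED_USE; NO; NO; YES; NO; NO; NO
Diak; 33; 1190C; MV; 1; EXCLUSION; NO; NO; YES; YES; NO; NO
Gran; 33; 11315; IN; 1; EXCLUSION; NO; NO; NO; NO; NO; NO
Gukh; 33; 1611C; NP; 1; EXCLUSION; NO; NO; YES; NO; NO; NO
Java; 33; A984; ID; 1; LIMITED_USE; NO; NO; YES; NO; NO; NO
Kawi; 33; 11F1B; ID; 1; EXCLUSION; NO; YES; YES; NO; NO; NO # That one looks correct.
Maka; 33; 11EE5; ID; 1; EXCLUSION; NO; NO; MIN; NO; NO; NO
Tutg; 33; 11392; IN; 1; EXCLUSION; NO; NO; YES; NO; NO; NO
"""

# Assumption: zero-based field index, with the script code as field 0.
LB_LETTERS = 7


def parse_rows(text):
    """Return {script code: list of stripped fields}, ignoring '#' comments."""
    rows = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        rows[fields[0]] = fields
    return rows


rows = parse_rows(SAMPLE)
lb_yes = sorted(code for code, fields in rows.items() if fields[LB_LETTERS] == "YES")
print(lb_yes)  # → ['Kawi']
```

Of the eleven quoted rows, only Kawi carries YES in that field, which is the inconsistency being pointed out.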

eggrobin avatar Jan 09 '25 00:01 eggrobin

Bali, Java, Hatr, and Elym have comments in the spreadsheet (https://docs.google.com/spreadsheets/d/1Y90M0Ie3MUJ6UVCRDOypOtijlMDLNNyyLk36T6iMu0o/edit?gid=0#gid=0) saying that they might be wrong.

But if we go by that description of the column, I would expect Thai to be "NO", because Thai should have line breaks at word boundaries. I've seen bugs before where the break engine found breaks in the middle of words, and it was wrong.

sffc avatar Jan 09 '25 00:01 sffc

Shane: The description of LB letters doesn't reference word breaks at all. It is just a question of whether you can get line breaks between two characters XY, where X and Y are letters of that script.

Robin: The spreadsheet data for that column isn't derived, and probably predates https://www.unicode.org/reports/tr14/#LB28a. Ideally the data would be maintained in the UCD, but the UTC didn't want to have script metadata when the subject was raised (ages ago). If it were, we could have invariant tests for that.

macchiati avatar Jan 09 '25 01:01 macchiati

Seems I dropped the ball on following up here. We were last discussing what the schema should be.

Some ideas:

  1. New column "Complex Segmentation Required" with values NO, W, and WL (current proposal in this PR)
  2. Change column "LB letters" to add a third enumeration value, perhaps W, to indicate that line breaking between letters is only permitted at a word boundary
  3. Add a column "LB words" with values YES and NO to indicate whether line breaks should occur at word boundaries. With this new column, Thai would be YES and Hani would be NO. I guess most scripts should also say YES, although it might be confusing since UAX 14 has different rules than UAX 29.
  4. Add a column "LB any letter" with values YES and NO to indicate whether line breaks can occur at any letter boundary. With this new column, Hani would be YES and Thai would be NO. Most scripts would say NO.

With options 2-4, we might also need a column "WB letters" indicating whether word breaks can occur between letters.

A complete scheme could then be:

| Script | LB letters | WB letters | LB letter at WB |
| --- | --- | --- | --- |
| Latn | NO | NO | NA |
| Hani | YES | YES | NO |
| Thai | YES | YES | YES |
| A* | YES | NO | NA |
| B* | NO | YES | NA |

Scripts A and B are placeholders for potential scripts with the following behavior:

  • A: Words are not broken between letters, but line breaks can be between letters. Example: a script that uses spaces but still allows mid-word line breaks. Think of Japanese but with spaces.
  • B: Lines are not broken between letters, but words can be broken between letters. Example: a script that uses spaces and always wraps lines at spaces, but contains compound words that are split between letters. Think of German with long compound words being split up in the WordSegmenter but not the LineSegmenter.

I think my original proposal can fully express this, though a bit differently:

| Script | LB letters (existing) | Complex Segmentation Required |
| --- | --- | --- |
| Latn | NO | NO |
| Hani | YES | W |
| Thai | YES | WL |
| A | YES | NO |
| B | NO | W |

I think NO+WL is not possible because the complex segmentation for line break implies line breaks between letters.
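
As a rough check of that claim, the following Python sketch converts each (LB letters, Complex Segmentation Required) pair into the three-column scheme and compares the result with the first table. The conversion rule in `expand` is an inference from the two tables, not something defined by CLDR, and all names here are hypothetical.

```python
# (LB letters, Complex Segmentation Required), from the second table.
PROPOSAL = {
    "Latn": ("NO", "NO"),
    "Hani": ("YES", "W"),
    "Thai": ("YES", "WL"),
    "A": ("YES", "NO"),
    "B": ("NO", "W"),
}

# (LB letters, WB letters, LB letter at WB), from the first table.
THREE_COLUMN = {
    "Latn": ("NO", "NO", "NA"),
    "Hani": ("YES", "YES", "NO"),
    "Thai": ("YES", "YES", "YES"),
    "A": ("YES", "NO", "NA"),
    "B": ("NO", "YES", "NA"),
}


def expand(lb_letters, complex_seg):
    """Map the two-column proposal onto the three-column scheme (assumed rule)."""
    if lb_letters not in ("YES", "NO") or complex_seg not in ("NO", "W", "WL"):
        raise ValueError(f"unexpected values: {lb_letters!r}, {complex_seg!r}")
    if lb_letters == "NO" and complex_seg == "WL":
        # Complex line segmentation implies line breaks between letters.
        raise ValueError("NO+WL is not a valid combination")
    # W and WL both mean word breaks can fall between letters.
    wb_letters = "NO" if complex_seg == "NO" else "YES"
    # "LB letter at WB" only applies when both kinds of break occur between letters.
    if lb_letters == "YES" and wb_letters == "YES":
        lb_at_wb = "YES" if complex_seg == "WL" else "NO"
    else:
        lb_at_wb = "NA"
    return (lb_letters, wb_letters, lb_at_wb)


mismatches = [s for s, pair in PROPOSAL.items() if expand(*pair) != THREE_COLUMN[s]]
print(mismatches)  # → []
```

Under this rule all five rows round-trip, and NO+WL is rejected as an invalid combination.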

sffc avatar Sep 08 '25 04:09 sffc

I see that there was a question about whether there is a boolean-returning API for “LB Letters”. The answer is yes, in both ICU4C & ICU4J. https://unicode-org.github.io/icu-docs/apidoc/released/icu4j/com/ibm/icu/lang/UScript.html#breaksBetweenLetters-int-
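
To illustrate why a boolean API matters for option 2, here is a small Python sketch of two hypothetical readers of the "LB letters" field. Neither is the ICU implementation; they only show the decision a boolean consumer would face if the column gained a third value such as W.

```python
def breaks_between_letters_strict(value: str) -> bool:
    """Hypothetical reader that accepts only the two values used today."""
    if value not in ("YES", "NO"):
        raise ValueError(f"unexpected LB letters value: {value!r}")
    return value == "YES"


def breaks_between_letters_lenient(value: str) -> bool:
    """Hypothetical reader that treats a third value W as a restricted YES."""
    return value in ("YES", "W")


print(breaks_between_letters_lenient("W"))  # → True
try:
    breaks_between_letters_strict("W")
except ValueError as error:
    print(error)
```

Either way, a reader written against YES/NO needs a code change to handle W, whereas a new column leaves such readers untouched.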

markusicu avatar Sep 09 '25 20:09 markusicu

Markus pointed out scripts like Javanese (also Balinese, Brahmi, Kawi, ...), which have line breaks between orthographic syllables but require a dictionary for word breaks.

Example: http://efele.net/udhr/d/udhr_jav_java.html

sffc avatar Nov 03 '25 19:11 sffc