icu4x Should the Segmenter types accept a locale?

In the API review, @markusicu pointed out that ICU takes a locale in the segmenter, and the locale affects the behavior in certain cases, such as those in the data files below:

Data bundles that contain some language-specific data for sentence segmentation: https://github.com/unicode-org/icu/tree/main/icu4c/source/data/brkitr
fi_sv override for word break: https://github.com/unicode-org/icu/blob/main/icu4c/source/data/brkitr/rules/word_fi_sv.txt
el override for sentence break: https://github.com/unicode-org/icu/blob/main/icu4c/source/data/brkitr/rules/sent_el.txt

Why don't we support these in ICU4X Segmenter, and should we add them?

For 1.2 purposes, we have a few choices:

1. Add the locale parameter now and don't use it for anything yet
2. Don't add the locale parameter, but add something like _invariant to the constructor names, so that in the future try_new_auto_invariant() creates the locale-invariant segmenter and try_new_auto(locale!("el")) creates the locale-specific segmenter
3. Keep things the way they are and add locale constructors later, possibly adopting the style above in 2.0
4. Add the parameter to Word and Sentence, but not Line or Grapheme
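
To make the naming in option 2 concrete, here is a minimal sketch. The `SentenceSegmenter` and `DataError` types below are stand-ins written for this illustration, not the ICU4X types; only the constructor names `try_new_auto_invariant` and `try_new_auto` come from the option above, and the locale is modeled as a plain string instead of `locale!("el")`.

```rust
/// Stand-in for a segmenter type; hypothetical, not the ICU4X type.
#[derive(Debug, PartialEq)]
pub struct SentenceSegmenter {
    /// `None` means locale-invariant behavior.
    tailoring: Option<String>,
}

/// Placeholder error type for this sketch.
#[derive(Debug, PartialEq)]
pub struct DataError;

impl SentenceSegmenter {
    /// Option 2: the name itself says that no locale is consulted.
    pub fn try_new_auto_invariant() -> Result<Self, DataError> {
        Ok(Self { tailoring: None })
    }

    /// Possible future constructor: same base name, plus a locale.
    /// The locale is a plain string here (an assumption of the sketch).
    pub fn try_new_auto(locale: &str) -> Result<Self, DataError> {
        if locale.is_empty() {
            return Err(DataError);
        }
        Ok(Self {
            tailoring: Some(locale.to_string()),
        })
    }

    /// Which tailoring, if any, this segmenter was built with.
    pub fn tailoring(&self) -> Option<&str> {
        self.tailoring.as_deref()
    }
}
```

With this shape, adding the locale-taking constructor later would not change the meaning of any name that already shipped.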

Thoughts?

@aethanyc @makotokato @Manishearth

Apr 11 '23 00:04 sffc

General preference for #3

Do we plan to provide these locale-ish APIs in the near term? I actually think future try_new_auto_with_locale() would be fine

Apr 11 '23 00:04 Manishearth

General preference for #2

I'd prefer the default constructor names to be consistent in behavior as much as makes sense. If we believe Segmenter constructors will want to take locale just like all others, let's keep the names for those constructors. If, in the future, we decide to not add those names, we can always alias the default constructor names to the _invariant ones.

Apr 11 '23 04:04 zbraniecki

Why hasn't this ICU4C rule been merged into, or proposed for, UAX #29? Does ICU4C plan to file an issue to get it merged into UAX #29? Once this change is merged into UAX #29, then #3.

Apr 11 '23 11:04 makotokato

UAX #29 in general doesn't really want to include locale-specific stuff because it wants to leave that up to CLDR.

Apr 11 '23 15:04 Manishearth

Suffix suggestions:

_invariant
_root (wrong: we don't have CLDR root tailorings)
_uax (too restrictive: we want to add CLDR tailorings)
_untailored (wrong: LineBreakOptions has tailorings)
_default (could be restrictive with regard to default data)

Apr 11 '23 15:04 sffc

Since we will want to take locales as parameters (even for segmenters where that isn't implemented yet), IMO we should make that the "normal" case.

Apr 11 '23 17:04 macchiati

From discussion with @aethanyc @makotokato @Manishearth @nordzilla: It is an enhancement to consume CLDR root.xml tailorings, but not necessarily a bug. We would like to see it done in a timely fashion.

Conclusion: use _invariant

Apr 11 '23 23:04 sffc

For the record we didn't use _invariant in #3294

Apr 13 '23 14:04 robertbastian

Data bundles that contain some language-specific data for sentence segmentation: https://github.com/unicode-org/icu/tree/main/icu4c/source/data/brkitr

These seem to be lists of abbreviations that contain a period that doesn't end a sentence. How bad would it be to merge the lists and use the merged lists across languages?

fi_sv override for word break: https://github.com/unicode-org/icu/blob/main/icu4c/source/data/brkitr/rules/word_fi_sv.txt

It's a bit sad that treating letter, colon, letter as having a word break opportunity after the colon is a case of giving computer syntax needs precedence over natural-language needs. If accommodating computer syntaxes wasn't given priority, the Finnish/Swedish requirement of not treating letter, colon, letter as containing a word break opportunity could be hoisted to root.

el override for sentence break: https://github.com/unicode-org/icu/blob/main/icu4c/source/data/brkitr/rules/sent_el.txt

This seems to be about ASCII semicolon having sentence-ending question mark semantics. Could this be accommodated in the root by triggering the rule on the most recent letter being from the Greek script?
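
A minimal sketch of that idea, under stated assumptions: the function names are invented for this illustration, "Greek script" is approximated by the Greek and Coptic and Greek Extended block ranges rather than the real Script property, and none of this is ICU4X or ICU4C code.

```rust
/// Block-range approximation of "Greek script" (an assumption of this sketch):
/// Greek and Coptic (U+0370..=U+03FF) and Greek Extended (U+1F00..=U+1FFF).
fn is_greekish(c: char) -> bool {
    matches!(c, '\u{0370}'..='\u{03FF}' | '\u{1F00}'..='\u{1FFF}')
}

/// Returns true if the `;` at byte offset `semicolon_idx` should be treated
/// as a sentence terminator because the most recent letter before it is Greek.
pub fn semicolon_ends_sentence(text: &str, semicolon_idx: usize) -> bool {
    // `get` returns None for out-of-range or non-boundary offsets.
    let is_semicolon = text
        .get(semicolon_idx..)
        .map_or(false, |rest| rest.starts_with(';'));
    if !is_semicolon {
        return false;
    }
    // Walk backwards to the nearest alphabetic character, skipping digits,
    // spaces, and punctuation.
    text[..semicolon_idx]
        .chars()
        .rev()
        .find(|c| c.is_alphabetic())
        .map_or(false, is_greekish)
}
```

A real rule would also need to bound how far back the search may look, which this sketch does not do.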

Aug 31 '23 16:08 hsivonen

It's a bit sad that treating letter, colon, letter as having a word break opportunity after the colon is a case of giving computer syntax needs precedence over natural-language needs. If accommodating computer syntaxes wasn't given priority, the Finnish/Swedish requirement of not treating letter, colon, letter as containing a word break opportunity could be hoisted to root.

I think German needs this tailoring as well. I don't know why Finnish and Swedish do, but in German a colon is commonly used to form gender-neutral nouns, like Lehrer:in, which should not contain any word breaks.

What's the current process for updating the tailorings? ICU or CLDR?

Jan 24 '24 10:01 robertbastian

@robertbastian this was recently discussed in the CLDR design meeting, CLDR issue: https://unicode-org.atlassian.net/browse/CLDR-15910 / PAG issue: https://github.com/unicode-org/properties/issues/187 (internal)

There's thought that this should actually be made to apply to all languages, since colons without spaces on either side are not really a thing in regular text anyway, and if the space has been removed there's a good chance it's on purpose.

Jan 24 '24 18:01 Manishearth

Data bundles that contain some language-specific data for sentence segmentation: https://github.com/unicode-org/icu/tree/main/icu4c/source/data/brkitr

These seem to be lists of abbreviations that contain a period that doesn't end a sentence. How bad would it be to merge the lists and use the merged lists across languages?

That's exactly what they are. https://www.unicode.org/reports/tr35/tr35-general.html#Segmentation_Exceptions for more details.

The lists for one language may not be applicable for others. But you could probably calculate a list that's likely to be generally useful, it might be less useful for any particular language.
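
The trade-off can be sketched with toy suppression lists. Everything here is illustrative: the function names are invented, the lists are placeholders rather than CLDR data, and "App." as a German abbreviation is the example raised later in this thread.

```rust
use std::collections::HashSet;

/// Builds a suppression set from a list of abbreviations.
pub fn suppressions(abbrevs: &[&'static str]) -> HashSet<&'static str> {
    abbrevs.iter().copied().collect()
}

/// Merges two suppression sets into one cross-language set.
pub fn merged(
    a: &HashSet<&'static str>,
    b: &HashSet<&'static str>,
) -> HashSet<&'static str> {
    a.union(b).copied().collect()
}

/// A token that ends in a period ends the sentence unless it is a
/// known abbreviation in the given suppression set.
pub fn ends_sentence(token: &str, suppressed: &HashSet<&'static str>) -> bool {
    token.ends_with('.') && !suppressed.contains(token)
}
```

With an English-only list, "App." at the end of a token would end the sentence; with a German-only or merged list it would not. That is the sense in which a merged list can be less useful for any particular language.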

Jan 24 '24 21:01 srl295

I think German needs this tailoring as well.

I think https://unicode-org.atlassian.net/browse/CLDR-15910 should be reverted on the root level so that we don't need tailorings to accommodate natural languages.

I don't know why Finnish and Swedish do

For Finnish, the use case is marking where a sufficiently unusual word body (e.g. acronym) ends and the case suffix starts. For example, English Henri’s would be Henrin in Finnish but English ICU4X’s would be ICU4X:n in Finnish. The use case for Swedish seems to be also about applying suffixes (though not case suffixes) to sufficiently unusual word bodies. (Consider an analog of English Londoner but with the suffix applied to e.g. a sports team acronym.)
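
As a toy illustration of what the tailoring changes (this is not UAX #29 and not ICU4X code; `toy_words` and its `colon_joins` flag are invented for the sketch), consider a splitter that treats runs of alphanumeric characters as words and optionally keeps a colon that sits between two such characters:

```rust
/// Splits `text` into runs of alphanumeric characters ("words").
/// When `colon_joins` is true, a ':' flanked by alphanumerics on both
/// sides stays inside the word, mimicking the fi/sv behavior.
pub fn toy_words(text: &str, colon_joins: bool) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    let mut words = Vec::new();
    let mut current = String::new();
    for (i, &c) in chars.iter().enumerate() {
        // A colon joins only when both neighbors exist and are alphanumeric.
        let joins = colon_joins
            && c == ':'
            && i > 0
            && i + 1 < chars.len()
            && chars[i - 1].is_alphanumeric()
            && chars[i + 1].is_alphanumeric();
        if c.is_alphanumeric() || joins {
            current.push(c);
        } else if !current.is_empty() {
            // Any other character closes the current word.
            words.push(std::mem::take(&mut current));
        }
    }
    if !current.is_empty() {
        words.push(current);
    }
    words
}
```

With `colon_joins` set, ICU4X:n and Lehrer:in each stay a single word, while a colon followed by a space (as in "key: value") still separates words.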

Jan 25 '24 08:01 hsivonen

Okay, for 2.0 purposes, which of the four segmenters requires a locale parameter?

Grapheme: Are there locale-specific CLDR tailorings for graphemes?
Word: It sounds like people want to move the fi_sv tailorings to the root, which would obviate the need for RBBI tailoring. However, locale info could still help with complex language segmentation, although we need to know the language of the text, not of the user.
Sentence: Seems like this is the biggest use case, although it is still about the language of the text and not of the user.
Line: I think this one is invariant.

The "language of the text" would be more appropriate to provide in the terminal segment function since it is an attribute of the text, but since that requires data loading, it might be more appropriate to specify it in the constructor. Alternatively, we could stuff all sentence language tailorings into a single data key which is always loaded when making a sentence segmenter, as we do for the word break segmenter.
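
The two shapes can be contrasted in a small sketch. All type and method names below are hypothetical, and `Tailoring` merely stands in for whatever data would actually be loaded.

```rust
use std::collections::HashMap;

/// Hypothetical tailoring payload; stands in for loaded data.
#[derive(Debug, Clone, PartialEq)]
pub struct Tailoring(pub String);

/// Shape 1: the language is fixed at construction, so only that
/// language's tailoring needs to be loaded.
pub struct PerLanguageSegmenter {
    tailoring: Option<Tailoring>,
}

impl PerLanguageSegmenter {
    pub fn new(content_language: Option<&str>) -> Self {
        Self {
            tailoring: content_language.map(|l| Tailoring(l.to_string())),
        }
    }

    /// Number of tailorings held in memory (0 or 1).
    pub fn loaded_tailorings(&self) -> usize {
        self.tailoring.iter().count()
    }
}

/// Shape 2: every tailoring is loaded up front (the "single data key"
/// idea), and the language is chosen per call.
pub struct AllLanguagesSegmenter {
    tailorings: HashMap<String, Tailoring>,
}

impl AllLanguagesSegmenter {
    pub fn new(all_languages: &[&str]) -> Self {
        let tailorings = all_languages
            .iter()
            .map(|l| (l.to_string(), Tailoring(l.to_string())))
            .collect();
        Self { tailorings }
    }

    /// Number of tailorings held in memory.
    pub fn loaded_tailorings(&self) -> usize {
        self.tailorings.len()
    }

    /// The tailoring that would apply to text in `language`, if any.
    pub fn tailoring_for(&self, language: &str) -> Option<&Tailoring> {
        self.tailorings.get(language)
    }
}
```

Shape 1 keeps data loading proportional to what is used but ties one segmenter to one language; shape 2 lets the language vary per text at the cost of always loading everything.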

Jan 26 '24 00:01 sffc

Grapheme: Are there locale-specific CLDR tailorings for graphemes?

Not yet. Please put it into the API. I was planning a work item to move this forward. This is, for example, for languages that want to keep "ch" together etc.

Jan 26 '24 04:01 srl295

Grapheme

Please put it into the API.

On the flip side, putting this in the API really requires making ECMA-402 have a way to explicitly ask for root and to default to root.

Some users getting a different definition of extended grapheme clusters based on the browser UI locale would likely be bad, after developers having assumed for years that extended grapheme clusters are a Unicode-level concept and not a locale-level concept. Also, it would be bad to have to assume that English is always going to be the untailored language and to teach every developer to ask for a grapheme segmenter for English in order to get behavior on a similar level of stability that one would expect of e.g. Swift strings.

This is for example languages that want to keep "ch" together etc.

What languages do you mean and why do they want to keep "ch" together for the kind of purposes that extended grapheme clusters are used for, such as denying the selection of only "c" or only "h"? Czech treats "ch" as a collation unit, but do users of the language expect not to be able to select "c" and "h" individually?

Jan 26 '24 13:01 hsivonen

We (Poles and Czechs) expect to be able to select each letter separately but would really prefer not to break lines between those two letters when they're next to each other.

Jan 29 '24 05:01 zbraniecki

Also, generally speaking, not having a locale option is likely to disproportionately cause challenges for minority users of the script. Yes, you might seem to have a common solution for majority users, but then there isn't a customization option if something goes against the majority use. sss-Thai line breaking is an example.

Jan 29 '24 22:01 srl295

Discuss with:

@eggrobin
@Manishearth
@hsivonen

Optional:

@sffc
@markusicu
@zbraniecki

Feb 01 '24 18:02 sffc

@eggrobin - Currently we don't have a locale parameter in Segmenter APIs at all?
@sffc - Currently the complex break engine is used when complex script characters are encountered.
@eggrobin
@hsivonen - There is a combined CJK dictionary for the Han script. I'm not capable of evaluating the quality of it, but it is the same one as in ICU4C. My understanding is that it works for Japanese and Mandarin and the subset of Cantonese that looks like Mandarin, but it doesn't work for colloquial Cantonese. I don't know whether adding colloquial Cantonese to the mix would hurt the segmentation for other languages.
@Manishearth - My feeling is that it would be unlikely that adding colloquial Cantonese would harm other languages.
@hsivonen - It seems to me that there may be cases where the hint is useful, but until we have those cases, it could be harmful to the users of the API where it looks like a locale is used but it's really not. This is the state we're currently in with ECMA-402.
@eggrobin - CLDR has decided that they're putting the : thing back into root.
@hsivonen - How bad would it be if the abbreviation dictionaries were merged? The Greek thing (question mark) should be decidable from the context; if there is a Greek character close enough... what do we currently know about, how much could we do without having a language hint, and then what does the language hint buy us? So I'm concerned about having an API that does basically nothing.
@hsivonen - I'm a bit worried about the notion that the grapheme cluster would need a language parameter in order to keep CH together. I'm skeptical about users of languages who would want to select CH together, for example, since that's a place where grapheme segmentation is used today. Swift ended up with the notion that extended grapheme clusters are super fundamental. There has been years and years of opportunity for users to think of grapheme clusters as a Unicode-level thing instead of a locale-level thing. One way to get ahead is to make ECMA-402's default for segmentation to be the root locale.
@hsivonen - So in summary that's why I think it's harmful to add these parameters ahead of a time when they actually do anything.
@eggrobin - I agree with @hsivonen on Grapheme segmentation. It is baked into Swift, it is used in C++23 in std::format, ... so trying to make it language-dependent is a ship that sailed in 2003. If there is someone who wanted to tailor grapheme cluster segmentation, we need to understand their use case and give them their own API. I have fewer opinions about word segmentation.
@sffc - Two things. (1) sentence segmentation probably needs hints. "App." is an abbreviation in German but it is a word in English. (2) Embedding the language inside of the string of text could have some advantages, because they can carry hints for multilingual text in the same string, and it doesn't need to impact an API.
@hsivonen - Your keyboard could generate these language tags, but that doesn't help with English text written with a Finnish keyboard layout, for example. So the language hints embedded in the string only work when you switch keyboards.
@eggrobin - About carrying language tags into the text, that is something about the representation of plain text, and therefore something we should discuss with UTC. The tag characters in Plane 14 « hazmat disposal », now solely used for subregion flags were initially introduced for such a mechanism, which was then deprecated. Given that they are deprecated, the UTC may not be keen to resurrect the language tags.
@hsivonen - There is the "language attribute" in the Web platform. In the browser context, where you separate what is human-facing, ... I don't expect a new way of doing language tagging on the Web to fly. What we have is the lang attribute, and that's it.

https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/lang

@hsivonen - One question is when do you pass these into the segmenter? In the constructor or individual segment calls? Initially it looks troublesome to rerun the constructor every time.
@hsivonen - Overall, it seems that sentence segmentation has the strongest use case for a language parameter. We can probably get away without it for word segmentation.
@sffc - There are some more examples where CJK have tailorings for word and line break.

https://github.com/unicode-org/cldr/blob/main/common/segments/ja.xml

@hsivonen - Well, maybe these tailorings should also be in the root. Or, can we dispatch these rules on the script in a similar way as we dispatch for Khmer, Thai, etc.?
@eggrobin - On line breaking, there's one thing ICU doesn't do that UAX 14 recommends: resolving quotation marks based on language (QU, ambiguous quotation marks). I think different languages use quotation marks in different ways. There are heuristics baked into the algorithm. There are also single quotation marks, which are used very differently across languages. I intend to improve the heuristics, but at the end of the day they are just heuristics, and a language hint could help, and I understand that this is done by Apple.
@hsivonen - Is relying on the language hint more successful than relying on the heuristics? For example, whether there is space on one or both sides.
@eggrobin - Yeah, that's how it works.
@hsivonen - If you take the language from the UI, there's a good chance that the heuristic is more accurate.
@eggrobin - You can make heuristics that are overly cautious. And if you have a language hint, well the hint needs to be right or else all bets are off.
@Manishearth - Some social media sites tend to autotag languages, often incorrectly
@sffc - It seems like the language hint should probably be an optional parameter. If present, then we can assume it is correct.
@eggrobin - @sffc said that language hints probably matter the most for sentence segmentation, but sentence segmentation is the segmentation that matters the least. It is already broken: I can write "George W. Bush" and it is 2 sentences.
@hsivonen - I've heard requests for sentence segmentation as applied to machine learning.
@echeran - Sentence segmentation gets used in localization tools a lot
@hsivonen - @eggrobin, what's your view on semicolon being a Greek question mark?
@eggrobin - I don't like to think about sentence segmentation. We could stuff a heuristic in there. (shrugs)
@eggrobin - Besides the fi/sv tailoring being upstreamed, it is a good illustration of the kind of thing we could see there.

Initial recommendations:

No language parameter for grapheme cluster segmenter

LGTM: @sffc @eggrobin @hsivonen @Manishearth

Language parameter for the other three segmenters should be optional parameter, potentially on a separate function, well-documented and geared toward power users who can ensure the quality of the hint

LGTM: @sffc @hsivonen @eggrobin @Manishearth

Plan to put the parameter on the .segment() function, dependent on investigation of the call sites in order to optimize data slicing.

LGTM: @sffc @hsivonen @Manishearth

@echeran - +1 for review and alignment

Mar 12 '24 15:03 sffc

colloquial Cantonese

Henri later provided me context: https://github.com/tc39/proposal-intl-segmenter/issues/133#issuecomment-779777282

Since "colloquial cantonese" is ambiguous, I'll expand on this a bit: the distinction being talked about is "standard written Chinese" (which can be used to write spoken Cantonese but uses a lot more Mandarin vocabulary — it's complicated), vs "written Cantonese", which is more of a 1-1 mapping to spoken Cantonese and may use Cantonese-specific words and characters.

It's also sometimes called "written vernacular Cantonese"; however, this can be somewhat ambiguous since Written Vernacular Chinese is something else.

Mar 12 '24 15:03 Manishearth

CLDR has decided that they're putting the : thing back into root.

Yes, but the Apple rep in the meeting, who is originally from Sweden, insisted that CLDR keep the fi/sv word break tailorings, because he thinks that even the future keyword selecting technical usage should keep fi/sv words together across colon.

No language parameter for grapheme cluster segmenter

+1

Language parameter for the other three segmenters

+1

Plan to put the parameter on the .segment() function

That seems weird both from looking at usage and thinking about data loading.

I strongly expect that someone should be able to get a Segmenter object and just use it to find/iterate over segments without knowing about additional options.
I expect tailorings to require some different data that should be loaded in the "constructor". Depending on the implementation, there may be a totally different blob or a small delta, but probably generally some non-zero tailoring-specific data.

Mar 12 '24 17:03 markusicu

The conclusions from the discussion of this issue with the CLDR design group:

Grapheme clusters should not be language-specific; baked into much low-level processing (e.g., Swift, font mappings) which we don’t want to be language-specific
Content locale/text language parameter (not UI locale): Potential for accuracy; make it optional, name it well
Ok to leave the locale on the constructor; benefit: more specific data loading even for existing dictionaries & models

My suggested path forward for this issue, then, is to add an options bag to the WordSegmenter, LineSegmenter, and SentenceSegmenter constructors with an optional content_locale field of type &LanguageIdentifier.

Apr 01 '24 22:04 sffc

I'm moving this back into 1.5 because the constructor can be drafted and bikeshed ahead of time, and then in 2.0 we can do the minimal change of making the new constructor the default one.

Apr 01 '24 23:04 sffc

Grapheme clusters should not be language-specific; baked into much low-level processing (e.g., Swift, font mappings) which we don’t want to be language-specific

This makes no sense and contradicts the long standing requests. ( https://unicode-org.atlassian.net/browse/CLDR-2992 which I am working on scheduling ) I would have joined, did not realize this was coming up today.

Perusing the notes it's not clear that the previous requirements and recent discussion from the segmentation summary last year were included here.

Apr 02 '24 02:04 srl295

Based on additional discussion in the email thread, I would like to move forward with the recommendation in https://github.com/unicode-org/icu4x/issues/3284#issuecomment-2030731133, with the additional understanding that we may add support for locale-based grapheme segmentation in the future if CLDR adds data for this, but it might take the form of another (fifth) segmenter type.

Concretely:

All segmenters retain a new or try_new function without an options bag
Word, Sentence, and Line segmenters get a try_new_with_options function that includes a content_locale option
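
A minimal sketch of what that could look like, assuming stand-in types: `LanguageIdentifier`, `SegmenterOptions`, `DataError`, and `SentenceSegmenter` below are written for this illustration and are not the ICU4X definitions. Only the names `try_new`, `try_new_with_options`, and `content_locale` come from the recommendation.

```rust
/// Stand-in for `LanguageIdentifier`; hypothetical, not the ICU4X type.
#[derive(Debug, Clone, PartialEq)]
pub struct LanguageIdentifier(pub String);

/// Options bag carrying the optional content locale (text language).
#[derive(Debug, Default)]
pub struct SegmenterOptions<'a> {
    pub content_locale: Option<&'a LanguageIdentifier>,
}

/// Placeholder error type for this sketch.
#[derive(Debug, PartialEq)]
pub struct DataError;

/// Stand-in segmenter that only records which locale it was built for.
#[derive(Debug)]
pub struct SentenceSegmenter {
    content_locale: Option<LanguageIdentifier>,
}

impl SentenceSegmenter {
    /// Constructor without an options bag: no content locale.
    pub fn try_new() -> Result<Self, DataError> {
        Ok(Self { content_locale: None })
    }

    /// Constructor with an options bag; a real implementation would load
    /// locale-specific data here, which this sketch does not model.
    pub fn try_new_with_options(options: SegmenterOptions<'_>) -> Result<Self, DataError> {
        Ok(Self {
            content_locale: options.content_locale.cloned(),
        })
    }

    /// The content locale this segmenter was built with, if any.
    pub fn content_locale(&self) -> Option<&LanguageIdentifier> {
        self.content_locale.as_ref()
    }
}
```

Because the options bag implements `Default`, passing `SegmenterOptions::default()` behaves like `try_new()`, and further options could be added later without another constructor.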

May 16 '24 23:05 sffc

Looking at the ICU4C brkitr rule files for word and sentence, the UAX #29 property assignments differ between locales, but the rules themselves seem to be the same. So if we modify datagen (with a few changes to the TOML data file), we can generate rule data per locale.

Jun 27 '24 01:06 makotokato
