Intl.Segmenter with URLs, email addresses, and acronyms
Consider the following string of text:
Please P.T.O 123 times. 123.3242432 at 12:43 PM on 21/03/2013. or example@gmail.com. www.google.com and she's there and well-played.
What segments should we produce?
ICU currently produces the following segments (a sketch for reproducing them follows the list):
Please
P.T.O
123
times
123.3242432
at
12
43
PM
on
21
03
2013
or
example
gmail.com
www.google.com
and
she's
there
and
well
played
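For reference, output like the list above can be produced with a minimal Intl.Segmenter loop, as in the sketch below; the exact segments vary by engine, which is what this issue is about.

const input = "Please P.T.O 123 times. 123.3242432 at 12:43 PM on 21/03/2013. or example@gmail.com. www.google.com and she's there and well-played.";
const segmenter = new Intl.Segmenter("en", { granularity: "word" });
for (const { segment, isWordLike } of segmenter.segment(input)) {
  // isWordLike is true for word segments and false for spaces and punctuation,
  // so this prints only the word-like segments, as in the lists in this thread.
  if (isWordLike) console.log(segment);
}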
But V8 splits the acronym, URL, and email address into separate tokens.
Questions:
- Should this be implementation-dependent behavior, or should it go into the spec?
- Which behavior should be the default?
- Should this be an API option?
See https://bugs.chromium.org/p/chromium/issues/detail?id=1301830
FYI @makotokato @aethanyc @jfkthame
I think this should go upstream into CLDR, perhaps as API-switchable behavior.
ECMA-402 text segmentation boundary determination is intentionally implementation-dependent, as documented at https://tc39.es/ecma402/#annex-implementation-dependent-behaviour and https://tc39.es/ecma402/#sec-findboundary .
Boundary determination is implementation-dependent, but general default algorithms are specified in Unicode Standard Annex 29 (available at https://www.unicode.org/reports/tr29/). It is recommended that implementations use locale-sensitive tailorings such as those provided by the Common Locale Data Repository (available at http://cldr.unicode.org/).
It may also be worth noting that the default word boundary rules of UAX 29 don't allow breaking apart "P.T.O", "gmail.com", or "www.google.com", per WB6 and WB7; V8's deviation from those defaults is precisely the sort of behavior the implementation-dependence clause quoted above is intended to permit (although I personally would recommend against this particular deviation).
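For reference, the UAX #29 defaults in question, where "×" means "do not break here":

WB6: AHLetter × (MidLetter | MidNumLetQ) AHLetter
WB7: AHLetter (MidLetter | MidNumLetQ) × AHLetter

FULL STOP is in MidNumLet, so "P.T.O" and "gmail.com" fall under these rules and remain unbroken by default.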
I would also discourage getting stuck in the morass of possible API switches, unless it is something as general (and advanced) as explicit specification of boundary rules in a form similar to that of UAX 29.
The use case comes from a Google team that is implementing a spell checker via word segmentation.
We want to detect URLs as a whole so that we can refrain from spellchecking them and giving erratic spelling suggestions to users.
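A minimal sketch of that use case, assuming the segmenter keeps URLs whole; spellcheck() is a hypothetical application function:

const text = "Visit www.google.com and she's there.";
const seg = new Intl.Segmenter("en", { granularity: "word" });
for (const { segment, isWordLike } of seg.segment(text)) {
  // Crude filter: treat any dotted token as URL-ish and skip it.
  // This only works if "www.google.com" comes back as one segment.
  if (isWordLike && !segment.includes(".")) {
    spellcheck(segment); // hypothetical
  }
}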
It seems like we could form some use cases for word segmentation:
- When you want real words (e.g., spell checker)
- When you want smaller tokens (e.g., cursor positioning)
I think it would be reasonable to spell that out in the spec. Pick one as the default behavior and make the other one configurable with an API setting.
For the record, the ICU4X word segmenter produces the same output as ICU.
Please
P.T.O
123
times
123.3242432
at
12
43
PM
on
21
03
2013
or
example
gmail.com
www.google.com
and
she's
there
and
well
played
The code that generates the result is below. (Segments that start with punctuation are removed.)
let s = "Please P.T.O 123 times. 123.3242432 at 12:43 PM on 21/03/2013. or example@gmail.com. www.google.com and she's there and well-played.";
let provider = RuleBreakDataProvider;
let segmenter = WordBreakSegmenter::try_new(&provider).expect("Data exists");
// segment_str yields breakpoint indices; each adjacent pair delimits one segment.
let breakpoints: Vec<usize> = segmenter.segment_str(s).collect();
for w in breakpoints.windows(2) {
    let begin = w[0];
    let end = w[1];
    // Skip segments that start with whitespace or punctuation.
    if !s[begin..end].starts_with(&[' ', ':', '@', '.', '/', '-'][..]) {
        println!("{}", &s[begin..end]);
    }
}
> I think it would be reasonable to spell that out in the spec. Pick one as the default behavior and make the other one configurable with an API setting.
The problem is that there are at least a dozen similar edge cases—domain names, email addresses, IP addresses, ~~URIs~~ IRIs, hashtags, @-references, Markdown/bbcode/etc. formatting, Latin abbreviations, hyphenated compounds, dates/times/datetimes, emoticons/kaomoji, etc.
"Long" vs. "short" seems too coarse, and a collection of fine-tuning options seems even worse.
I agree that we don't want an explosion of options for each edge case. I think we could look at use cases, though, and center options around those use cases. We already have "word", "sentence", and "grapheme" (and perhaps "line" at some point); we could add a new one called "token", for example, and say that "word" should be full words only, and "token" can produce smaller tokens.
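To make that concrete, a hypothetical (not standardized) shape for such an option:

// Hypothetical sketch only; "token" is not a granularity in ECMA-402 today.
const words  = new Intl.Segmenter("en", { granularity: "word" });  // full words: "www.google.com"
const tokens = new Intl.Segmenter("en", { granularity: "token" }); // smaller tokens: "www", "google", "com"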
That seems like the kind of thing that should be pushed for in Unicode so that ECMA-402 can adopt it as a downstream consumer.
Are some of these things that some kind of independent recognizer should find, i.e., URLs, email addresses, hashtags, etc.? Maybe you want a preprocessing operation that detects such tokens and replaces them before segmentation. Then you could add BTC hashes, git hashes, or anything new that comes along.
Markdown and bbcode would fit in this category also, and wouldn't make sense for a general-purpose plain-text segmenter.
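A minimal sketch of that preprocessing idea; the regex is deliberately naive (real URL/email recognition is much harder), and segmentAroundRecognized is a hypothetical helper:

// Naive recognizer for emails and URL-ish tokens; illustration only.
const RECOGNIZER = /https?:\/\/\S+|[\w.+-]+@[\w.-]+\.\w+|www\.\S+/g;
function segmentAroundRecognized(text) {
  const protectedSpans = [...text.matchAll(RECOGNIZER)]
    .map(m => [m.index, m.index + m[0].length]);
  const seg = new Intl.Segmenter("en", { granularity: "word" });
  // Keep only segments that fall outside recognized spans; the caller
  // treats each recognized span as a single, unsegmented token.
  return [...seg.segment(text)].filter(({ index, segment }) =>
    !protectedSpans.some(([a, b]) => index >= a && index + segment.length <= b));
}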
From a web compatibility perspective, it would be bad if a product had a really good spell checker implementation when backed by V8 due to custom Intl.Segmenter changes, but then other engines had a bad spell checker experience due to differences in platform implementation.
@aethanyc my understanding was that lwbrk (Gecko's current layout segmenter) does special things for web-compat segmentation (the @ sign, ://, etc.). I'm not sure whether it's specific to CJK or applies across the board. In the original pitch, we talked about upstreaming those customizations to the ICU4X Segmenter, either in CLDR or as a "web-mode" overlay. Can you comment on that?
TG2 discussion: https://github.com/tc39/ecma402/blob/master/meetings/notes-2022-03-17.md#intlsegmenter-with-urls-email-addresses-and-acronyms
Conclusion:
- FYT to investigate the V8 change
- Come back to this group if necessary
We would also like to push for an improvement in https://www.unicode.org/reports/tr29/#Word_Boundary_Rules that provides an example above WB6 similar to the one above WB8, but don't know how to pursue that.
-Do not break letters across certain punctuation.
+Do not break letters across certain punctuation, such as within “e.g” or “example.com”.
Another TC39-TG2 discussion: https://github.com/tc39/ecma402/blob/master/meetings/notes-2022-06-16.md#intlsegmenter-with-urls-email-addresses-and-acronyms-656
We agree that this is a bug in Chrome, but there could be phrasing improvements in UTS-35 as well as additional tests in Test262.
I wrapped up some basic tests in https://github.com/tc39/test262/pull/3577/files
I found several potential issues due to the lack of specific word-segmentation rules, mostly with email addresses; some results from different browsers and from ICU differed from what I expected as a user.
Example:
"my@mail.com sending message to my@mail.org"
[Log] my
[Log] @
[Log] mail.com
[Log]
[Log] sending
[Log]
[Log] message
[Log]
[Log] to
[Log]
[Log] my
[Log] @
[Log] mail.org
What are the next steps? IMHO, we should extend the recommendations on how to handle this kind of segmentation/word boundary determination so that they cover at least the popular use cases, like email addresses.
@macchiati says to fill in the form at https://corp.unicode.org/reporting/error.html with suggestions.
He says that it may be useful to investigate the @ symbol. He says:
BTW, @ might be a fairly clear case. In the Olden Days, you'd only use @ in cases like "3 pieces @ $15 each", but nowadays by far the most prominent cases are in emails or tags (@sffc). So we could consider proposing that something like the following shouldn't break in word segmentation.
Letter Digit* @ and @ Letter
(off the top of my head; would have to look at the details)
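In rough UAX #29 notation, one unofficial reading of that sketch (a speculative illustration, not a proposal):

AHLetter Numeric* × [@]    (keeps "user3@…" together)
[@] × AHLetter             (keeps "@sffc" together)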
Additional discussion on 2022-07-07: https://github.com/tc39/ecma402/blob/master/meetings/notes-2022-07-07.md#intlsegmenter-with-urls-email-addresses-and-acronyms
There's a lot to unpack in that conversation. I'm OOO, but one point to add is that UAX #29 and UAX #14 both specify precise default algorithms, but they do (and need to) allow for a great deal of customization for different environments, different languages, etc.
We should discuss what could be done to help with what TC39 is doing, perhaps with some explicit profiles or some other mechanism.
An additional point: the ICU4X segmenter architecture diverged from ICU4C as a result of design decisions driven by Gecko's layout segmentation needs, to allow for runtime switches and overlays that enable cheap context switching mid-segmentation. This is something that requires a build-time dictionary/rules rebuild in ICU4C, but that ICU4X can do on the fly.
@sffc after being so pushy during the monthly meeting, I was scrolling through tests today with @romulocintra and realized that we have tests for locale-specific behavior, for example the output of a NumberFormat#format operation for scientific notation in German.
This challenges my assumption that we should only test specified behavior in test262, and based on that precedent, I'll be quite alright with including the tests in test262. Apologies for being so assertive about this matter without finishing my research.
CLDR issue: https://unicode-org.atlassian.net/browse/CLDR-15767
Another: https://unicode-org.atlassian.net/browse/CLDR-15839