Intl.Segmenter with URLs, email addresses, and acronyms
Consider the following string of text:
Please P.T.O 123 times. 123.3242432 at 12:43 PM on 21/03/2013. or example@gmail.com. www.google.com and she's there and well-played.
What segments should we produce?
ICU currently produces the following segments (a sketch for reproducing them follows the list):
Please
P.T.O
123
times
123.3242432
at
12
43
PM
on
21
03
2013
or
example
gmail.com
www.google.com
and
she's
there
and
well
played
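For reference, output like the list above can be produced with a minimal Intl.Segmenter loop, as in the sketch below; the exact segments vary by engine, which is what this issue is about.

const input = "Please P.T.O 123 times. 123.3242432 at 12:43 PM on 21/03/2013. or example@gmail.com. www.google.com and she's there and well-played.";
const segmenter = new Intl.Segmenter("en", { granularity: "word" });
for (const { segment, isWordLike } of segmenter.segment(input)) {
  // isWordLike is true for word segments and false for spaces and punctuation,
  // so this prints only the word-like segments, as in the lists in this thread.
  if (isWordLike) console.log(segment);
}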
But V8 splits the acronym, URL, and email address into separate tokens.
Questions:
- Should this be implementation-dependent behavior, or should it go into the spec?
- Which behavior should be the default?
- Should this be an API option?
See https://bugs.chromium.org/p/chromium/issues/detail?id=1301830
FYI @makotokato @aethanyc @jfkthame
I think this should go upstream into CLDR, perhaps as API-switchable behavior.
ECMA-402 text segmentation boundary determination is intentionally implementation-dependent, as documented at https://tc39.es/ecma402/#annex-implementation-dependent-behaviour and https://tc39.es/ecma402/#sec-findboundary .
Boundary determination is implementation-dependent, but general default algorithms are specified in Unicode Standard Annex 29 (available at https://www.unicode.org/reports/tr29/). It is recommended that implementations use locale-sensitive tailorings such as those provided by the Common Locale Data Repository (available at http://cldr.unicode.org/).
It may also be worth noting that the default word boundary rules of UAX 29 don't allow breaking apart "P.T.O", "gmail.com", or "www.google.com", per WB6 and WB7; V8's deviation from those defaults is precisely the sort of behavior the implementation-dependence clause quoted above is intended to permit (although I personally would recommend against this particular deviation).
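For reference, the UAX #29 defaults in question, where "×" means "do not break here":

WB6: AHLetter × (MidLetter | MidNumLetQ) AHLetter
WB7: AHLetter (MidLetter | MidNumLetQ) × AHLetter

FULL STOP is in MidNumLet, so "P.T.O" and "gmail.com" fall under these rules and remain unbroken by default.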
I would also discourage getting stuck in the morass of possible API switches, unless it is something as general (and advanced) as explicit specification of boundary rules in a form similar to that of UAX 29.
The use case comes from a Google team that is implementing a spell checker via word segmentation.
We want to detect URLs as a whole so that we can refrain from spellchecking them and giving erratic spelling suggestions to users.
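A minimal sketch of that use case, assuming the segmenter keeps URLs whole; spellcheck() is a hypothetical application function:

const text = "Visit www.google.com and she's there.";
const seg = new Intl.Segmenter("en", { granularity: "word" });
for (const { segment, isWordLike } of seg.segment(text)) {
  // Crude filter: treat any dotted token as URL-ish and skip it.
  // This only works if "www.google.com" comes back as one segment.
  if (isWordLike && !segment.includes(".")) {
    spellcheck(segment); // hypothetical
  }
}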
It seems like we could form some use cases for word segmentation:
- When you want real words (e.g., spell checker)
- When you want smaller tokens (e.g., cursor positioning)
I think it would be reasonable to spell that out in the spec. Pick one as the default behavior and make the other one configurable with an API setting.
For the record, the ICU4X word segmenter produces the same output as ICU.
Please
P.T.O
123
times
123.3242432
at
12
43
PM
on
21
03
2013
or
example
gmail.com
www.google.com
and
she's
there
and
well
played
The code that generates the result is below. (Segments that start with punctuation are removed.)
let s = "Please P.T.O 123 times. 123.3242432 at 12:43 PM on 21/03/2013. or example@gmail.com. www.google.com and she's there and well-played.";
let provider = RuleBreakDataProvider;
let segmenter = WordBreakSegmenter::try_new(&provider).expect("Data exists");
// segment_str yields breakpoint indices; each adjacent pair delimits one segment.
let breakpoints: Vec<usize> = segmenter.segment_str(s).collect();
for w in breakpoints.windows(2) {
    let begin = w[0];
    let end = w[1];
    // Skip segments that start with whitespace or punctuation.
    if !s[begin..end].starts_with(&[' ', ':', '@', '.', '/', '-'][..]) {
        println!("{}", &s[begin..end]);
    }
}
> I think it would be reasonable to spell that out in the spec. Pick one as the default behavior and make the other one configurable with an API setting.
The problem is that there are at least a dozen similar edge cases—domain names, email addresses, IP addresses, ~~URIs~~ IRIs, hashtags, @-references, Markdown/bbcode/etc. formatting, Latin abbreviations, hyphenated compounds, dates/times/datetimes, emoticons/kaomoji, etc.
"Long" vs. "short" seems too coarse, and a collection of fine-tuning options seems even worse.
I agree that we don't want an explosion of options for each edge case. I think we could look at use cases, though, and center options around those use cases. We already have "word", "sentence", and "grapheme" (and perhaps "line" at some point); we could add a new one called "token", for example, and say that "word" should be full words only, and "token" can produce smaller tokens.
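To make that concrete, a hypothetical (not standardized) shape for such an option:

// Hypothetical sketch only; "token" is not a granularity in ECMA-402 today.
const words  = new Intl.Segmenter("en", { granularity: "word" });  // full words: "www.google.com"
const tokens = new Intl.Segmenter("en", { granularity: "token" }); // smaller tokens: "www", "google", "com"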
That seems like the kind of thing that should be pushed for in Unicode so that ECMA-402 can adopt it as a downstream consumer.
Are some of these things that some kind of independent recognizer should find, i.e., URLs, email addresses, hashtags, etc.? Maybe you want a preprocessing operation that detects such tokens and replaces them before segmentation. Then you could add BTC hashes, git hashes, or anything new that comes along.
Markdown and bbcode would fit in this category also, and wouldn't make sense for a general-purpose plain-text segmenter.
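A minimal sketch of that preprocessing idea; the regex is deliberately naive (real URL/email recognition is much harder), and segmentAroundRecognized is a hypothetical helper:

// Naive recognizer for emails and URL-ish tokens; illustration only.
const RECOGNIZER = /https?:\/\/\S+|[\w.+-]+@[\w.-]+\.\w+|www\.\S+/g;
function segmentAroundRecognized(text) {
  const protectedSpans = [...text.matchAll(RECOGNIZER)]
    .map(m => [m.index, m.index + m[0].length]);
  const seg = new Intl.Segmenter("en", { granularity: "word" });
  // Keep only segments that fall outside recognized spans; the caller
  // treats each recognized span as a single, unsegmented token.
  return [...seg.segment(text)].filter(({ index, segment }) =>
    !protectedSpans.some(([a, b]) => index >= a && index + segment.length <= b));
}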
From a web compatibility perspective, it would be bad if a product had a really good spell checker implementation when backed by V8 due to custom Intl.Segmenter changes, but then other engines had a bad spell checker experience due to differences in platform implementation.
@aethanyc my understanding was that lwbrk (Gecko's current layout segmenter) does special things for web-compat segmentation (the @ sign, ://, etc.). I'm not sure whether it's specific to CJK or applies across the board. In the original pitch, we talked about upstreaming those customizations to the ICU4X Segmenter, either in CLDR or as a "web-mode" overlay. Can you comment on that?
TG2 discussion: https://github.com/tc39/ecma402/blob/master/meetings/notes-2022-03-17.md#intlsegmenter-with-urls-email-addresses-and-acronyms
Conclusion:
- FYT to investigate the V8 change
- Come back to this group if necessary
We would also like to push for an improvement in https://www.unicode.org/reports/tr29/#Word_Boundary_Rules that provides an example above WB6 similar to the one above WB8, but don't know how to pursue that.
-Do not break letters across certain punctuation.
+Do not break letters across certain punctuation, such as within “e.g” or “example.com”.
Another TC39-TG2 discussion: https://github.com/tc39/ecma402/blob/master/meetings/notes-2022-06-16.md#intlsegmenter-with-urls-email-addresses-and-acronyms-656
We agree that this is a bug in Chrome, but there could be phrasing improvements in UTS-35 as well as additional tests in Test262.
I wrapped up some basic tests in https://github.com/tc39/test262/pull/3577/files
I found several potential issues due to the lack of specific word-segmentation rules, mostly with email addresses; some results from different browsers and from ICU differed from what I expected as a user.
Example:
"my@mail.com sending message to my@mail.org"
[Log] my
[Log] @
[Log] mail.com
[Log]
[Log] sending
[Log]
[Log] message
[Log]
[Log] to
[Log]
[Log] my
[Log] @
[Log] mail.org
What are the next steps? IMHO, we should extend the recommendations on how to handle this kind of segmentation/word boundary determination so that they cover at least the popular use cases, like email addresses.
@macchiati says to fill in the form at https://corp.unicode.org/reporting/error.html with suggestions.
He says that it may be useful to investigate the @ symbol. He says:
BTW, @ might be a fairly clear case. In the Olden Days, you'd only use @ in cases like "3 pieces @ $15 each", but nowadays by far the most prominent cases are in emails or tags (@sffc). So we could consider proposing that something like the following shouldn't break in word segmentation.
Letter Digit* @ and @ Letter
(off the top of my head; would have to look at the details)
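In rough UAX #29 notation, one unofficial reading of that sketch (a speculative illustration, not a proposal):

AHLetter Numeric* × [@]    (keeps "user3@…" together)
[@] × AHLetter             (keeps "@sffc" together)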
Additional discussion on 2022-07-07: https://github.com/tc39/ecma402/blob/master/meetings/notes-2022-07-07.md#intlsegmenter-with-urls-email-addresses-and-acronyms
There's a lot to unpack in that conversation. I'm OOO, but one point to add is that UAX #29 and UAX #14 both specify precise default algorithms, but they do (and need to) allow for a great deal of customization for different environments, different languages, etc.
We should discuss what could be done to help with what TC39 is doing, perhaps with some explicit profiles or some other mechanism.
An additional point: the ICU4X segmenter architecture diverged from ICU4C as a result of design decisions driven by Gecko's layout segmentation needs, to allow for runtime switches and overlays that enable cheap context switching mid-segmentation. This is something that requires a build-time dictionary/rules rebuild in ICU4C, but that ICU4X can do on the fly.
@sffc after being so pushy during the monthly meeting, I was scrolling through tests today with @romulocintra and realized that we have tests for locale-specific behavior, for example the output of a NumberFormat#format operation for scientific notation in German.
This challenges my assumption that we should only test specified behavior in test262, and based on that precedent, I'll be quite alright with including the tests in test262. Apologies for being so assertive about this matter without finishing my research.
CLDR issue: https://unicode-org.atlassian.net/browse/CLDR-15767
Another: https://unicode-org.atlassian.net/browse/CLDR-15839