icu4x Segmenter

Segmenter

Open zbraniecki opened this issue 4 years ago • 16 comments

For Segmenter, seems like the first crate we could consider incorporating is https://github.com/makotokato/uax14_rs

@sffc @Manishearth @makotokato - would it make sense to consider it for ICU4X?

May 29 '20 19:05 zbraniecki

Yes. We might want to consider unicode_segmentation for non-line segmentation.

Neither of these are locale aware, however.

May 29 '20 22:05 Manishearth

Intl.Segmenter is in scope for 402, and I think it makes sense to include here in ICU4X as well. I think we should build the segmenter based on the unicode properties API that ICU4X provides.

Jun 02 '20 03:06 sffc

Adding to backlog along with Normalization (#40).

Jun 04 '20 18:06 sffc

@methyl compiled ICU4C Segmenter to WebAssembly as a polyfill for Intl.Segmenter:

https://github.com/surferseo/intl-segmenter-polyfill

According to https://github.com/tc39/proposal-intl-segmenter/issues/118#issuecomment-645238037, the .wasm file is ~350 KB gzipped, with the Thai dictionary but not the other dictionaries.

Just posting this here as some benchmarks to consider for the Rust segmenter.

Jun 17 '20 10:06 sffc

@sffc thanks for mention, if you have any questions about the implementation, go ahead. I also explored compiling https://github.com/google/rust_icu to WASM but it was not straightforward and I gave up.

Jun 18 '20 09:06 methyl

Which segmentation functionality exactly? There are several types desirable:

line breaking is covered by the mentioned crate; also by https://crates.io/crates/xi-unicode (code)
word breaking, useful e.g. to select a whole word; this is covered by https://crates.io/crates/unicode-segmentation
word advance forward/back (result of Ctrl+Left/Right in a text editor); I believe this is not yet covered (https://github.com/unicode-rs/unicode-segmentation/issues/83)

cc @raphlinus

Sep 05 '20 19:09 dhardy

Re line segmentation, I have no particular ego invested, but it is true as @dhardy points out that xi-unicode has a solution to this problem. It's a very optimized, hand-tuned solution. I haven't benchmarked it against uax14_rs, but did benchmark it against ICU a long time ago and it was considerably faster (somewhere in the 2x-3x range, I don't remember the details). I would be pleased and honored if it were adapted for use in icu4x, but relieved if not; I'm busy enough as it is.

There's lots more to line breaks than a UAX 14 compliant break iterator. There's all the southeast Asian dictionary-based breaking stuff, plus other desirable "tailoring." For example, Android has a little parser for url and email addresses, and applies different breaking rules for those, because the default UAX 14 behavior is pretty bad in those use cases. I should also point out that ICU has some really complex additional behavior around numbers and ASCII symbols, not covered by UAX 14. I evaluated it when working on xi-unicode and found it not worthwhile, but others with different use cases may feel differently.

So whoever adopts this will have their hands full, if the goal is a world-class solution.

The Druid text stack currently does nothing sophisticated wrt word advance, but is likely a customer for WordCursor functionality.

Sep 05 '20 19:09 raphlinus

I'll also add that xi-unicode goes to Unicode 10, while uax14_rs seems to be up to date with Unicode 13. That's another consideration, and makes apples-to-apples comparison harder, as Unicode 13 is substantially more complex.

Lastly, xi-unicode has support for break iteration in non-contiguous strings, to support the needs at the time of xi-editor. I think that's probably worth removing going forward; my current thinking is that even when dealing with very large texts, it makes sense to use a contiguous string representation at the paragraph granularity and below. The maintenance burden of keeping the noncontiguous representation up to date is nontrivial.

Sep 05 '20 20:09 raphlinus

One of the main value propositions of ICU4X is to provide Unicode and CLDR data (#217 is a thread specifically about Unicode character property data). In cases where functionality is out of scope for ICU4X, I hope that downstream crates can depend on the ICU4X data provider to get their character properties.

For segmentation in particular, as others have mentioned, there is already a strong ecosystem in Rust. I don't think we should reinvent the wheel in ICU4X.

There is one area where I'm not aware of strong ecosystem support, which is dictionary-based segmentation (important in many East and Southeast Asian languages). I'm hosting an intern this fall who I hope will be able to build a segmenter for these languages.

Sep 06 '20 05:09 sffc

Hi all, I have a document for implementing segmenters in icu4x. Feedback very welcome. https://docs.google.com/document/d/1ojrOdIchyIHYbg2G9APX8j2p0XtmVLj0f9jPIbFYVUE/edit

cc @jfkthame

Feb 22 '21 19:02 aethanyc

CC @nathanhammond

Feb 22 '21 20:02 sffc

@aethanyc 👋

I'm intending to review in close detail your document. I'm a Cantonese speaker and have implemented a bad trie version of Cantonese segmentation for JS (using the 402 API) which is still immensely better than ICU's BreakIterator with the default dictionary.

Apropos of literally nothing in your document, since I've only skimmed it at this point, here is a thread where I've been discussing segmentation API limitations for Cantonese:

https://github.com/tc39/proposal-intl-segmenter/issues/133

I'm looking forward to learning from your research!

Feb 28 '21 19:02 nathanhammond

Implementation feedback from @aheninger:

Here are a few followup comments to the ICU4X deep dive discussion of April 21.

Testing. For evaluating uax14_rs (or other new implementations), the Unicode segmentation test data at https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/LineBreakTest.txt is very useful. Writing code to parse these text files and drive a test is pretty simple, and will give a quick read on whether an implementation is basically on track.

Little was said about dictionary based word segmentation. Does ICU4X have a strategy separate from classic ICU? How does the LSTM based thing Frank is working on fit in?

In ICU, the transitions between text regions requiring rule based and dictionary based breaking is an area with many unaddressed bugs. Something to keep in mind while doing a new implementation

Sequential vs. Parallel application of rules, and the speed vs. complexity tradeoff. While slower, if an engine based on sequential application of the rules turned out to be good enough, the rule maintenance problem would be greatly simplified. I think this would be true even if adding a rule customization involved writing code, or running a tool that produced some code.

A good performing sequential-rule engine would probably want to be based on rules that describe how to move from one boundary to the next, rather than on UAX style rules that describe how to test an arbitrary text position for being a boundary. But the transformation between the two styles is much easier than the sequential to parallel transformation involved in maintaining the ICU rules.

Monkey Testing. In ICU, this has proved to be by far the most effective way to find obscure corner case bugs in the rules. It relies on having a reference implementation of the defining UAX rules that closely follows the UAX breaking algorithm, and is used to generate expected results for random test data.

The random test text sequences are not completely random. Start with a list of all of the character classes that appear in the rules. When building a test string, for each character to be added, first pick a random character class, then pick a random character from within the class.

I would be very hesitant to release a segmentation implementation without having monkey-tested it. Unfortunately, a good monkey test is quite a bit more work to write than a test driver for the Unicode test data, #1 above. Do #1 first.

May 04 '21 20:05 sffc

See some benchmark results from @aethanyc in https://github.com/unicode-org/icu4x/issues/707.

May 07 '21 17:05 sffc

I did some quick informal performance measurements of ICU4C vs uax14_rs on English text, and came up with similar results to those reported above, in #707. ICU is slower, running at around 60% the speed of uax14_rs.

My suspicion is that part of the difference comes from a ICU's caching layer. ICU's API supports forwards and backwards iteration, and testing arbitrary string positions, all from the same iterator. Caching helps some use cases, but not plain forward iteration. The work per character for forward iteration of the underlying engine looks like it should be comparable for ICU and uax14_rs packages. All this is just hunch - without some real profiling work, I don't really know and could easily be wrong.

As @dhardy and @raphlinus noted earlier, there is also xi-unicode/rust/unicode. The approach it takes is similar to that of uax14_rs - I wonder if there is some common ancestry between the two

In both xi-uniocde and uax14_rs, lists of states and character properties are hard coded; you can't make a word or grapheme cluster iterator by just plugging in different rules.

Also, both xi-editor and uax14_rs include tests to check the Unicode test file LineBreakTest.txt. uax14_rs is skipping tests that include the line break character classes OP, NU, PO and PR. For uax14_rs I confirmed that the test fails if these cases are added back in.

May 18 '21 00:05 aheninger

This will be closed after segmenter moves to components in #2259.

Jul 28 '22 17:07 sffc

Segmenter has been moved to components and is being released in 1.2. Closing this tracking issue.

Apr 13 '23 22:04 sffc

icu4x icu4x copied to clipboard

Segmenter

icu4x
icu4x copied to clipboard