Support adjusting break iterator position by an offset

Open aethanyc opened this issue 2 years ago • 5 comments

Per discussion in https://github.com/unicode-org/icu4x/discussions/3231, the iterator [^1] created by segmenters should probably have APIs to move the current iterator position to the breakpoint preceding or following a given offset. ICU4C has the APIs ubrk_preceding and ubrk_following for this purpose.

These APIs can help JavaScript engines implement Segments.prototype.containing() [^2]. Here is the V8 Segments.prototype.containing() for reference.

cc @sffc @makotokato

[^1]: RuleBreakIterator and LineBreakIterator

[^2]: Spec https://tc39.es/ecma402/#sec-%segmentsprototype%.containing
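To make the requested semantics concrete, here is a minimal sketch of what "preceding" and "following" mean, assuming the breakpoints for a text have already been collected into a sorted slice. The function names and the slice-based signature are hypothetical and are not ICU4X APIs; a real implementation would work on the iterator state rather than on a precomputed list. The strict comparisons mirror ubrk_following and ubrk_preceding, which return a boundary strictly after or strictly before the offset.

```rust
/// Hypothetical helper, not part of ICU4X.
/// `breaks` is assumed to be a sorted list of breakpoint offsets.
/// Returns the first breakpoint strictly greater than `offset`, if any.
fn following(breaks: &[usize], offset: usize) -> Option<usize> {
    // Index of the first breakpoint that is not <= offset.
    let i = breaks.partition_point(|&b| b <= offset);
    breaks.get(i).copied()
}

/// Hypothetical helper, not part of ICU4X.
/// Returns the last breakpoint strictly less than `offset`, if any.
fn preceding(breaks: &[usize], offset: usize) -> Option<usize> {
    // Number of breakpoints that are < offset; the last of them is the answer.
    let i = breaks.partition_point(|&b| b < offset);
    i.checked_sub(1).map(|j| breaks[j])
}
```

With breakpoints at 0, 5, 6 and 11, the segment containing index 7 starts at the breakpoint preceding index 8 and ends at the breakpoint following index 7, so containing() reduces to one call of each kind.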

aethanyc avatar Apr 04 '23 02:04 aethanyc

@eggrobin has some thoughts on this.

sffc avatar May 23 '23 15:05 sffc

@macchiati and @markusicu will have more context, but ICU4[CJ] has to use a very different state machine in order to provide efficient random access segmentation.

See, e.g., https://github.com/unicode-org/icu/blob/main/icu4c/source/data/brkitr/rules/line.txt.

eggrobin avatar May 23 '23 15:05 eggrobin

This month @anba landed Intl.Segmenter in Firefox based on the ICU4X Segmenter impl, reviewed by @dminor

https://phabricator.services.mozilla.com/D195803

I had been under the impression that Intl.Segmenter was not implementable without support for random access in order to implement the containing() function. It looks like @anba's implementation loops from the start of the string and repeatedly calls next() until we reach the index. While this strategy gets the job done, I'm concerned about the performance of this with large strings where we need to reach an index deep into the string. I therefore hope that we can continue to prioritize this issue on the basis of 402 compatibility.
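For illustration, the loop-from-the-start strategy described above can be sketched roughly as follows. This is not @anba's code and does not use the ICU4X API: the `containing` function is hypothetical, and it assumes the breakpoints arrive as any ascending iterator of offsets whose first item is 0, which stands in for repeated next() calls on a break iterator.

```rust
use std::ops::Range;

/// Hypothetical sketch of a linear-scan `containing()`.
/// `breaks` is assumed to yield ascending breakpoint offsets, starting at 0.
/// Returns the segment `start..end` with `start <= index < end`, or `None`
/// if `index` lies at or beyond the last breakpoint.
fn containing<I>(breaks: I, index: usize) -> Option<Range<usize>>
where
    I: Iterator<Item = usize>,
{
    // Most recent breakpoint that is <= index, if one has been seen.
    let mut start: Option<usize> = None;
    for b in breaks {
        if b > index {
            // First breakpoint past the index closes the segment.
            return start.map(|s| s..b);
        }
        start = Some(b);
    }
    None
}
```

The cost is proportional to the number of breakpoints before the index, which is the performance concern with large strings: every lookup deep in the text walks all earlier segments again.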

sffc avatar Dec 14 '23 20:12 sffc