Support adjusting break iterator position by an offset
Per discussion in https://github.com/unicode-org/icu4x/discussions/3231, the iterator [^1] created by segmenters probably should have APIs to adjust the current iterator position to the breakpoint preceding or following a given offset. ICU4C has the APIs ubrk_preceding and ubrk_following for this purpose.
These APIs can help JavaScript engines implement Segments.prototype.containing() [^2]. Here is the V8 implementation of Segments.prototype.containing() for reference.
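To make the requested semantics concrete, here is a minimal sketch in Rust. It is not ICU4X API: `following` and `preceding` are hypothetical helper names, and the sketch assumes the breakpoints have already been collected into a sorted slice, which a real implementation would want to avoid. The strict inequalities mirror ubrk_following and ubrk_preceding, which return a boundary strictly after or strictly before the given offset.

```rust
/// Hypothetical helper (not part of ICU4X): returns the first breakpoint
/// strictly greater than `offset`, or `None` if there is none.
/// Assumes `breaks` is sorted in ascending order.
pub fn following(breaks: &[usize], offset: usize) -> Option<usize> {
    // Index of the first breakpoint that is > offset.
    let i = breaks.partition_point(|&b| b <= offset);
    breaks.get(i).copied()
}

/// Hypothetical helper (not part of ICU4X): returns the last breakpoint
/// strictly less than `offset`, or `None` if there is none.
/// Assumes `breaks` is sorted in ascending order.
pub fn preceding(breaks: &[usize], offset: usize) -> Option<usize> {
    // Number of breakpoints that are < offset.
    let i = breaks.partition_point(|&b| b < offset);
    i.checked_sub(1).map(|j| breaks[j])
}
```

With the breakpoints `[0, 5, 6, 11]`, an offset of 5 has 6 as its following breakpoint and 0 as its preceding breakpoint, since a breakpoint equal to the offset is skipped in both directions.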
cc @sffc @makotokato
[^1]: RuleBreakIterator and LineBreakIterator
[^2]: Spec https://tc39.es/ecma402/#sec-%segmentsprototype%.containing
@eggrobin has some thoughts on this.
@macchiati and @markusicu will have more context, but ICU4[CJ] has to use a very different state machine in order to provide efficient random access segmentation.
See, e.g., https://github.com/unicode-org/icu/blob/main/icu4c/source/data/brkitr/rules/line.txt.
This month @anba landed Intl.Segmenter in Firefox based on the ICU4X Segmenter impl, reviewed by @dminor
https://phabricator.services.mozilla.com/D195803
I had been under the impression that Intl.Segmenter was not implementable without support for random access in order to implement the containing() function. It looks like @anba's implementation loops from the start of the string and repeatedly calls next() until we reach the index. While this strategy gets the job done, I'm concerned about its performance on large strings when the index lies deep in the string, since each call costs time proportional to the index. I therefore hope that we can continue to prioritize this issue on the basis of ECMA-402 compatibility.
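For illustration, the linear-scan strategy described above can be sketched roughly as follows. This is not the Firefox code: `containing_by_scan` is a hypothetical name, and the sketch assumes the break iterator yields ascending offsets starting with 0 and ending with the string length.

```rust
/// Hypothetical sketch of containing() via a scan from the start.
/// `breaks` stands in for a segmenter's break iterator and is assumed to
/// yield ascending offsets, beginning with 0 and ending with the text length.
/// Returns the `(start, end)` offsets of the segment containing `index`,
/// or `None` if `index` is at or past the last breakpoint.
pub fn containing_by_scan<I>(breaks: I, index: usize) -> Option<(usize, usize)>
where
    I: IntoIterator<Item = usize>,
{
    // Most recent breakpoint seen that is <= index.
    let mut start: Option<usize> = None;
    for b in breaks {
        if b > index {
            // First breakpoint past `index` closes the segment.
            return start.map(|s| (s, b));
        }
        start = Some(b);
    }
    None
}
```

Every call walks all breakpoints up to the index, so the cost grows linearly with the index. An API for seeking to the breakpoint preceding or following an offset would let the engine avoid the scan.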