ICU-22789 Add Segmenter API to conveniently wrap BreakIterator in ICU4J

Open echeran opened this issue 1 year ago • 1 comments

To modernize the BreakIterator API, this PR introduces a wrapper with a more convenient API design built around a Segmenter interface.

A few of the goals that motivate the new Segmenter API:

  • Use Java 8 features, in particular the Stream API, which supports a functional programming style
  • Create instances that are immutable (this reduces the complexity that comes from statefulness and lets user code be more referentially transparent)
  • Create a wrapper class around the iteration. This decouples iterating over a source string from constructing the BreakIterator, so that iteration over one string is isolated from iteration over other strings
  • Use interfaces to properly decouple and abstract. APIs built on top of interfaces can allow user-created implementations to participate in such higher level APIs.
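
To make the goals above concrete, here is a minimal sketch of the shape such a wrapper could take. It is not the code in this PR: it uses the JDK's java.text.BreakIterator so that it runs without an ICU4J dependency, and the WordSegmenter class and the subSequences() method name are assumptions made for illustration only.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.stream.Stream;

// Hypothetical interfaces for illustration; the PR's real signatures may differ.
interface Segments {
    // The pieces of one source string, exposed as a Java 8 Stream.
    Stream<CharSequence> subSequences();
}

interface Segmenter {
    // Binds the segmenter's configuration to one source string.
    Segments segment(CharSequence source);
}

// Immutable: holds only configuration, never a mutable BreakIterator.
final class WordSegmenter implements Segmenter {
    private final Locale locale;

    WordSegmenter(Locale locale) {
        this.locale = locale;
    }

    @Override
    public Segments segment(CharSequence source) {
        final String text = source.toString();
        // Segments has a single abstract method, so a lambda can implement it.
        return () -> {
            // A fresh iterator per traversal keeps each string's iteration isolated.
            BreakIterator it = BreakIterator.getWordInstance(locale);
            it.setText(text);
            List<CharSequence> pieces = new ArrayList<>();
            int start = it.first();
            for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
                pieces.add(text.substring(start, end));
            }
            return pieces.stream();
        };
    }
}
```

With this sketch, `new WordSegmenter(Locale.ENGLISH).segment("Hello world").subSequences()` yields the pieces "Hello", " ", and "world". Because the segmenter keeps no iteration state, one instance can be reused across strings, and any other class implementing the same two interfaces can be used wherever a Segmenter is expected.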

More details in the design doc.

This PR will focus on the ICU4J side of the work.

Checklist

  • [X] Required: Issue filed: https://unicode-org.atlassian.net/browse/ICU-22789
  • [X] Required: The PR title must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • [X] Required: The PR description must include the link to the Jira Issue, for example by completing the URL in the first checklist item
  • [X] Required: Each commit message must be prefixed with a JIRA Issue number. Example: "ICU-1234 Fix xyz"
  • [X] Issue accepted (done by Technical Committee after discussion)
  • [X] Tests included, if applicable
  • [ ] API docs and/or User Guide docs changed or added, if applicable

echeran avatar Oct 08 '24 23:10 echeran

Looks great! One quibble.

Actually, one other observation. As things stand, Segmenter and its subclasses don't do much: they just create Segments objects, which wrap BreakIterator objects that do all the work. The API is a lot cleaner and clearer, but the implementation isn't. I assume the plan is to move at some point to an implementation where the Segmenter actually owns the state and category tables and the Segments object just handles iteration over a particular string? (I'm not saying you need to do this now; just clarifying that that's in the plan.)

richgillam avatar Apr 17 '25 20:04 richgillam

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot