icu4x icon indicating copy to clipboard operation
icu4x copied to clipboard

Move Segmenter to Components

Open sffc opened this issue 1 year ago • 3 comments

Checklist for Segmenter in Components:

  • [ ] Apply the boilerplate
    • [x] Ensure everything is documented by adding back #[warn(missing_docs)] (#2366)
    • [ ] Add Clippy lints
      • [ ] https://github.com/unicode-org/icu4x/pull/2325
      • [x] https://github.com/unicode-org/icu4x/pull/2566
  • [x] Check the public API surface and fix issues:
    • [x] Don't re-export provider stuff at the top level (#2300)
    • [x] Make certain types such as DictionarySegmenter be pub(crate) (#2271)
    • [x] Hide the symbols module from the documentation (#2317)
  • [ ] Final sign-off from @aethanyc, @makotokato, and @sffc for feature coverage and an overall easy-to-use, polished class
  • [ ] Add to the icu meta-crate (after moving to components)

sffc avatar Jul 27 '22 00:07 sffc

I added one more small bullet point about the symbols module (CC @robertbastian)

sffc avatar Jul 29 '22 15:07 sffc

The symbols module is used in datagen.

https://github.com/unicode-org/icu4x/blob/1d4c6a146fb6fc88efb341a20cb06dc0dc2d42fd/provider/datagen/src/transform/segmenter/mod.rs#L17

I wonder whether it is possible to hide it from icu_segmenter's online doc, but let it remain accessible from icu_datagen?

aethanyc avatar Aug 02 '22 21:08 aethanyc

Yeah, okay, let's mark the module as #[doc(hidden)] (at the point it is pub use'd in lib.rs)

sffc avatar Aug 02 '22 21:08 sffc

Follow-up from #2716: in LineBreakIterator, the grapheme segmenter data should be pre-borrowed so that we don't need to re-borrow it each time. This is a known performance bottleneck that we don't want to propagate in new code.

sffc avatar Nov 02 '22 01:11 sffc

I've noticed that the public APIs eagerly return Vec<usize>s for segmentation points. Ideally we'd return some kind of lazy iterator. I don't think this is something we can implement for 1.2, but I'd like to change returns types to impl Iterator<Item = usize> so that we can do this in the future in a semver compatible way. Thoughts?

robertbastian avatar Apr 06 '23 10:04 robertbastian

I've noticed that the public APIs eagerly return Vecs for segmentation points.

Which user-facing public APIs return Vec<usize>? I found internal APIs such as https://github.com/unicode-org/icu4x/blob/3eff54d7cf654cb750e34ae8c84a17c9388aa468/experimental/segmenter/src/complex.rs#L187-L192 to be the possible candidates. Returning impl Iterator<Item = usize> seems nice if its caller can benefit from such change.

aethanyc avatar Apr 06 '23 17:04 aethanyc

Sorry I got confused by internal APIs that are marked pub.

robertbastian avatar Apr 07 '23 20:04 robertbastian