icu4x
icu4x copied to clipboard
Move Segmenter to Components
Checklist for Segmenter in Components:
- [ ] Apply the boilerplate
- [x] Ensure everything is documented by adding back
#[warn(missing_docs)]
(#2366) - [ ] Add Clippy lints
- [ ] https://github.com/unicode-org/icu4x/pull/2325
- [x] https://github.com/unicode-org/icu4x/pull/2566
- [x] Ensure everything is documented by adding back
- [x] Check the public API surface and fix issues:
- [x] Don't re-export
provider
stuff at the top level (#2300) - [x] Make certain types such as
DictionarySegmenter
bepub(crate)
(#2271) - [x] Hide the
symbols
module from the documentation (#2317)
- [x] Don't re-export
- [ ] Final sign-off from @aethanyc, @makotokato, and @sffc for feature coverage and an overall easy-to-use, polished class
- [ ] Add to the
icu
meta-crate (after moving to components)
I added one more small bullet point about the symbols
module (CC @robertbastian)
The symbols
module is used in datagen.
https://github.com/unicode-org/icu4x/blob/1d4c6a146fb6fc88efb341a20cb06dc0dc2d42fd/provider/datagen/src/transform/segmenter/mod.rs#L17
I wonder whether it is possible to hide it from icu_segmenter
's online doc, but let it remain accessible from icu_datagen
?
Yeah, okay, let's mark the module as #[doc(hidden)]
(at the point it is pub use
'd in lib.rs)
Follow-up from #2716: in LineBreakIterator, the grapheme segmenter data should be pre-borrowed so that we don't need to re-borrow it each time. This is a known performance bottleneck that we don't want to propagate in new code.
I've noticed that the public APIs eagerly return Vec<usize>
s for segmentation points. Ideally we'd return some kind of lazy iterator. I don't think this is something we can implement for 1.2, but I'd like to change returns types to impl Iterator<Item = usize>
so that we can do this in the future in a semver compatible way. Thoughts?
I've noticed that the public APIs eagerly return Vec
s for segmentation points.
Which user-facing public APIs return Vec<usize>
? I found internal APIs such as https://github.com/unicode-org/icu4x/blob/3eff54d7cf654cb750e34ae8c84a17c9388aa468/experimental/segmenter/src/complex.rs#L187-L192 to be the possible candidates. Returning impl Iterator<Item = usize>
seems nice if its caller can benefit from such change.
Sorry I got confused by internal APIs that are marked pub
.