icu4x
Add support for sentence break suppression (`-u-ss`)
Sentence break suppression is specified in Unicode UTS #35, and there is a proposal to add it to ECMA-402. We should support it in ICU4X.
Assigning to @eggrobin since you're already in the thick of sentence segmentation.
To be clear, we're talking about this data:
https://github.com/unicode-org/cldr/blob/main/common/segments/en.xml
```xml
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE ldml SYSTEM "../../common/dtd/ldml.dtd">
<ldml>
  <identity>
    <version number="$Revision$"/>
    <language type="en"/>
  </identity>
  <segmentations>
    <segmentation type="SentenceBreak">
      <!--From ULI data, http://uli.unicode.org-->
      <suppressions type="standard">
        <suppression>L.P.</suppression>
        <suppression>Alt.</suppression>
        <suppression>Approx.</suppression>
        <suppression>E.G.</suppression>
        <suppression>O.</suppression>
        <suppression>Maj.</suppression>
        <suppression>Misc.</suppression>
```
CC @makotokato
If ICU4C already has a trie for this, you could reuse it. Otherwise, it's perfectly fine to build a trie in ICU4X datagen, for example with `zerotrie::ZeroTriePerfectHash`.
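
To make the trie idea concrete, here is a minimal sketch of the kind of lookup this would need. It uses only the Rust standard library rather than `zerotrie`, and it is not how ICU4C or ICU4X implement this. `Node`, `SuppressionTrie`, `from_suppressions`, and `suppresses` are hypothetical names, and both the reversed-key layout and the word-boundary check are assumptions made for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical trie node (not an ICU4X or zerotrie type).
#[derive(Default)]
struct Node {
    children: BTreeMap<char, Node>,
    terminal: bool,
}

/// Hypothetical suppression set. Keys are stored reversed so that a lookup
/// can walk backwards from a candidate sentence break.
#[derive(Default)]
struct SuppressionTrie {
    root: Node,
}

impl SuppressionTrie {
    /// Builds the trie from suppression strings such as "Maj." or "Approx.".
    fn from_suppressions(items: &[&str]) -> Self {
        let mut trie = Self::default();
        for item in items {
            let mut node = &mut trie.root;
            for c in item.chars().rev() {
                node = node.children.entry(c).or_default();
            }
            node.terminal = true;
        }
        trie
    }

    /// Returns true if `before` (the text preceding a candidate break) ends
    /// with a suppression that starts at a word boundary. Matching is
    /// case-sensitive; trailing whitespace is ignored. Both choices are
    /// assumptions of this sketch.
    fn suppresses(&self, before: &str) -> bool {
        let mut node = &self.root;
        let mut chars = before.trim_end().chars().rev().peekable();
        while let Some(c) = chars.next() {
            match node.children.get(&c) {
                Some(next) => node = next,
                None => return false,
            }
            if node.terminal {
                // Only accept the match if the suppression is not the tail
                // of a longer word (e.g. "O." inside "NATO.").
                let at_boundary = chars.peek().map_or(true, |p| !p.is_alphanumeric());
                if at_boundary {
                    return true;
                }
            }
        }
        false
    }
}
```

A segmenter could call `suppresses` with the text up to a break that the default rules propose, and skip the break when it returns true. A real implementation would bake the trie into data at datagen time instead of building it at runtime.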