
Add support for sentence break suppression (`-u-ss`)

Open sffc opened this issue 2 years ago • 3 comments

Sentence break suppression is specified in Unicode UTS #35 (LDML), and there is a proposal to add it to ECMA-402. We should support it in ICU4X.

sffc avatar Aug 23 '23 23:08 sffc

Assigning to @eggrobin since you're already in the thick of sentence segmentation.

sffc avatar Sep 21 '23 17:09 sffc

To be clear, we're talking about this data:

https://github.com/unicode-org/cldr/blob/main/common/segments/en.xml

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE ldml SYSTEM "../../common/dtd/ldml.dtd">
<ldml>
  <identity>
    <version number="$Revision$"/>
    <language type="en"/>
  </identity>
  <segmentations>
    <segmentation type="SentenceBreak">
      <!--From ULI data, http://uli.unicode.org-->
      <suppressions type="standard">
        <suppression>L.P.</suppression>
        <suppression>Alt.</suppression>
        <suppression>Approx.</suppression>
        <suppression>E.G.</suppression>
        <suppression>O.</suppression>
        <suppression>Maj.</suppression>
        <suppression>Misc.</suppression>

sffc avatar Sep 18 '24 00:09 sffc

CC @makotokato

If ICU4C already has a trie for this data, you could re-use it. Otherwise, it's perfectly fine to build a trie in ICU4X datagen, for example with `zerotrie::ZeroTriePerfectHash`.

sffc avatar Sep 18 '24 00:09 sffc
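
To make the trie idea above concrete, here is a minimal sketch of how the suppression strings could be used to filter candidate sentence breaks. It is not ICU4X's API and it does not use `zerotrie`: a plain `HashMap`-based reverse trie stands in for whatever structure datagen would actually emit, and `ReverseTrie`, `ends_with_suppression`, and `filter_breaks` are hypothetical names. The matching rule (a suppression must start at the beginning of the text or after whitespace, and matching is case-sensitive) is a simplifying assumption, not the algorithm from UTS #35.

```rust
use std::collections::HashMap;

/// Hypothetical trie keyed on characters in reverse order, so a lookup can
/// walk backwards from a candidate sentence break.
#[derive(Default)]
struct ReverseTrie {
    children: HashMap<char, ReverseTrie>,
    terminal: bool,
}

impl ReverseTrie {
    /// Builds the trie from suppression strings such as "Maj." or "Approx.".
    fn from_suppressions(items: &[&str]) -> Self {
        let mut root = ReverseTrie::default();
        for item in items {
            let mut node = &mut root;
            for ch in item.chars().rev() {
                node = node.children.entry(ch).or_default();
            }
            node.terminal = true;
        }
        root
    }

    /// Returns true if `prefix` ends with a suppression that begins at the
    /// start of the text or right after whitespace (assumed boundary rule).
    fn ends_with_suppression(&self, prefix: &str) -> bool {
        let mut node = self;
        let mut rest = prefix.chars().rev().peekable();
        while let Some(&ch) = rest.peek() {
            match node.children.get(&ch) {
                Some(next) => {
                    node = next;
                    rest.next();
                }
                None => break,
            }
            if node.terminal {
                match rest.peek() {
                    None => return true,
                    Some(c) if c.is_whitespace() => return true,
                    _ => {}
                }
            }
        }
        false
    }
}

/// Drops candidate breaks (byte offsets on char boundaries) whose preceding
/// text, ignoring trailing whitespace, ends with a suppression.
fn filter_breaks(trie: &ReverseTrie, text: &str, breaks: &[usize]) -> Vec<usize> {
    breaks
        .iter()
        .copied()
        .filter(|&b| !trie.ends_with_suppression(text[..b].trim_end()))
        .collect()
}
```

With the seven suppressions visible in the excerpt above, and assuming a segmenter proposed breaks at byte offsets 8, 14, 19, and 30 in `"Approx. 5 km. Maj. Smith left."`, `filter_breaks` keeps only 14 and 30. Walking backwards lets the check start at the candidate break, so the work per break is bounded by the longest suppression rather than by the length of the text.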