Use "regex" library from PyPI to build GLOBAL_SENTENCE_TERMINATORS regex
Fixes #3
This is analogous to sentencex-js/pull/3 — both remove the terminators codepoint list module, and instead use an external library that tracks the Unicode Character Database.
Hi @divec, sorry for the delay in review. Holidays!
All the tests pass with this pull request, but I noticed a performance regression. Running accuracy.py in the benchmarks folder of this repo gives the following measurements:
- Using the `re` package and terminators.py (master version): 0.75 seconds (avg over 100 runs)
- This pull request, which uses the `regex` package: 1.83 seconds (avg over 100 runs)
So that is more than two times the master version. Since the original issue is only with sentence terminator regex and not with other regex usage, I tried to apply a minimal patch:
```diff
diff --git a/sentencex/base.py b/sentencex/base.py
index a2218e6..e2c0bde 100644
--- a/sentencex/base.py
+++ b/sentencex/base.py
@@ -1,4 +1,5 @@
 import re
+import regex
 from typing import Dict, Iterator, Tuple
 from .terminators import GLOBAL_SENTENCE_TERMINATORS
@@ -44,7 +45,7 @@ class Language(object, metaclass=Languages):
     language = "base"
     abbreviations: set = set()
-    GLOBAL_SENTENCE_BOUNDARY_REGEX = re.compile(r"[%s]+" % "".join(GLOBAL_SENTENCE_TERMINATORS))
+    GLOBAL_SENTENCE_BOUNDARY_REGEX = regex.compile(r"\p{Sentence_Terminal}+")
     quotes_regx_str = r"|".join([f"{left}(\n|.)*?{right}" for left, right in quote_pairs.items()])
     quotes_regex = re.compile(r"%s+" % quotes_regx_str)
     parens_regex = re.compile(r"([\((<{\[])(?:\\\1|.)*?[\)\]})]")
```
Then I got 1.37 seconds, which is better, but still a performance regression.
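For comparison, the gap between the two approaches can be seen even in a standalone micro-benchmark along these lines (this is a hypothetical harness, not the repo's accuracy.py, and the terminator subset is illustrative only):

```python
# Micro-benchmark sketch: a precompiled character-class pattern (stdlib re)
# vs the regex package's \p{Sentence_Terminal} Unicode property pattern.
import re
import timeit

TERMINATORS = ".!?\u2026\u3002"  # tiny illustrative subset of terminators
text = "One. Two! Three? Four\u3002 " * 100

# Zero-dependency approach: explicit character class built from a list.
list_pattern = re.compile("[%s]+" % re.escape(TERMINATORS))
list_time = timeit.timeit(lambda: list_pattern.findall(text), number=200)
print(f"character-class pattern: {list_time:.4f}s")

try:
    import regex  # third-party: pip install regex

    # Property-based approach: tracks the UCD, but matches more slowly here.
    prop_pattern = regex.compile(r"\p{Sentence_Terminal}+")
    prop_time = timeit.timeit(lambda: prop_pattern.findall(text), number=200)
    print(f"property-based pattern: {prop_time:.4f}s")
except ImportError:
    print("regex package not installed; skipping property-based timing")
```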
Maintaining an in-library collection of these characters helped us achieve a JavaScript counterpart, so the zero-dependency design of this library was an intentional decision. What if we fix the Amharic issue and the comment issue you pointed out, and keep the zero-dependency pattern as such?
The terminators collection was originally sourced from the wikinlp tools repo. I re-ran the unichars program and I don't see these Amharic/Ethiopic characters in its output. That must be a discrepancy in the list.
https://gist.github.com/santhoshtr/af560fde46b7ee56ed14fd3f6614de01
However, sometimes we need additional characters. For example, horizontal ellipsis is not in the list.
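One way to keep the zero-dependency pattern and still carry such extras is to append them to the maintained list explicitly. A minimal sketch (the variable split into `UCD_TERMINATORS`/`EXTRA_TERMINATORS` is hypothetical, not sentencex's actual layout, and the subset is illustrative):

```python
# Sketch: zero-dependency terminator list plus heuristic extras such as
# U+2026 HORIZONTAL ELLIPSIS, which is not Sentence_Terminal in the UCD.
import re

UCD_TERMINATORS = ["!", ".", "?", "\u0964", "\u3002"]  # illustrative subset
EXTRA_TERMINATORS = ["\u2026"]  # added by project choice, not from the UCD

GLOBAL_SENTENCE_TERMINATORS = UCD_TERMINATORS + EXTRA_TERMINATORS
GLOBAL_SENTENCE_BOUNDARY_REGEX = re.compile(
    "[%s]+" % re.escape("".join(GLOBAL_SENTENCE_TERMINATORS))
)

print(bool(GLOBAL_SENTENCE_BOUNDARY_REGEX.search("Wait\u2026")))  # True
```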
Thanks @santhoshtr. If we maintain the list of terminators within sentencex, then we probably also need to maintain the list of close punctuation and spaces in order to expose boundary information:
| Value | Represents |
|---|---|
| SATerm | (STerm | ATerm) |
| STerm | Sentence_Terminal = Yes |
| ATerm | U+002E ( . ) FULL STOP, U+2024 ( ․ ) ONE DOT LEADER, U+FE52 ( ﹒ ) SMALL FULL STOP, U+FF0E ( ． ) FULLWIDTH FULL STOP |
| Close | General_Category = Open_Punctuation, or General_Category = Close_Punctuation, or Line_Break = Quotation, and not U+05F3 ( ׳ ) HEBREW PUNCTUATION GERESH, and ATerm = No, and STerm = No |
| Sep | U+0085 NEXT LINE (NEL), U+2028 LINE SEPARATOR, U+2029 PARAGRAPH SEPARATOR |
The master copy of these lists is available at https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/SentenceBreakProperty.txt, which we pull into UnicodeJS using an extraction script that might be useful to borrow from, if you decide that maintaining these lists within sentencex is worthwhile for the performance boost.
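The extraction step is simple enough to sketch in a few lines. This is not the UnicodeJS script itself, just an illustration of the parsing it would do; the sample data is a tiny excerpt of the file's format, not the full UCD:

```python
# Sketch: parse SentenceBreakProperty.txt-style lines ("XXXX ; Prop # ..."
# or "XXXX..YYYY ; Prop # ...") into per-property codepoint lists.
import re

SAMPLE = """\
0021          ; STerm # Po       EXCLAMATION MARK
003F          ; STerm # Po       QUESTION MARK
002E          ; ATerm # Po       FULL STOP
2024          ; ATerm # Po       ONE DOT LEADER
061D..061F    ; STerm # Po   [3] ARABIC END OF TEXT MARK..ARABIC QUESTION MARK
"""

LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")

def extract(text):
    props = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # skip comments and blank lines
        start, end, prop = m.groups()
        lo, hi = int(start, 16), int(end or start, 16)
        props.setdefault(prop, []).extend(chr(c) for c in range(lo, hi + 1))
    return props

props = extract(SAMPLE)
print(props["ATerm"])        # ['.', '․']
print(len(props["STerm"]))   # 5 (two ASCII marks plus the Arabic range)
```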
> The terminators collection was originally sourced from the wikinlp tools repo. I re-ran the unichars program and I don't see these Amharic/Ethiopic characters in its output. That must be a discrepancy in the list.
>
> https://gist.github.com/santhoshtr/af560fde46b7ee56ed14fd3f6614de01
>
> However, sometimes we need additional characters. For example, horizontal ellipsis is not in the list.
Ah ok, understood. I wonder if the wikinlp version got partially corrupted due to encoding errors? The new version in the commit you added looks a lot less strange!
I note the new code in the commit adds U+3002 IDEOGRAPHIC FULL STOP manually; a related character that seems to be missing is U+FF61 HALFWIDTH IDEOGRAPHIC FULL STOP. I wonder why these are missing from the Perl script output. Maybe it omits ideographic characters to save memory or something?
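For the record, both codepoints and their official names can be confirmed from the stdlib:

```python
# Confirm the two ideographic full stops by their official Unicode names.
import unicodedata

print(unicodedata.name("\u3002"))  # IDEOGRAPHIC FULL STOP
print(unicodedata.name("\uff61"))  # HALFWIDTH IDEOGRAPHIC FULL STOP
```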
U+2026 HORIZONTAL ELLIPSIS is not actually listed as STerm in the Unicode Character Database. Did we make a heuristic decision to treat it as a sentence terminator nevertheless? If we really want to include it, the manual entry should use "\u2026" in the string (right now the string is equal to "\u002E\u002E\u002E", i.e. three Basic Latin FULL STOP characters).
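The distinction is easy to miss because the two render almost identically; a quick check:

```python
# "\u2026" is one codepoint; "..." is three separate FULL STOP characters.
ellipsis = "\u2026"        # HORIZONTAL ELLIPSIS, a single codepoint
three_dots = "\u002E" * 3  # three Basic Latin FULL STOP characters

print(ellipsis == three_dots)          # False
print(len(ellipsis), len(three_dots))  # 1 3
```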