Use "regex" library from PyPI to build GLOBAL_SENTENCE_TERMINATORS regex
Fixes #3
This is analogous to sentencex-js/pull/3 — both remove the terminators codepoint list module, and instead use an external library that tracks the Unicode Character Database.
Hi @divec, sorry for the delay in review. Holidays!
All the tests pass with this pull request, but I noticed a performance regression. Running accuracy.py in the benchmarks folder of this repo gives the following measurements:
- Using the `re` package and terminators.py (master version): 0.75 seconds (avg over 100 runs)
- This pull request, which uses the `regex` package: 1.83 seconds (avg over 100 runs)
So that is more than two times the master version. Since the original issue is only with sentence terminator regex and not with other regex usage, I tried to apply a minimal patch:
```diff
diff --git a/sentencex/base.py b/sentencex/base.py
index a2218e6..e2c0bde 100644
--- a/sentencex/base.py
+++ b/sentencex/base.py
@@ -1,4 +1,5 @@
 import re
+import regex
 from typing import Dict, Iterator, Tuple
 from .terminators import GLOBAL_SENTENCE_TERMINATORS
@@ -44,7 +45,7 @@ class Language(object, metaclass=Languages):
     language = "base"
     abbreviations: set = set()
-    GLOBAL_SENTENCE_BOUNDARY_REGEX = re.compile(r"[%s]+" % "".join(GLOBAL_SENTENCE_TERMINATORS))
+    GLOBAL_SENTENCE_BOUNDARY_REGEX = regex.compile(r"\p{Sentence_Terminal}+")
     quotes_regx_str = r"|".join([f"{left}(\n|.)*?{right}" for left, right in quote_pairs.items()])
     quotes_regex = re.compile(r"%s+" % quotes_regx_str)
     parens_regex = re.compile(r"([\((<{\[])(?:\\\1|.)*?[\)\]})]")
```
Then I got 1.37 seconds, which is better, but still a performance regression.
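For comparison, the gap between the two approaches can be seen even in a standalone micro-benchmark along these lines (this is a hypothetical harness, not the repo's accuracy.py, and the terminator subset is illustrative only):

```python
# Micro-benchmark sketch: a precompiled character-class pattern (stdlib re)
# vs the regex package's \p{Sentence_Terminal} Unicode property pattern.
import re
import timeit

TERMINATORS = ".!?\u2026\u3002"  # tiny illustrative subset of terminators
text = "One. Two! Three? Four\u3002 " * 100

# Zero-dependency approach: explicit character class built from a list.
list_pattern = re.compile("[%s]+" % re.escape(TERMINATORS))
list_time = timeit.timeit(lambda: list_pattern.findall(text), number=200)
print(f"character-class pattern: {list_time:.4f}s")

try:
    import regex  # third-party: pip install regex

    # Property-based approach: tracks the UCD, but matches more slowly here.
    prop_pattern = regex.compile(r"\p{Sentence_Terminal}+")
    prop_time = timeit.timeit(lambda: prop_pattern.findall(text), number=200)
    print(f"property-based pattern: {prop_time:.4f}s")
except ImportError:
    print("regex package not installed; skipping property-based timing")
```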
Maintaining an in-library collection of these characters helped us achieve a JavaScript counterpart, so the zero-dependency design of this library was an intentional decision. What if we fix the Amharic issue and the comment issue you pointed out, and keep the zero-dependency pattern as such?
The terminators collection was originally sourced from the wikinlp tools repo. I re-ran the unichars program and I don't see these Amharic/Ethiopic characters in its output. That must be a discrepancy in the list.
https://gist.github.com/santhoshtr/af560fde46b7ee56ed14fd3f6614de01
However, sometimes we need additional characters. For example, horizontal ellipsis is not in the list.
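One way to keep the zero-dependency pattern and still carry such extras is to append them to the maintained list explicitly. A minimal sketch (the variable split into `UCD_TERMINATORS`/`EXTRA_TERMINATORS` is hypothetical, not sentencex's actual layout, and the subset is illustrative):

```python
# Sketch: zero-dependency terminator list plus heuristic extras such as
# U+2026 HORIZONTAL ELLIPSIS, which is not Sentence_Terminal in the UCD.
import re

UCD_TERMINATORS = ["!", ".", "?", "\u0964", "\u3002"]  # illustrative subset
EXTRA_TERMINATORS = ["\u2026"]  # added by project choice, not from the UCD

GLOBAL_SENTENCE_TERMINATORS = UCD_TERMINATORS + EXTRA_TERMINATORS
GLOBAL_SENTENCE_BOUNDARY_REGEX = re.compile(
    "[%s]+" % re.escape("".join(GLOBAL_SENTENCE_TERMINATORS))
)

print(bool(GLOBAL_SENTENCE_BOUNDARY_REGEX.search("Wait\u2026")))  # True
```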
Thanks @santhoshtr. If we maintain the list of terminators within sentencex, then we probably also need to maintain the list of close punctuation and spaces in order to expose boundary information:
| Value | Represents |
|---|---|
| SATerm | (STerm | ATerm) |
| STerm | Sentence_Terminal = Yes |
| ATerm | U+002E ( . ) FULL STOP, U+2024 ( ․ ) ONE DOT LEADER, U+FE52 ( ﹒ ) SMALL FULL STOP, U+FF0E ( ． ) FULLWIDTH FULL STOP |
| Close | General_Category = Open_Punctuation, or General_Category = Close_Punctuation, or Line_Break = Quotation, and not U+05F3 ( ׳ ) HEBREW PUNCTUATION GERESH, and ATerm = No, and STerm = No |
| Sep | U+0085 NEXT LINE (NEL), U+2028 LINE SEPARATOR, U+2029 PARAGRAPH SEPARATOR |
The master copy of these lists is available at https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/SentenceBreakProperty.txt, which we pull into UnicodeJS using an extraction script that might be useful to borrow from, if you decide that maintaining these lists within sentencex is worthwhile for the performance boost.
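The extraction step is simple enough to sketch in a few lines. This is not the UnicodeJS script itself, just an illustration of the parsing it would do; the sample data is a tiny excerpt of the file's format, not the full UCD:

```python
# Sketch: parse SentenceBreakProperty.txt-style lines ("XXXX ; Prop # ..."
# or "XXXX..YYYY ; Prop # ...") into per-property codepoint lists.
import re

SAMPLE = """\
0021          ; STerm # Po       EXCLAMATION MARK
003F          ; STerm # Po       QUESTION MARK
002E          ; ATerm # Po       FULL STOP
2024          ; ATerm # Po       ONE DOT LEADER
061D..061F    ; STerm # Po   [3] ARABIC END OF TEXT MARK..ARABIC QUESTION MARK
"""

LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")

def extract(text):
    props = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if not m:
            continue  # skip comments and blank lines
        start, end, prop = m.groups()
        lo, hi = int(start, 16), int(end or start, 16)
        props.setdefault(prop, []).extend(chr(c) for c in range(lo, hi + 1))
    return props

props = extract(SAMPLE)
print(props["ATerm"])        # ['.', '․']
print(len(props["STerm"]))   # 5 (two ASCII marks plus the Arabic range)
```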
> The terminators collection was originally sourced from the wikinlp tools repo. I re-ran the unichars program and I don't see these Amharic/Ethiopic characters in its output. That must be a discrepancy in the list.
>
> https://gist.github.com/santhoshtr/af560fde46b7ee56ed14fd3f6614de01
>
> However, sometimes we need additional characters. For example, horizontal ellipsis is not in the list.
Ah ok, understood. I wonder if the wikinlp version got partially corrupted due to encoding errors? The new version in the commit you added looks a lot less strange!
I note the new code in the commit adds U+3002 IDEOGRAPHIC FULL STOP manually; a related character that seems to be missing is U+FF61 HALFWIDTH IDEOGRAPHIC FULL STOP. I wonder why these are missing from the Perl script output. Maybe it omits ideographic characters to save memory or something?
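For the record, both codepoints and their official names can be confirmed from the stdlib:

```python
# Confirm the two ideographic full stops by their official Unicode names.
import unicodedata

print(unicodedata.name("\u3002"))  # IDEOGRAPHIC FULL STOP
print(unicodedata.name("\uff61"))  # HALFWIDTH IDEOGRAPHIC FULL STOP
```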
U+2026 HORIZONTAL ELLIPSIS is not actually listed as STerm in the Unicode Character Database. Did we make a heuristic decision to treat it as a sentence terminator nevertheless? If we really want to include it, the manual entry should use "\u2026" in the string (right now the string is equal to "\u002E\u002E\u002E", i.e. three Basic Latin FULL STOP characters).
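The distinction is easy to miss because the two render almost identically; a quick check:

```python
# "\u2026" is one codepoint; "..." is three separate FULL STOP characters.
ellipsis = "\u2026"        # HORIZONTAL ELLIPSIS, a single codepoint
three_dots = "\u002E" * 3  # three Basic Latin FULL STOP characters

print(ellipsis == three_dots)          # False
print(len(ellipsis), len(three_dots))  # 1 3
```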