pySBD
🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection library that works out-of-the-box.
Hello @nipunsadvilkar, thank you for your efforts to port the Ruby library to Python. Do you see any benefit in porting it to JavaScript (Node.js) as well? And I wonder...
**Describe the bug** The requirement for spaCy 2.1.8 should be made more explicit (e.g., in new [requirements.txt](https://github.com/nipunsadvilkar/pySBD/blob/master/requirements.txt)). Currently, this is only in the benchmarking requirements (e.g., [requirements-benchmark.txt](https://github.com/nipunsadvilkar/pySBD/blob/master/requirements-benchmark.txt)). **To Reproduce** Steps...
**Describe the bug** Segmenter will raise "exception: bad escape (end of pattern) at position" when it is initialized with clean=True and it encounters a sentence like "etc.Png,Jpg,.\\" (word/token that contains...
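For readers unfamiliar with this error message, the sketch below reproduces it with Python's standard `re` module alone. It assumes (the report does not confirm this) that the failure comes from input text ending in a single backslash being used as a regex pattern without escaping. The helper `compile_literal` is hypothetical and is not part of pySBD.

```python
import re


def compile_literal(text: str, escape: bool = True) -> re.Pattern:
    """Hypothetical helper: compile `text` as a pattern, optionally escaping it first."""
    return re.compile(re.escape(text) if escape else text)


# Assumption: the token from the report ends with ONE literal backslash.
token = "etc.Png,Jpg,.\\"

try:
    # Using raw text as a pattern: the trailing backslash starts an
    # escape sequence that never finishes, so compilation fails.
    compile_literal(token, escape=False)
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)

# Escaping the text first makes every character match literally.
safe_pattern = compile_literal(token)
matched = safe_pattern.search("files: etc.Png,Jpg,.\\ and more") is not None
print(matched)
```

Whether pySBD's cleaner builds patterns from input text in this way is not stated in the report; the sketch only shows how `re` produces this message and how `re.escape` avoids it.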
**Describe the bug** When Chinese text quotes another person's words, the sentence segmentation results are wrong. **To Reproduce** Steps to reproduce the behavior: Input text - '"这是对的。"他说。' **Expected...
Thanks for creating this! Just wanted to know, how does this library compare against ICU's implementation of Text Segmentation-based sentence splitting for all supported languages? https://polyglot.readthedocs.io/en/latest/Tokenization.html
**Describe the bug** The example script is broken with the latest versions of spaCy and pySBD. Adding a pipe to a spaCy model throws an exception. Moreover, sentences are...
Input ``` The generalized Li coefficients. The convergence of the sum (2) defining the generalized Li coefficients associated to the function F ∈ S ♯♭ 0 is proved in [28]...
**Describe the bug** Segmenter will hang if it encounters unfinished HTML (an unescaped HTML attribute with an unfinished HTML tag). The reason is the regex that can be found in rules.py,...
The code below seems to hang forever:
```python
import pysbd

segmenter = pysbd.Segmenter(language="en", clean=False)
text = "..[111 111 111 111 111 111 111 111 111 111]"
segmenter.segment(text)
```
Interrupting, I get the traceback:...
Hi! Thanks for this library. Since there is no notion of documents in the OPUS-100 dataset it is not clear to me how accuracy is computed. I tried a naive...