Arabic sentence split on the Arabic comma

Open ymoslem opened this issue 2 years ago • 0 comments

Describe the bug: An Arabic sentence is split at the Arabic comma (،), which is not a sentence boundary.

To Reproduce: run the following snippet.

import pysbd
text = "هذه تجربة، للغة العربية"
seg = pysbd.Segmenter(language="ar", clean=True)
print(seg.segment(text))

Output: ['هذه تجربة،', 'للغة العربية']

Expected behavior: The text should not be split on the Arabic comma. Expected output: ['هذه تجربة، للغة العربية']

Additional context: I fixed it locally by editing pysbd/lang/arabic.py and deleting ، from SENTENCE_BOUNDARY_REGEX.
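To illustrate why removing ، from the boundary pattern changes the result, here is a minimal sketch using only Python's `re` module. The two patterns and the `split_sentences` helper are hypothetical stand-ins written for this example; they are not copied from pysbd and may differ from the actual SENTENCE_BOUNDARY_REGEX.

```python
import re

text = "هذه تجربة، للغة العربية"

# Hypothetical boundary patterns, for illustration only (not pysbd's own).
# The first treats the Arabic comma (U+060C) as sentence-ending punctuation;
# the second is identical except that the comma has been removed.
WITH_COMMA = r".*?[:\.!\?؟،]|.*?\Z"
WITHOUT_COMMA = r".*?[:\.!\?؟]|.*?\Z"


def split_sentences(pattern, s):
    """Return the non-empty, stripped chunks matched by a boundary pattern."""
    return [m.strip() for m in re.findall(pattern, s) if m.strip()]


# With the comma in the character class, the text is cut after the comma.
print(split_sentences(WITH_COMMA, text))

# Without it, the whole text is returned as a single sentence.
print(split_sentences(WITHOUT_COMMA, text))
```

With the comma in the character class, the lazy match stops right after ، and yields two chunks; without it, no terminator is found and the fallback branch consumes the whole string as one sentence.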

ymoslem, May 18 '22 17:05