pragmatic_segmenter
Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
My sentence `‘What’s the matter, Jerry?’ called Mr Lorry.` is segmented into two parts:
- `‘What’s the matter, Jerry?`
- `’ called Mr Lorry.`
Can you list all the supported languages? It would be helpful to know if I were to use this in a project.
I've been testing the ellipsis rules with . . . replaced with U+2026 (…) and find that pragmatic segmenter fails when given the actual ellipsis character. I'm probably missing something...
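One possible workaround for the report above, sketched under an assumption: if the ASCII forms of the ellipsis are segmented as expected, the U+2026 character could be rewritten to three periods before the text reaches the segmenter. `normalize_ellipsis` is a hypothetical helper, not part of pragmatic_segmenter, and the sample sentence is made up for illustration.

```ruby
# Hypothetical preprocessing helper (not part of the pragmatic_segmenter API):
# rewrite the single-character ellipsis U+2026 as three ASCII periods.
def normalize_ellipsis(text)
  text.gsub("\u2026", "...")
end

sample = "I never meant that\u2026 She left the store."
normalized = normalize_ellipsis(sample)

# The normalized string is what would then be handed to the segmenter.
puts normalized
```

Note that this changes the string length (one character becomes three), so any character offsets computed afterwards refer to the normalized text, not the original.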
See the example below: when the 'clean' parameter is 'false', the asterisk after 'Cat' is still removed. ``` pry(main)> s = "I am a dog. Cat.*" => "I am a dog. Cat.*"...
I'm trying to test this library against some larger English corpora but I'm running into trouble aligning the results back to the original text. Even with "clean" turned off, the...
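A minimal sketch of one way to approach the alignment problem described above, assuming each returned segment still appears verbatim in the original text and only the whitespace between segments has been dropped. `align_segments` is a hypothetical helper, and the `segments` array is a hand-written stand-in for segmenter output, not actual output from the gem.

```ruby
# Hypothetical helper: locate each segment in the original text, scanning
# left to right so that repeated sentences map to successive occurrences.
# Offsets are character indices; stop is exclusive.
def align_segments(original, segments)
  cursor = 0
  segments.map do |segment|
    start = original.index(segment, cursor)
    if start.nil?
      # The segment was altered by the segmenter and cannot be matched verbatim.
      { text: segment, start: nil, stop: nil }
    else
      cursor = start + segment.length
      { text: segment, start: start, stop: cursor }
    end
  end
end

original = "Hello world.  My name is Jonas.\nWhat is yours?"
# Stand-in for segmenter output (assumed, not produced by the gem here).
segments = ["Hello world.", "My name is Jonas.", "What is yours?"]

align_segments(original, segments).each { |span| p span }
```

Segments that come back with `start: nil` are the ones the segmenter modified, which at least narrows down where the output diverges from the input.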
Hi, when I used this great tool to preprocess Wikipedia dumps, I hit an infinite loop that ended in a NoMemoryError. Example: When we input > '' (a '\0 !\0') with...
It seems that pragmatic segmenter does not correctly split Spanish text when there is no space after a period, e.g. "Hola señorita.Espero que muy bien." This works with language: 'en'...
I'd like to use the segmenter with a large Java application and would prefer not to use JRuby, etc. I'd prefer to run the segmenter as a lightweight JSON service....
I have a corpus of text that often uses explicit non-breaking spaces (NBSP, U+00A0). They are mainly used to keep together words in the same sentence. They often appear after...
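The report above is truncated, so the exact failure is not shown. If the segmenter turns out to treat U+00A0 differently from a regular space, one possible preprocessing step is to swap NBSPs for plain spaces and remember where they were. The helpers below are hypothetical and not part of pragmatic_segmenter; the sample text is made up for illustration.

```ruby
NBSP = "\u00A0"

# Hypothetical helper: replace every NBSP with a regular space.
# The replacement is one character for one character, so offsets are preserved.
def replace_nbsp(text)
  text.tr(NBSP, " ")
end

# Hypothetical helper: character indices at which the text contains an NBSP.
def nbsp_positions(text)
  (0...text.length).select { |i| text[i] == NBSP }
end

# Hypothetical helper: put NBSPs back at the recorded indices of a
# same-length string (for example the full text after replace_nbsp).
def restore_nbsp(text, positions)
  restored = text.dup
  positions.each { |i| restored[i] = NBSP }
  restored
end

sample = "See p.#{NBSP}42 for details. It was written by M.#{NBSP}Dupont."
positions = nbsp_positions(sample)
plain = replace_nbsp(sample)

puts plain
puts restore_nbsp(plain, positions) == sample
```

Because the replacement preserves length, the recorded positions stay valid for the full text; restoring NBSPs inside individual segments would additionally require each segment's offset in that text.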