pragmatic_segmenter

Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
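For orientation, here is a minimal usage sketch based on the gem's documented `PragmaticSegmenter::Segmenter` interface. The wrapper method `segment_sentences` is a hypothetical helper introduced only for this example, and the `LoadError` guard is there so the snippet still loads when the gem is not installed.

```ruby
# Load the gem if available; otherwise print a hint instead of crashing.
begin
  require "pragmatic_segmenter"
rescue LoadError
  warn "pragmatic_segmenter is not installed; try `gem install pragmatic_segmenter`"
end

# Hypothetical convenience wrapper (not part of the gem).
# Returns an Array of sentences, or nil when the gem is unavailable.
def segment_sentences(text, language: "en")
  return nil unless defined?(PragmaticSegmenter)

  PragmaticSegmenter::Segmenter.new(text: text, language: language).segment
end

sentences = segment_sentences("Hello world. My name is Mr. Smith. I live in New York.")
puts sentences.inspect
```

The `language:` argument takes a two-letter code such as `"en"` or `"de"`.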

Results 33 pragmatic_segmenter issues

Is there a way to retain/return the whitespace that was removed between sentences? I am looking for a way to detect paragraph breaks (essentially two newlines).
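One possible workaround, sketched here under assumptions rather than taken from the gem, is to split the text on blank lines first and segment each paragraph separately, so that paragraph boundaries survive as the outer array. `segment_by_paragraph` and `NAIVE_SEGMENTER` are hypothetical names; the naive splitter only stands in for a real segmenter so the sketch runs on its own.

```ruby
# Hypothetical helper: split into paragraphs on blank lines, then segment
# each paragraph. `segmenter` is any callable mapping a String to an Array
# of sentence Strings.
def segment_by_paragraph(text, segmenter)
  text.split(/\n[ \t]*\n+/)
      .map(&:strip)
      .reject(&:empty?)
      .map { |paragraph| segmenter.call(paragraph) }
end

# Naive stand-in segmenter (assumption): splits after ., ! or ? followed by
# whitespace. Swap in a real segmenter for actual use.
NAIVE_SEGMENTER = ->(s) { s.split(/(?<=[.!?])\s+/) }

text = "First one. Second one.\n\nNew paragraph here."
p segment_by_paragraph(text, NAIVE_SEGMENTER)
# => [["First one.", "Second one."], ["New paragraph here."]]
```

Each inner array holds the sentences of one paragraph, so a paragraph break is simply the boundary between two inner arrays.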

German contains various abbreviations that may be written with a space; for instance, `zum Beispiel` is commonly abbreviated as `z. B.`, with a space between the two parts. Is this something...

Here is the relevant part of the stack trace:
```
SyntaxError: ~/.rvm/gems/jruby-9.3.2.0/gems/pragmatic_segmenter-0.3.22/lib/pragmatic_segmenter/list.rb:32: ASCII-8BIT mixed within UTF-8 source
/\s\d{1,2}(?=\.\s)|^\d{1,2}(?=...
^
require at org/jruby/RubyKernel.java:1017
...
```
Thanks for any help with this.

Would adding 'in.' as an abbreviation for inches break the test suite in many cases?

I noticed that `à` seems not to be detected as a lower-case letter after an abbreviation: ``` assert_equal 1, segment("85,7 cm (33 3/4 po) min. à 88,9 cm (35 po)...

I stumbled upon the following case where (the otherwise wonderful) PragmaticSegmenter trips up: it splits a sentence containing a telephone number with letter characters. `800.ACME.NOW` is split after `800.`:...
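Until a rule covers this case, one conceivable workaround is to mask the periods inside such numbers before segmenting and restore them afterwards. The sketch below is an assumption-laden illustration, not the gem's behavior: `PLACEHOLDER`, `VANITY_NUMBER`, `protect_vanity_numbers`, and `restore_periods` are all hypothetical names, and the pattern only covers numbers shaped like `800.ACME.NOW`.

```ruby
# Private-use code point used as a temporary stand-in for "." (assumption:
# it does not occur in the input text).
PLACEHOLDER = "\uE000"

# Three digits followed by two or more dot-separated groups of uppercase
# letters or digits, e.g. 800.ACME.NOW
VANITY_NUMBER = /\b\d{3}(?:\.[A-Z0-9]+){2,}\b/

# Replace the periods inside matching numbers so a segmenter cannot split there.
def protect_vanity_numbers(text)
  text.gsub(VANITY_NUMBER) { |match| match.tr(".", PLACEHOLDER) }
end

# Put the periods back in each sentence after segmentation.
def restore_periods(sentence)
  sentence.tr(PLACEHOLDER, ".")
end

masked = protect_vanity_numbers("Call 800.ACME.NOW today.")
puts restore_periods(masked)
# prints "Call 800.ACME.NOW today."
```

The sentence-final period is left untouched because it is not followed by another letter or digit group.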

Hello, I have been using pragmatic segmenter by following the steps below: `sudo apt-get install ruby-full`, then `gem install pragmatic_segmenter`. After installing pragmatic_segmenter I got this: Successfully installed pragmatic_segmenter-0.3.22...

Hi, in French we have a ... at the end of a sentence, but here it doesn't segment correctly. I think it's because etc is also an abbreviation that is written...

`replace_parens_in_numbered_list()` calls `scan_lists()` twice with the same parameters. I checked the commit that introduced the duplication and it looks like a mistake. https://github.com/diasks2/pragmatic_segmenter/blob/1ade491c81f9d1d7fb3abd4c1e2e266fa5b34c42/lib/pragmatic_segmenter/list.rb#L100-L104 Reference: https://github.com/diasks2/pragmatic_segmenter/commit/c5edc452d3ee18c08c8e91f8108075078edc51e1#diff-8d1d0f65d5ba9ff24847dd238f3db5e7R103

The text in question: > On April 11, our friends at the Financial Times' "Alphachat" podcast invited THE INDICATOR to host a panel at a bar in Washington, D.C. The...