pragmatic_segmenter
Pragmatic Segmenter is a rule-based sentence boundary detection gem that works out-of-the-box across many languages.
Hi, we have used your segmenter to process a very large corpus (a wiki dump) of 320 MB. It is written in Kazakh, but the segmenter is going to segment a very very...
doc_type
What kind of `doc_types` are supported? I have tried `html`, but it is not working.
tatar.rb
Add an example for Tatar with abbreviations.
Would it be possible to get instructions for how to use this on the command line in a pipe? E.g.
```
$ cat ~/corpora/languages/tatar/wikipedia/wiki.txt | ruby pragmatic_segmenter.rb
```
This gives...
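A pipe like the one requested could be served by a small wrapper script. The following is only a sketch under assumptions: `segment_stream`, `NAIVE_SPLITTER`, and `GEM_SEGMENTER` are hypothetical names, each input line is treated as an independent paragraph, and the gem is called through its `PragmaticSegmenter::Segmenter.new(text:, language:)` constructor and `#segment` method.

```ruby
require 'stringio'

# Hypothetical helper: reads paragraphs line by line from `input`, passes each
# one to `segmenter` (any callable returning an array of sentences), and
# writes one sentence per line to `output`.
def segment_stream(input, output, segmenter)
  input.each_line do |line|
    paragraph = line.strip
    next if paragraph.empty?

    segmenter.call(paragraph).each { |sentence| output.puts(sentence) }
  end
end

# Naive stand-in splitter, used only to try the helper without the gem.
NAIVE_SPLITTER = ->(text) { text.split(/(?<=[.?!])\s+/) }

# Assumed wiring for the gem. The require runs only when the lambda is
# called, so the gem has to be installed before this is used.
GEM_SEGMENTER = lambda do |text|
  require 'pragmatic_segmenter'
  PragmaticSegmenter::Segmenter.new(text: text, language: 'en').segment
end

# In a script reading from a pipe, the call would be:
#   segment_stream($stdin, $stdout, GEM_SEGMENTER)
```

Processing line by line keeps memory use bounded for large dumps, at the cost of never joining a sentence that spans two lines.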
`Pragmatic_Segmenter` should be able to return segments of sentences of a maximum size. E.g. https://github.com/akalsey/textchunk https://github.com/algolia/chunk-text The following code example is donated by https://auditus.cc courtesy of @Havenwood. It is used...
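The idea of size-limited segments can be sketched as a post-processing step over an array of sentences. This is not the donated code mentioned in the issue; `pack_sentences` and its parameters are hypothetical, and the sketch assumes length is measured in characters.

```ruby
# Hypothetical helper, not part of pragmatic_segmenter: greedily packs
# consecutive sentences into chunks of at most `max_chars` characters.
# A single sentence longer than `max_chars` is kept whole as its own chunk.
def pack_sentences(sentences, max_chars:, separator: ' ')
  chunks = []
  current = ''

  sentences.each do |sentence|
    candidate = current.empty? ? sentence : current + separator + sentence
    if candidate.length <= max_chars
      current = candidate
    else
      # The next sentence does not fit: close the current chunk, start a new one.
      chunks << current unless current.empty?
      current = sentence
    end
  end

  chunks << current unless current.empty?
  chunks
end

sentences = ['Hello world.', 'My name is Jonas.', 'What is your name?']
chunks = pack_sentences(sentences, max_chars: 30)
```

Packing whole sentences rather than cutting at a fixed offset keeps every chunk boundary on a sentence boundary.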
Currently pragmatic_segmenter returns an instance of `PragmaticSegmenter::Text`, which is a subclass of `String`. As pragmatic_tokenizer checks if `text.class == String` and also returning segmented objects of a different class than...
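The effect of an exact class comparison on a `String` subclass can be shown in plain Ruby. In this sketch `SegmentText` is a hypothetical stand-in for a class like `PragmaticSegmenter::Text`, not the gem's actual class.

```ruby
# Hypothetical String subclass, defined here only for illustration.
class SegmentText < String; end

segment = SegmentText.new('Hello world.')

# An exact class comparison rejects instances of subclasses...
exact_match = (segment.class == String)

# ...while is_a? accepts them.
kind_match = segment.is_a?(String)

# String#to_s called on a subclass instance returns a plain String.
plain = segment.to_s
```

A caller can therefore either relax the check to `is_a?(String)` or convert each segment with `to_s` before passing it on.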
I just realised that for an `array` of 1000 strings, each 50-300 characters long (URL titles and descriptions generated by [gottfrois/link_thumbnailer](https://github.com/gottfrois/link_thumbnailer)), the following causes a much higher memory load…...
Example:
```ruby
text = '"These should be two different sentences. Of course."'
s = PragmaticSegmenter::Segmenter.new(text: text)
s.segment
# RETURNS: ["\"These should be two different sentences. Of course.\""]
# SHOULD RETURN:...
```
This pull request solves the problems with catastrophic backtracking that are described in issue #78. Although the regular expression is now somewhat more restrictive, I think it will cover the...
The regular expression `NUMBERED_REFERENCE_REGEX` in `numbers.rb` is prone to catastrophic backtracking. This can happen if the text contains large numbers following the decimal point. For example, the following text will...
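The general failure mode can be illustrated with a deliberately simple pattern. Both regular expressions below are hypothetical and are not the gem's `NUMBERED_REFERENCE_REGEX`; the sketch only shows how nested quantifiers over the same characters invite backtracking and how an atomic group removes it.

```ruby
# Hypothetical pattern with nested quantifiers: when the overall match fails,
# a backtracking engine can retry many different ways of splitting one run
# of digits between the inner and outer `+`.
AMBIGUOUS = /\A(\d+)+\z/

# Same intent with an atomic group: once `(?>\d+)` has consumed a run of
# digits it never gives characters back, so a failing match fails quickly.
SAFE = /\A(?>\d+)+\z/

# Both accept the same well-formed input.
AMBIGUOUS.match?('12345')
SAFE.match?('12345')

# A long digit run followed by a non-digit is rejected by SAFE without
# retrying alternative splits.
SAFE.match?(('1' * 5000) + 'a')
```

Whether `AMBIGUOUS` is actually slow on such input depends on the Ruby version and its regex engine optimizations, so only `SAFE` is run against the long string here.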