pragmatic_segmenter Infinite Loop

Open censored-- opened this issue 5 years ago • 2 comments

Hi,

While using this great tool to preprocess Wikipedia dumps, I ran into an infinite loop that eventually failed with NoMemoryError.

Example:

When we input

'' (a '\0 !\0')

with the "en" option to pragmatic_segmenter, the line sub_4 = sub_characters(sub_3, '!', '&ᓴ&') at https://github.com/diasks2/pragmatic_segmenter/blob/master/lib/pragmatic_segmenter/punctuation_replacer.rb#L55 causes an infinite loop.

I'm wondering if we can solve this problem by escaping '\0' in the sub_characters function:

def sub_characters(string, char_a, char_b)
  sub = string.gsub(char_a, char_b).gsub('\\0', '\\\\\0')
  @text.gsub!(/#{Regexp.escape(string)}/, sub)
  sub
end
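
A minimal sketch of why the escaping matters, using hypothetical stand-in values rather than the library's real state: when the replacement passed to gsub is a String, Ruby treats \0 inside it as a backreference to the whole match, so the text grows and still contains the '!' that was supposed to be replaced. That may be what feeds the loop, though this has not been traced through the library itself.

```ruby
# Hypothetical stand-ins for @text and the arguments of sub_characters.
text   = 'a \0 !\0 b'
string = 'a \0 !\0'               # contains a literal backslash followed by 0
sub    = string.gsub('!', '&ᓴ&')  # => 'a \0 &ᓴ&\0'

pattern = /#{Regexp.escape(string)}/

# Unescaped: every \0 in the replacement expands to the whole match,
# so the original '!' is copied back into the result.
unescaped = text.gsub(pattern, sub)

# Escaped as proposed: the backslash is doubled, so \0 is inserted literally.
escaped_sub = sub.gsub('\\0', '\\\\\0')
escaped     = text.gsub(pattern, escaped_sub)

puts unescaped
puts escaped
```

With these inputs the escaped variant yields the intended text, while the unescaped variant is longer than the input and still includes '!'.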

Thanks!

Apr 10 '19 09:04 censored--

We have this same problem, though I haven't managed to figure out the character sequence that is causing it. I'll try doing your gsub and see if it fixes it.

May 03 '19 17:05 wflanagan

We've encountered this problem as well. This can be fixed by replacing:

@text.gsub!(/#{Regexp.escape(string)}/, sub)

with:

@text.gsub!(string, sub)

There is no need to use a regexp since we want an exact match. I would submit a PR, but this package seems unmaintained, judging by the age and seriousness of the issues.
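
A hedged sketch comparing the string-pattern form with the block form of gsub, again with hypothetical stand-in values. As far as I know, Ruby still expands \0 in a String replacement even when the pattern is a plain String, whereas the return value of a block is inserted verbatim, so the block form may be worth considering if the replacement can contain backslashes.

```ruby
# Hypothetical stand-ins for @text, string, and sub.
text   = 'a \0 !\0 b'
string = 'a \0 !\0'
sub    = 'a \0 &ᓴ&\0'

# String pattern: matches exactly, but \0 in the replacement is still
# treated as a backreference to the matched text.
with_string_pattern = text.gsub(string, sub)

# Block form: the block's return value is used as-is, with no
# backreference interpretation.
with_block = text.gsub(string) { sub }

puts with_string_pattern
puts with_block
```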

May 10 '24 16:05 dbourget
