pragmatic_segmenter
Infinite Loop
Hi,
When I used this great tool to preprocess Wikipedia dumps, I ran into an infinite loop that eventually failed with NoMemoryError.
Example:
When we input
'' (a '\0 !\0')
to pragmatic_segmenter with the "en" option, the call
sub_4 = sub_characters(sub_3, '!', '&ᓴ&')
at https://github.com/diasks2/pragmatic_segmenter/blob/master/lib/pragmatic_segmenter/punctuation_replacer.rb#L55
causes an infinite loop.
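A minimal sketch of what seems to go wrong, using hypothetical stand-in values rather than the library's real internal state: in a Ruby replacement string, a literal backslash followed by 0 is read as a backreference to the whole match, so the original text (including the '!') is pasted back in and the string grows instead of being cleaned.

```ruby
# Hypothetical stand-ins for the values seen inside sub_characters.
string = 'a \0 !\0'                # single quotes: literal backslash + 0
text   = "(#{string})"             # plays the role of @text
sub    = string.gsub('!', '&ᓴ&')   # 'a \0 &ᓴ&\0'

# \0 inside the replacement expands to the whole match, so the
# original string (with its '!') is reinserted twice.
result = text.gsub(/#{Regexp.escape(string)}/, sub)

puts result               # (a a \0 !\0 &ᓴ&a \0 !\0)
puts result.count('!')    # 2
```

If the surrounding code keeps processing until no '!' is left, which is an assumption about the caller, the text would keep growing on every pass, which would be consistent with the NoMemoryError.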
I'm wondering whether we can solve this problem by escaping '\0' in the sub_characters function:
def sub_characters(string, char_a, char_b)
  sub = string.gsub(char_a, char_b).gsub('\\0', '\\\\\0')
  @text.gsub!(/#{Regexp.escape(string)}/, sub)
  sub
end
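A slightly more general variant of the same idea, as a sketch only: escape every backslash in the replacement, so that \1, \k<name> and similar sequences are neutralised along with \0. The variable names below are hypothetical stand-ins, not the library's code.

```ruby
# Hypothetical stand-ins for the values seen inside sub_characters.
string = 'a \0 !\0'
text   = "(#{string})"
sub    = string.gsub('!', '&ᓴ&')

# Double every backslash. The block form is used here so the two
# backslashes are inserted verbatim rather than parsed again.
escaped = sub.gsub('\\') { '\\\\' }

# In the replacement, each doubled backslash now yields one literal backslash.
result = text.gsub(/#{Regexp.escape(string)}/, escaped)

puts result   # (a \0 &ᓴ&\0)
```

Note that only the copy used as the replacement is escaped here. The unescaped sub is what would be returned to the caller, so that it still matches what ends up in the text.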
Thanks!
We have this same problem, though I haven't managed to figure out the character sequence that is causing it. I'll try doing your gsub and see if it fixes it.
We've encountered this problem as well. This can be fixed by replacing:
@text.gsub!(/#{Regexp.escape(string)}/, sub)
with:
@text.gsub!(string, sub)
There is no need to use a regexp since we want an exact match. I would submit a PR, but this package seems unmaintained, judging by the age and seriousness of the issues.
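One caveat worth checking before relying on the String-pattern version: Ruby's gsub still expands \0 in a String replacement even when the pattern is a plain String, so an input containing a literal backslash + 0 may still misbehave. A minimal sketch with hypothetical stand-in values, comparing it with the block form of gsub, which inserts the block's return value verbatim:

```ruby
# Hypothetical stand-ins for @text, string and sub in sub_characters.
string = 'a \0 !\0'
text   = "(#{string})"
sub    = string.gsub('!', '&ᓴ&')

# String pattern + String replacement: \0 in sub is still a backreference.
with_string = text.gsub(string, sub)

# String pattern + block: no backreference processing on the result.
with_block = text.gsub(string) { sub }

puts with_string.include?('!')   # true
puts with_block                  # (a \0 &ᓴ&\0)
```

Under these assumptions, @text.gsub!(string) { sub } would be the drop-in form of the one-line change above.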