PragmaticSegmenterNet
Infinite loop for certain text
Certain text causes the segmenter to enter an infinite loop:
var text = "o idioma estao errados.000000000000000000000000000000000000000000000000000000000000000000000000";
Segmenter.Segment(text);
I am not sure what this step is supposed to do, but the problem seems to be in ReferenceSeparator.SeparateReferences.
There, the regex ReferenceRegex is run against the text, and this particular input appears to be a fatal combination for it.
Specifically, the sequence of a character, a '.', and then a long run of digits does not seem to be processed well by ReferenceRegex.
When testing the regex at https://regexr.com, if I type c.#### and keep adding digits, the execution time gets slower and slower until it eventually times out at 250ms. This looks like catastrophic backtracking, so for very long runs of digits I would expect the execution time to grow exponentially.
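The slowdown described above can be reproduced and contained in .NET by giving a regex a match timeout. The following is only a minimal sketch: the pattern `^(\d+)+$` is a stand-in with nested quantifiers chosen for illustration, not the library's ReferenceRegex, and the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BacktrackingDemo
{
    // Stand-in pattern with nested quantifiers (an assumption for illustration;
    // this is NOT the ReferenceRegex used by PragmaticSegmenterNet).
    // The third constructor argument is a match timeout, mirroring the
    // 250ms limit mentioned for regexr.com.
    private static readonly Regex Nested = new Regex(
        @"^(\d+)+$", RegexOptions.None, TimeSpan.FromMilliseconds(250));

    // Returns true if matching the input exceeded the timeout.
    public static bool TimesOut(string input)
    {
        try
        {
            Nested.IsMatch(input);
            return false;
        }
        catch (RegexMatchTimeoutException)
        {
            // Thrown by the regex engine once the timeout elapses.
            return true;
        }
    }

    public static void Main()
    {
        // A long run of digits followed by a character that makes the match fail
        // forces the engine to try many ways of splitting the digits.
        var input = new string('0', 40) + "x";
        Console.WriteLine(TimesOut(input));
    }
}
```

A timeout does not fix the pattern itself, but it would turn an apparently endless call into an exception the caller can handle.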
I am not sure how to test the original Ruby version, but since it uses the exact same regex it likely has the same issue: https://github.com/diasks2/pragmatic_segmenter/blob/1ade491c81f9d1d7fb3abd4c1e2e266fa5b34c42/lib/pragmatic_segmenter/languages/common/numbers.rb#L50
I am not sure if this is a good solution, or even "correct" in general, but it does solve my problem: before segmenting, I insert a space after any '.' that is directly followed by a digit.
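// Requires: using System.Linq; using System.Text.RegularExpressions;

// Matches any character, then a literal '.', then a digit (e.g. "s.0").
private static readonly Regex _numericSeparator = new Regex(@"(.\.\d)", RegexOptions.Compiled);

private string PreprocessText(string text)
{
    var matches = _numericSeparator.Matches(text);

    // Collect the distinct matched substrings.
    var groups = matches
        .AsEnumerable()
        .SelectMany(x => x.Groups.Values)
        .Select(x => x.Value)
        .Distinct();

    foreach (var group in groups)
    {
        // Insert a space after the '.', so "s.0" becomes "s. 0".
        var replacement = group
            .Replace(".", ". ");
        text = text.Replace(group, replacement);
    }

    return text;
}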
The suggested fix may not work in the general case. For example, "0.5 ml of milk" will be pre-processed to "0. 5 ml of milk", and further segmentation may then separate the "0." from "5 ml of milk".
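One possible way to keep decimals such as "0.5" intact is to insert the space only when the character before the '.' is not a digit. This is a rough sketch under that assumption: the class name and the lookaround pattern are hypothetical, it has not been tested against PragmaticSegmenterNet, and it would still alter inputs like "v.2".

```csharp
using System;
using System.Text.RegularExpressions;

public static class NumericPreprocessor
{
    // A '.' preceded by a character that is neither a digit nor whitespace,
    // and followed by a digit. Lookarounds keep the neighbours out of the match,
    // so only the '.' itself is replaced.
    private static readonly Regex WordDotDigit =
        new Regex(@"(?<=[^\d\s])\.(?=\d)", RegexOptions.Compiled);

    // Inserts a space after the matched '.', leaving decimals like "0.5" untouched.
    public static string PreprocessText(string text)
    {
        return WordDotDigit.Replace(text, ". ");
    }

    public static void Main()
    {
        Console.WriteLine(PreprocessText("o idioma estao errados.0000"));
        Console.WriteLine(PreprocessText("0.5 ml of milk"));
    }
}
```

Whether this is enough to avoid the slow path in ReferenceRegex for every input is an open question; it only addresses the "word.digits" shape from the report.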