
Infinite loop for certain text

Open Telavian opened this issue 2 years ago • 4 comments

Certain text causes the segmenter to enter into an infinite loop.

var text = "o idioma estao errados.000000000000000000000000000000000000000000000000000000000000000000000000";
Segmenter.Segment(text);

Telavian avatar Apr 26 '22 17:04 Telavian

I am not sure what this is supposed to do; however, the problem seems to be in ReferenceSeparator.SeparateReferences.

The regex ReferenceRegex is run against the text, and the two seem to be a fatal combination.

Telavian avatar Apr 26 '22 18:04 Telavian

It seems that the sequence of a character, a '.', and a number is not handled well by the regex ReferenceRegex.

When testing the regex at https://regexr.com, if I type c.#### and then keep typing more digits, the execution time gets slower and slower until it eventually times out at 250 ms. Therefore, for very long runs of digits I would expect the execution time to be exponentially long.

I am not sure how to test the original Ruby version; however, since it uses the exact same regex, it likely has the same issue. https://github.com/diasks2/pragmatic_segmenter/blob/1ade491c81f9d1d7fb3abd4c1e2e266fa5b34c42/lib/pragmatic_segmenter/languages/common/numbers.rb#L50
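To reproduce this kind of slowdown outside regexr, a match timeout can be used so the call fails fast instead of hanging. This is only a minimal sketch: the pattern below is a stand-in with nested quantifiers, not the library's ReferenceRegex, and the names BacktrackingDemo, StandInPattern, and TimesOut are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BacktrackingDemo
{
    // Assumption: a stand-in pattern that backtracks heavily on long digit runs.
    // It is NOT the ReferenceRegex used by the segmenter.
    public const string StandInPattern = @"^(\d+)+$";

    // Returns true if the match attempt was aborted by the timeout,
    // false if it finished (matched or not) within the allowed time.
    public static bool TimesOut(string input, TimeSpan timeout)
    {
        var regex = new Regex(StandInPattern, RegexOptions.None, timeout);
        try
        {
            regex.IsMatch(input);
            return false;
        }
        catch (RegexMatchTimeoutException)
        {
            // The engine gave up after exceeding the timeout.
            return true;
        }
    }
}
```

Calling BacktrackingDemo.TimesOut(new string('0', 40) + "x", TimeSpan.FromMilliseconds(250)) should hit the timeout, because the trailing "x" forces the engine to try every way of splitting the digit run before it can fail. Swapping in the real pattern would show whether it behaves the same way.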

Telavian avatar Apr 26 '22 18:04 Telavian

I am not sure if this is a good solution, or even "correct" in general; however, it does solve my problem.

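using System.Linq;
using System.Text.RegularExpressions;

// Matches any character, followed by a literal '.', followed by a digit (e.g. "s.0").
private static readonly Regex _numericSeparator = new Regex(@"(.\.\d)", RegexOptions.Compiled);

private string PreprocessText(string text)
{
    var matches = _numericSeparator.Matches(text);

    // Collect the distinct matched substrings (group 0 and group 1 hold the same value).
    var groups = matches
        .AsEnumerable()
        .SelectMany(x => x.Groups.Values)
        .Select(x => x.Value)
        .Distinct();

    foreach (var group in groups)
    {
        // Insert a space after the dot so the digits no longer directly follow it.
        var replacement = group
            .Replace(".", ". ");

        text = text.Replace(group, replacement);
    }

    return text;
}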

Telavian avatar Apr 26 '22 18:04 Telavian

The suggested fix may not work in the general case. For example, "0.5 ml of milk" will be pre-processed to "0. 5 ml of milk". Further segmentation may then separate the "0." from "5 ml of milk".
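The decimal case can be checked with a small sketch. PreprocessBroad is a simplified single-pass version of the workaround posted earlier (same pattern, but not identical to the original loop in every edge case). PreprocessNarrow is a hypothetical variant that only touches a dot sitting between a letter and a digit. It has not been tested against the segmenter, so whether it still avoids the slow regex path is an assumption.

```csharp
using System.Text.RegularExpressions;

public static class NumericSeparatorSketch
{
    // Same pattern as the posted workaround: any character, a dot, a digit.
    private static readonly Regex Broad = new Regex(@".\.\d");

    // Hypothetical narrower pattern: a dot preceded by a letter and followed by a digit.
    private static readonly Regex Narrow = new Regex(@"(?<=\p{L})\.(?=\d)");

    // Inserts a space after the dot in every match, including decimals like "0.5".
    public static string PreprocessBroad(string text) =>
        Broad.Replace(text, m => m.Value.Replace(".", ". "));

    // Inserts a space only when the dot follows a letter, leaving "0.5" untouched.
    public static string PreprocessNarrow(string text) =>
        Narrow.Replace(text, ". ");
}
```

With this sketch, PreprocessBroad("0.5 ml of milk") gives "0. 5 ml of milk", while PreprocessNarrow leaves that string unchanged and still turns "errados.000" into "errados. 000". Inputs such as "v.2" would still be split by the narrow variant, so it is not a general fix either.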

AndrewLamWARC avatar Feb 03 '24 10:02 AndrewLamWARC