dk.brics.automaton Unexpected poor performance with sub string matching in AutomatonMatcher

Unexpected poor performance with sub string matching in AutomatonMatcher

Open turf00 opened this issue 1 year ago • 4 comments

My expectation with Brics is that for the following scenario:

When matching a sub-string inside an input String with the AutomatonMatcher via a RunAutomaton
Regex starts with any character with repetition i.e. .*
Input string does not match at all

That the runtime of matching would be equivalent to the number of characters in the input string, i.e. each char would be interrogated once as part of the matching.

E.g.

  public boolean find(final String input) {
    // automaton is of type RunAutomaton built using tabelize=true
    final AutomatonMatcher matcher = automaton.newMatcher(input);
    return matcher.find();
  }

However, this does not appear to be the case at all. The implementation does some form of "backtracking" in the AutomatonMatcher#find() method. Is this a known problem or just expected with the ability to match and return the position of matching.

https://github.com/cs-au-dk/dk.brics.automaton/blob/master/src/dk/brics/automaton/AutomatonMatcher.java#L94

   public boolean find() {
...
    int l = getChars().length();
		while (begin < l) {
			int p = automaton.getInitialState();
                         // inner loop
                         // will "backtrack per char" if the matching eventually fails
			for (int i = begin; i < l; i++) {
				final int new_state = automaton.step(p, getChars().charAt(i));
				if (new_state == -1) {
				    break;
				} else if (automaton.isAccept(new_state)) {
				    // found a match from begin to (i+1)
				    match_start = begin;
				    match_end=(i+1);
				}
				p = new_state;
			}
			if (match_start != -1) {
				setMatch(match_start, match_end);
				return true;
			}
			begin += 1;
		}
		if (match_start != -1) {
			setMatch(match_start, match_end);
			return true;
		} else {
			setMatch(-2, -2);
			return false;
		}
    }

You see that if a match occurs at position i, then we will continue until the match fails and then restart from position i+1.

Adding my own version of this class, with extra counting and logging I can see the following. We initially noticed this due to "poor" performance in relation to testing some inputs and patterns.

In essence we simply count how often automaton.step is checked.

For a test with:

Input string length = 100,000 chars
Input will not match
Pattern = .*(rhino[£$%])+

Then we see, calls to automaton.step = 5000050000

Which are way more than my expectation of 100k would be?

I was able to approach the performance I would expect by hacking a version of the RunAutomaton.run method that returns early if it reaches an accept state and giving any regex a wildcard match at the start. Obviously that doesn't handle position checking.

Please don't get me wrong though this library is a great tool and we appreciate all the work that went into it.

You might argue that the example is a little dumb as it has .* at the start and we are doing a match but I believe the same problem would also occur with other regexs that use alternations with repetition not at the start of the regex depending on the input string form.

I'm sure I am missing something and will welcome any assistance.

Mar 23 '23 17:03 turf00

dk.brics.automaton dk.brics.automaton copied to clipboard

Unexpected poor performance with sub string matching in AutomatonMatcher

dk.brics.automaton
dk.brics.automaton copied to clipboard