dk.brics.automaton
dk.brics.automaton copied to clipboard
Unexpected poor performance with sub string matching in AutomatonMatcher
My expectation with Brics is that for the following scenario:
- When matching a sub-string inside an input String with the
AutomatonMatcher
via aRunAutomaton
- Regex starts with any character with repetition i.e.
.*
- Input string does not match at all
That the runtime of matching would be equivalent to the number of characters in the input string, i.e. each char would be interrogated once as part of the matching.
E.g.
public boolean find(final String input) {
// automaton is of type RunAutomaton built using tabelize=true
final AutomatonMatcher matcher = automaton.newMatcher(input);
return matcher.find();
}
However, this does not appear to be the case at all. The implementation does some form of "backtracking" in the AutomatonMatcher#find()
method. Is this a known problem or just expected with the ability to match and return the position of matching.
https://github.com/cs-au-dk/dk.brics.automaton/blob/master/src/dk/brics/automaton/AutomatonMatcher.java#L94
public boolean find() {
...
int l = getChars().length();
while (begin < l) {
int p = automaton.getInitialState();
// inner loop
// will "backtrack per char" if the matching eventually fails
for (int i = begin; i < l; i++) {
final int new_state = automaton.step(p, getChars().charAt(i));
if (new_state == -1) {
break;
} else if (automaton.isAccept(new_state)) {
// found a match from begin to (i+1)
match_start = begin;
match_end=(i+1);
}
p = new_state;
}
if (match_start != -1) {
setMatch(match_start, match_end);
return true;
}
begin += 1;
}
if (match_start != -1) {
setMatch(match_start, match_end);
return true;
} else {
setMatch(-2, -2);
return false;
}
}
You see that if a match occurs at position i
, then we will continue until the match fails and then restart from position i+1
.
Adding my own version of this class, with extra counting and logging I can see the following. We initially noticed this due to "poor" performance in relation to testing some inputs and patterns.
In essence we simply count how often automaton.step
is checked.
For a test with:
- Input string length = 100,000 chars
- Input will not match
- Pattern =
.*(rhino[£$%])+
Then we see, calls to automaton.step = 5000050000
Which are way more than my expectation of 100k would be?
I was able to approach the performance I would expect by hacking a version of the RunAutomaton.run
method that returns early if it reaches an accept state and giving any regex a wildcard match at the start. Obviously that doesn't handle position checking.
Please don't get me wrong though this library is a great tool and we appreciate all the work that went into it.
You might argue that the example is a little dumb as it has .*
at the start and we are doing a match but I believe the same problem would also occur with other regexs that use alternations with repetition not at the start of the regex depending on the input string form.
I'm sure I am missing something and will welcome any assistance.