StackOverflowError from pattern input

Open spacether opened this issue 5 years ago • 2 comments

Thank you so much for making this tool! When testing it, I ran into this case that causes a stack overflow:

String pattern = "^[a-zA-Z\\s]*$";
Generex generex = new Generex(pattern);
String firstMatch = generex.getFirstMatch();

And when that code is run I get this exception:

Exception in thread "main" java.lang.StackOverflowError
	at java.base/java.util.HashMap$HashIterator.<init>(HashMap.java:1475)
	at java.base/java.util.HashMap$KeyIterator.<init>(HashMap.java:1514)
	at java.base/java.util.HashMap$KeySet.iterator(HashMap.java:912)
	at java.base/java.util.HashSet.iterator(HashSet.java:173)
	at java.base/java.util.AbstractCollection.toArray(AbstractCollection.java:184)
	at dk.brics.automaton.State.getSortedTransitionArray(Unknown Source)
	at dk.brics.automaton.State.getSortedTransitions(Unknown Source)
	at com.mifmif.common.regex.Generex.prepareTransactionNodes(Generex.java:265)

Could this be fixed?

spacether, Aug 11 '20 02:08

getFirstMatch() JavaDoc:

first string in lexicographical order that is matched by the given pattern.

That means it tries to sort all of the generated matches. Your regex matches infinitely many strings (it contains the Kleene star), so of course you get a StackOverflowError. The solution is to use the lazy iterator:

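import com.mifmif.common.regex.Generex;
// Generex's own Iterator type (not java.util.Iterator); its next() returns String.
import com.mifmif.common.regex.util.Iterator;

String pattern = "^[a-zA-Z\\s]*$";
Generex generex = new Generex(pattern);
// Lazy iteration: matches are produced one at a time instead of all at once.
final Iterator matchesGenerator = generex.iterator();
if (matchesGenerator.hasNext()) {
	String firstMatch = matchesGenerator.next();
	System.out.println(firstMatch); // ^\t$
}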

The output is probably not what you wanted, however, because ^ and $ are not special characters in the grammar Generex uses (https://www.brics.dk/automaton/doc/index.html?dk/brics/automaton/RegExp.html). Omitting the anchors should make no difference; I think the regex matches the whole string anyway. As you can see, the grammar is not identical to Java regexes, but it is close enough. It does have some additional special characters, though: "@~&<#. They are marked optional in the underlying Automaton library, but Generex sadly enables all of them by default. In my fork I added the option to turn them off with the NONE flag; you could clone and install my devel branch if you want to try it.
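Since the anchors end up as literal characters, one workaround is to strip them before handing the pattern to Generex. Below is a minimal sketch of that idea. `AnchorStripper` and `stripAnchors` are hypothetical names, not part of Generex or dk.brics.automaton. The helper only handles a single leading ^ and a single unescaped trailing $, so patterns with `\Q...\E` quoting or with anchors inside alternations would need more care. The `main` method uses only `java.util.regex` to check that the anchored and stripped forms agree when the whole input must match.

```java
import java.util.regex.Pattern;

/** Hypothetical helper; not part of Generex or dk.brics.automaton. */
final class AnchorStripper {

    /** Removes one leading '^' and one unescaped trailing '$', if present. */
    static String stripAnchors(String regex) {
        String result = regex;
        if (result.startsWith("^")) {
            result = result.substring(1);
        }
        if (result.endsWith("$") && !isEscaped(result, result.length() - 1)) {
            result = result.substring(0, result.length() - 1);
        }
        return result;
    }

    /** True if the character at index is preceded by an odd number of backslashes. */
    private static boolean isEscaped(String s, int index) {
        int backslashes = 0;
        for (int i = index - 1; i >= 0 && s.charAt(i) == '\\'; i--) {
            backslashes++;
        }
        return backslashes % 2 == 1;
    }

    public static void main(String[] args) {
        String anchored = "^[a-zA-Z\\s]*$";
        String bare = stripAnchors(anchored);
        System.out.println(bare); // prints [a-zA-Z\s]*

        // Pattern.matches requires the entire input to match, so in
        // java.util.regex the anchored and bare patterns give the same answer.
        for (String input : new String[] {"", "Hello World", "abc123"}) {
            boolean same = Pattern.matches(anchored, input) == Pattern.matches(bare, input);
            System.out.println(input + " -> same result: " + same);
        }
    }
}
```

The stripped pattern can then be passed to `new Generex(...)` in place of the anchored one.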

HawkSK, Aug 16 '20 18:08

Thank you for explaining what's happening with this input. I understand that this is caused by the infinite regex. In my opinion an infinite regex should still have a deterministic first match and should not cause a stack overflow. For my use case, using this different package better meets my needs.
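To illustrate the point about a deterministic first match, here is a minimal sketch of a non-recursive search. It assumes "first" means shortest, with ties broken by character order. Under pure lexicographic order some infinite languages have no smallest element (for a*b, each extra leading "a" gives a smaller string), so that assumption matters. The sketch runs a breadth-first search with an explicit queue over a hand-built toy DFA. `FirstMatchSketch` and its map-based DFA encoding are hypothetical and are not the Generex or dk.brics.automaton API.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Queue;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical toy DFA search; not the Generex or dk.brics.automaton API. */
final class FirstMatchSketch {

    /**
     * transitions.get(state) maps an input character to the next state.
     * TreeMap keeps the characters sorted, so smaller characters are tried first.
     */
    static Optional<String> firstMatch(Map<Integer, TreeMap<Character, Integer>> transitions,
                                       int start, Set<Integer> accepting) {
        // Each queue entry pairs a state with the string that first reached it.
        Queue<Map.Entry<Integer, String>> queue = new ArrayDeque<>();
        Set<Integer> visited = new HashSet<>();
        queue.add(Map.entry(start, ""));
        visited.add(start);

        while (!queue.isEmpty()) {
            Map.Entry<Integer, String> current = queue.remove();
            if (accepting.contains(current.getKey())) {
                return Optional.of(current.getValue());
            }
            TreeMap<Character, Integer> outgoing =
                    transitions.getOrDefault(current.getKey(), new TreeMap<>());
            for (Map.Entry<Character, Integer> t : outgoing.entrySet()) {
                // Visiting each state once bounds the work, even when the DFA has cycles.
                if (visited.add(t.getValue())) {
                    queue.add(Map.entry(t.getValue(), current.getValue() + t.getKey()));
                }
            }
        }
        return Optional.empty(); // no accepting state is reachable
    }

    public static void main(String[] args) {
        // Toy DFA for (a|b)*: a single accepting state that loops on 'a' and 'b'.
        TreeMap<Character, Integer> loop = new TreeMap<>();
        loop.put('a', 0);
        loop.put('b', 0);
        Map<Integer, TreeMap<Character, Integer>> star = Map.of(0, loop);
        System.out.println("[" + firstMatch(star, 0, Set.of(0)).orElse("<none>") + "]"); // prints []

        // Toy DFA for a*b: state 0 loops on 'a'; 'b' leads to accepting state 1.
        TreeMap<Character, Integer> fromZero = new TreeMap<>();
        fromZero.put('a', 0);
        fromZero.put('b', 1);
        Map<Integer, TreeMap<Character, Integer>> aStarB = Map.of(0, fromZero);
        System.out.println(firstMatch(aStarB, 0, Set.of(1)).orElse("<none>")); // prints b
    }
}
```

Because the search uses a queue instead of recursion and visits each state at most once, it terminates on cyclic automata without growing the call stack.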

spacether, Aug 16 '20 20:08