lucene icon indicating copy to clipboard operation
lucene copied to clipboard

LUCENE-10474: Avoid throwing StackOverflowError when creating RegExp

Open ywelsch opened this issue 3 years ago • 7 comments

Creating a regular expression using the RegExp class can easily result in a StackOverflowError being thrown, for example when the input is larger than the maximum stack depth. Throwing a StackOverflowError isn't something a user would expect, and it isn't documented either. StackOverflowError is a user-unfriendly exception as it does not convey any intent that the user has done something wrong, but suggests a bug in the implementation.

Instead of letting StackOverflowError bubble up, we now throw an IllegalArgumentException instead to clearly mark this as an input that the implementation can't handle.

#11510

ywelsch avatar Mar 18 '22 14:03 ywelsch

As a library, we should throw the correct exception type, we shouldn't change it for fun. It is not correct to assume that this can only happen as result of union either.

rmuir avatar Mar 18 '22 14:03 rmuir

As a library, we should throw the correct exception type, we shouldn't change it for fun. It is not correct to assume that this can only happen as result of union either.

I'm not sure what you're saying:

  • This is not changing it for fun. There is a proper explanation here you chose to ignore. Further, as outlined in the corresponding Lucene issue (https://issues.apache.org/jira/browse/LUCENE-10474) this patch follows the approach taken by the JDK to provide sensible exception behavior to users.
  • The patch does not assume it is only happening as result of union (the try / catch is in the RegExp constructor, not in the code that parses unions).

ywelsch avatar Mar 18 '22 14:03 ywelsch

Hmm maybe we could we preserve the full StackOverflowException as the cause in the newly thrown IllegalArgumentException? I don't like losing/suppressing that information from the caller.

mikemccand avatar Mar 18 '22 15:03 mikemccand

I'm still -1 for the change. If you hit StackOverFlowError, really you should let the VM exit. There are no guarantees at this point.

I don't care what OpenJDK does here, it is irrelevant to our situation. Because they have "special" mechanisms (annotations) available to them that we don't to provide more guarantees: See https://openjdk.java.net/jeps/270

rmuir avatar Mar 18 '22 15:03 rmuir

I agree with @rmuir - we should not be catching Error. The VM had to unwind the stack and who knows where we are now. If we could somehow detect the problem before it gets to that, then throwing IAE would make sense.

msokolov avatar Mar 18 '22 15:03 msokolov

I'm still -1 for the change. If you hit StackOverFlowError, really you should let the VM exit. There are no guarantees at this point.

That kind of argument makes it even more compelling for StackOverFlowErrors to be avoided in the first place by safeguarding Lucene's RegExp implementation or, if deemed technically too complex, putting a big fat banner on the RegExp class that it's unsafe to use for large inputs.

I would be happy to hear everyone's thoughts here on alternative solutions. For example, how would you feel about computing and passing the stack depth through the parseXYZ methods, and aborting computation at a user-configurable limit (set to 500 for example in default constructor)?

ywelsch avatar Mar 21 '22 14:03 ywelsch

If we can find a clean way to detect imminent stack overflow and throw an exc, that would be great. Maybe a member variable on RegExp would be less intrusive than adding parameters. My one concern is I'm unclear on how this class is maintained -- is it generated code? Maybe it was once generated, and is now manually updated?

msokolov avatar Mar 21 '22 16:03 msokolov

This was addressed in https://github.com/apache/lucene/pull/12462.

jpountz avatar Sep 06 '23 09:09 jpountz