spark-rapids
[FEA] Validate the size/complexity of regular expressions
Is your feature request related to a problem? Please describe. Once https://github.com/NVIDIA/spark-rapids/pull/4044 is merged, we have the ability to parse regular expressions. We could analyze the parsed expression and estimate the complexity/cost of GPU kernels and fall back to CPU if the GPU cost is high.
Describe the solution you'd like See https://github.com/NVIDIA/spark-rapids/pull/4044#discussion_r745677785 for more details.
Describe alternatives you've considered None
Additional context None
Note that if we end up needing to increase the default reserve memory to better handle complex regex kernels being launched with large thread stack space requirements, this may impact the use-cases that prompted lowering the default reserve recently. cc: @rongou
The main source of OOM errors for regex kernels is a stack allocation of 11 bytes per instruction per string at kernel launch. Kernels are classified as "small" for 1-10 instructions, "medium" for 11-100 instructions, and "large" for 101-1000 instructions.
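To illustrate that arithmetic, here is a small sketch. The 11-bytes-per-instruction figure and the small/medium/large buckets come from the comment above; the function names are ours:

```python
def regex_stack_bytes(num_instructions: int, num_strings: int) -> int:
    """Estimate the kernel-launch stack allocation:
    11 bytes per instruction per string."""
    return 11 * num_instructions * num_strings

def kernel_size_bucket(num_instructions: int) -> str:
    """Map an instruction count to the small/medium/large kernel tiers
    described above (1-10, 11-100, 101-1000 instructions)."""
    if num_instructions <= 10:
        return "small"
    if num_instructions <= 100:
        return "medium"
    return "large"

# e.g. a 50-instruction ("medium") regex over a 1M-string batch needs
# roughly 550 MB of stack space at launch
print(regex_stack_bytes(50, 1_000_000))  # 550000000
print(kernel_size_bucket(50))            # medium
```

This makes it easy to see why complex patterns over large batches can exhaust the default reserve memory.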
We need to clarify exactly what "instructions" means and how it maps to the RegexAST that we have in the RAPIDS accelerator, but here is one example of GPU regexp code showing some instructions.
```
Flags = 0x00000000
Instructions:
  0: CHAR c='3', next=2
  1: OR right=0, left=2, next=2
  2: RBRA id=1, next=4
  3: LBRA id=1, next=1
  4: OR right=3, left=5, next=5
  5: END
startinst_id=3
startinst_ids: [ 3 -1]
Classes 0
Number of capturing groups: 1
```
Also, see cuDF cpp/src/strings/regex/regcomp.h:
```cpp
enum InstType {
  CHAR    = 0177, // Literal character
  RBRA    = 0201, // Right bracket, )
  LBRA    = 0202, // Left bracket, (
  OR      = 0204, // Alternation, |
  ANY     = 0300, // Any character except newline, .
  ANYNL   = 0301, // Any character including newline, .
  BOL     = 0303, // Beginning of line, ^
  EOL     = 0304, // End of line, $
  CCLASS  = 0305, // Character class, []
  NCCLASS = 0306, // Negated character class, [^]
  BOW     = 0307, // Boundary of word, \b
  NBOW    = 0310, // Not boundary of word, \B
  END     = 0377  // Terminate: match found
};
```
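To sketch how a parsed pattern might map onto such an instruction program, here is a toy AST with per-node instruction costs. The node kinds and costs are illustrative assumptions only, not the actual cuDF or spark-rapids `RegexAST` implementation:

```python
# Hypothetical sketch: estimate an instruction count for a toy regex AST.
# Per-node costs roughly mirror the InstType enum above (CHAR, CCLASS,
# LBRA/RBRA, OR, BOL/EOL); the real compiler may emit more or fewer.
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                      # "char", "class", "group", "or", "anchor"
    children: list = field(default_factory=list)

def count_instructions(node: Node) -> int:
    child_total = sum(count_instructions(c) for c in node.children)
    if node.kind == "char":
        return 1 + child_total     # CHAR
    if node.kind == "class":
        return 1 + child_total     # CCLASS / NCCLASS
    if node.kind == "group":
        return 2 + child_total     # LBRA + RBRA
    if node.kind == "or":
        return 1 + child_total     # OR; branches counted via children
    if node.kind == "anchor":
        return 1 + child_total     # BOL / EOL
    return child_total

# "(a|b)" -> LBRA, CHAR, OR, CHAR, RBRA = 5 instructions, plus the final END
ast = Node("group", [Node("or", [Node("char"), Node("char")])])
print(count_instructions(ast) + 1)  # 6
```

A walk like this over the `RegexAST` is one possible way to estimate kernel cost before launch.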
@revans2 @jlowe Would this feature still be required if cuDF global-memory use can be made to perform near the speed of stack memory? The memory would come from RMM instead of the CUDA stack in this case.
No, I don't think it would be necessary. If the regex kernels don't have an appreciably larger footprint than any other random libcudf kernel then I don't see the need to call it out explicitly.
Exactly. This is a workaround for memory problems when running regular expressions.
~~This PR rapidsai/cudf#10808 is probably useful for this issue~~. This PR has been closed due to the device requirement of the API in question.
Also, see this issue for the more desired functionality from cuDF: https://github.com/rapidsai/cudf/issues/10852
Here's a possible way to compute a first pass of this without a device:
- First, note that regular expressions are evaluated using a state machine.
- The `RegexAST` class contains individual instances that could be evaluated as a state in the state machine.
- One can use the existing parser/transpiler code to calculate how many states will potentially need to be evaluated in the final regular expression that will be sent to cuDF.
- Here's a rough example (there may be more states needed, but this topic needs more research):
  - `A$` will be transpiled to `A(?:[\n\r\u0085\u2028\u2029]|\r\n)?$`
  - evaluating the `A` char - 1 state
  - evaluating the group - no capture, so no state needed
  - first character class `[\n\r\u0085\u2028\u2029]` - 2 states
  - choice - 3 states
  - second character class `\r\n` - 4 states
  - end-of-string anchor `$` - 5 states
  - so each evaluation of this regexp needs 5 states, which can then be multiplied by the number of rows in the batch sent to the GPU: `numStates * numRows = complexity`
- If `complexity > maxComplexity`, then we will fall back to the CPU. `maxComplexity` can be user configurable, something like `spark.rapids.sql.regexp.maxComplexity` (and this can default to an arbitrarily high value).
- Keep in mind also that this can be evaluated recursively in the transpiler code, and the exception can be thrown in that code to fall back to the CPU.
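The proposed check can be sketched as follows. This is a minimal illustration of the `numStates * numRows` idea above; the function and exception names are ours, and the config key is the one suggested in the comment:

```python
# Sketch of the proposed complexity check. Raising an exception mirrors the
# suggestion that the transpiler can throw to trigger a CPU fallback; the
# limit would come from spark.rapids.sql.regexp.maxComplexity.
class RegexComplexityError(Exception):
    """Signals that this pattern/batch should fall back to the CPU."""

def check_complexity(num_states: int, num_rows: int,
                     max_complexity: int) -> int:
    complexity = num_states * num_rows
    if complexity > max_complexity:
        raise RegexComplexityError(
            f"regex complexity {complexity} exceeds limit {max_complexity}")
    return complexity

# The A$ example above needs 5 states; with a 1M-row batch:
print(check_complexity(5, 1_000_000, 10_000_000))  # 5000000
```

With a lower limit the same call would raise, and the caller could catch the exception and schedule the expression on the CPU instead.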
Fixed in https://github.com/NVIDIA/spark-rapids/pull/6006