spark-rapids icon indicating copy to clipboard operation
spark-rapids copied to clipboard

[FEA] Validate the size/complexity of regular expressions

Open andygrove opened this issue 4 years ago • 8 comments

Is your feature request related to a problem? Please describe. Once https://github.com/NVIDIA/spark-rapids/pull/4044 is merged, we have the ability to parse regular expressions. We could analyze the parsed expression and estimate the complexity/cost of GPU kernels and fall back to CPU if the GPU cost is high.

Describe the solution you'd like See https://github.com/NVIDIA/spark-rapids/pull/4044#discussion_r745677785 for more details.

Describe alternatives you've considered None

Additional context None

andygrove avatar Nov 09 '21 16:11 andygrove

Note that if we end up needing to increase the default reserve memory to better handle complex regex kernels being launched with large thread stack space requirements, this may impact the use-cases that prompted lowering the default reserve recently. cc: @rongou

jlowe avatar Nov 09 '21 22:11 jlowe

The main source of OOM errors for regex kernels comes from a stack allocation of 11 bytes per instruction per string at kernel launch. “Small” for 1-10 instructions, “medium” for 11-100 instructions and “large” for 101-1000 instructions.

We need to clarify exactly what "instructions" means and how it maps to the RegexAST that we have in the RAPIDS accelerator, but here is one example of GPU regexp code showing some instructions.

Flags = 0x00000000
Instructions:
  0:    CHAR c='3', next=2
  1:      OR right=0, left=2, next=2
  2:    RBRA id=1, next=4
  3:    LBRA id=1, next=1
  4:      OR right=3, left=5, next=5
  5:     END
startinst_id=3
startinst_ids: [ 3 -1]

Classes 0
Number of capturing groups: 1

Also, see cuDF cpp/src/strings/regex/regcomp.h:

enum InstType {
  CHAR    = 0177,  // Literal character
  RBRA    = 0201,  // Right bracket, )
  LBRA    = 0202,  // Left bracket, (
  OR      = 0204,  // Alternation, |
  ANY     = 0300,  // Any character except newline, .
  ANYNL   = 0301,  // Any character including newline, .
  BOL     = 0303,  // Beginning of line, ^
  EOL     = 0304,  // End of line, $
  CCLASS  = 0305,  // Character class, []
  NCCLASS = 0306,  // Negated character class, []
  BOW     = 0307,  // Boundary of word, /b
  NBOW    = 0310,  // Not boundary of word, /b
  END     = 0377   // Terminate: match found
};

andygrove avatar Jan 12 '22 18:01 andygrove

@revans2 @jlowe Would this feature still be required if cuDF global-memory use can be made to perform near the speed of stack memory? The memory would come from RMM instead of the CUDA stack in this case.

andygrove avatar Mar 17 '22 16:03 andygrove

No, I don't think it would be necessary. If the regex kernels don't have an appreciably larger footprint than any other random libcudf kernel then I don't see the need to call it out explicitly.

jlowe avatar Mar 17 '22 16:03 jlowe

No, I don't think it would be necessary. If the regex kernels don't have an appreciably larger footprint than any other random libcudf kernel then I don't see the need to call it out explicitly.

Exactly. This is a work around to issues with memory problems when running regular expressions.

revans2 avatar Mar 17 '22 16:03 revans2

~~This PR rapidsai/cudf#10808 is probably useful for this issue~~. This PR has been closed due to the device requirement of the API in question.

NVnavkumar avatar Jun 21 '22 21:06 NVnavkumar

Also, see this issue for the more desired functionality from cuDF: https://github.com/rapidsai/cudf/issues/10852

NVnavkumar avatar Jun 21 '22 23:06 NVnavkumar

Here's a possible way to compute a first pass of this without a device:

  • First, note that regular expressions are evaluated using a state machine.
  • The RegexAST class contains individual instances that could be evaluated as a state in the state machine, so
  • One can use the existing parser/transpiler code to calculate how many states will potentially be needed to be evaluated in the final regular expression that will be sent to cuDF.
  • Here's a rough example (there may be more states needed, but will need to research more on this topic)
    • A$ will be transpiled to A(?:[\n\r\u0085\u2028\u2029]|\r\n)?$
    • evaluating the A char - 1 state
    • evaluating the group - no capture, so no state needed
    • first character class [\n\r\u0085\u2028\u2029] - 2 states
    • choice - 3 states
    • second character class \r\n - 4 states
    • end of string anchor $ - 5 states
  • so each evaluation of this regexp needs 5 states, which can then be multiplied by the number of rows in the batch sent to the GPU. numStates * numRows = complexity
  • if complexity > maxComplexity, then we will fallback to the CPU
  • maxComplexity can be user configurable, something like spark.rapids.sql.regexp.maxComplexity (and this can default to an arbitrarily high value).
  • Keep in mind also that this can be evaluated recursively in the transpiler code, and the exception can be thrown in that code to fallback to the CPU.

NVnavkumar avatar Jul 08 '22 21:07 NVnavkumar

Fixed in https://github.com/NVIDIA/spark-rapids/pull/6006

andygrove avatar Aug 19 '22 17:08 andygrove