[FEA] Enable regular expressions by default

Open andygrove opened this issue 3 years ago • 3 comments

Is your feature request related to a problem? Please describe.

Regular expression support is currently disabled by default because of many known compatibility issues, which are documented in the compatibility guide. This epic tracks the work required to address these issues and enable the feature by default.

Completed

  • [x] https://github.com/NVIDIA/spark-rapids/issues/3797
  • [x] https://github.com/NVIDIA/spark-rapids/issues/3866
  • [x] https://github.com/NVIDIA/spark-rapids/issues/3962
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4001
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4002
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4091
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4170
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4229
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4284
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4467
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4412
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4521
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4503
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4559
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4330
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4475
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4409
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4003

High Priority

  • [x] https://github.com/NVIDIA/spark-rapids/issues/4487
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4557
  • [x] https://github.com/NVIDIA/spark-rapids/issues/5135
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4425
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4468
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4532
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4533
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4800
  • [x] https://github.com/NVIDIA/spark-rapids/issues/5549
  • [x] https://github.com/NVIDIA/spark-rapids/issues/5711
  • [x] https://github.com/NVIDIA/spark-rapids/issues/5521
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4719
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/4511

Medium Priority

  • [x] https://github.com/NVIDIA/spark-rapids/issues/4528
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4517
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4605
  • [x] https://github.com/NVIDIA/spark-rapids/issues/5456
  • [x] https://github.com/NVIDIA/spark-rapids/issues/5525
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/5488
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/5478
  • [x] https://github.com/NVIDIA/spark-rapids/issues/5659
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/5973
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/6469
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/10764

Low Priority

  • [x] https://github.com/NVIDIA/spark-rapids/issues/4486
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4505
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4746
  • [x] https://github.com/NVIDIA/spark-rapids/issues/5415
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4862
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4413
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4866
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4865
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4518
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4519
  • [x] https://github.com/NVIDIA/spark-rapids/issues/5909
  • [x] https://github.com/NVIDIA/spark-rapids/issues/5846
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4720
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4537
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4353
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4283
  • [x] https://github.com/NVIDIA/spark-rapids/issues/5656
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4603
  • [x] https://github.com/NVIDIA/spark-rapids/issues/4061
  • [ ] https://github.com/NVIDIA/spark-rapids/issues/4415

Describe the solution you'd like

Support the following regular expression functions and expressions by default, with 100% compatibility with Spark:

  • regexp / regexp_like / RLIKE
  • regexp_replace
  • regexp_extract
  • regexp_extract_all
  • split
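For readers less familiar with these functions, here is a rough sketch of what each one does. It is only an analogy: it uses Python's standard `re` module as a stand-in, whereas Spark's CPU implementation is built on Java's regex engine, and the helper names and default arguments below are chosen for illustration and are not part of the plugin. The patterns are deliberately simple, because the regex dialects diverge on more advanced syntax.

```python
import re

# Illustrative stand-ins only: Python's `re` is NOT the engine Spark uses,
# and these helpers are hypothetical, not spark-rapids or Spark APIs.

def rlike(s, pattern):
    # RLIKE-style check: true if the pattern matches anywhere in the string.
    return re.search(pattern, s) is not None

def regexp_replace(s, pattern, replacement):
    # Replace every match of the pattern with the replacement text.
    return re.sub(pattern, replacement, s)

def regexp_extract(s, pattern, idx=1):
    # Return group `idx` of the first match, or "" if nothing matched.
    m = re.search(pattern, s)
    if m is None or m.group(idx) is None:
        return ""
    return m.group(idx)

def regexp_extract_all(s, pattern, idx=1):
    # Return group `idx` of every match, in order of appearance.
    return [m.group(idx) or "" for m in re.finditer(pattern, s)]

def split(s, pattern):
    # Split the string around every match of the pattern.
    return re.split(pattern, s)

if __name__ == "__main__":
    print(rlike("abc123", r"\d+"))                           # True
    print(regexp_replace("100-200", r"\d+", "num"))          # num-num
    print(regexp_extract("100-200", r"(\d+)-(\d+)", 1))      # 100
    print(regexp_extract_all("100-200, 300-400", r"(\d+)-(\d+)", 1))  # ['100', '300']
    print(split("oneAtwoBthreeC", "[ABC]"))                  # ['one', 'two', 'three', '']
```

Full compatibility means the GPU results must match what Spark produces on the CPU for every supported pattern, including edge cases such as unmatched groups and trailing empty strings after a split.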

Describe alternatives you've considered

None

Additional context

None

Jan 12 '22 17:01 andygrove

@andygrove FYI I added #4511 to the list, since I think we need to improve the current situation where regex kernels can fail with a confusing OOM error due to insufficient reserved memory rather than insufficient pool memory.

Jan 12 '22 18:01 jlowe

~Hi @andygrove, I found another bug in regexp_extract: #5088. Shall we add it to the list?~

Mar 30 '22 08:03 sperlingxx

Hi @andygrove, I added #5135 to the list as a high priority task, since I think it is a correctness issue that is triggered by more than just corner cases.

Apr 02 '22 08:04 sperlingxx