datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Improve performance of `regex_replace`

Open Dandandan opened this issue 3 years ago • 0 comments

Is your feature request related to a problem or challenge? Please describe what you are trying to do. Currently by far the worst performing query in ClickBench for datafusion is query 28:

SELECT REGEXP_REPLACE("Referer", '^https?://(?:www.)?([^/]+)/.*$', '1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM hits WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;

It seems this is due too REGEXP_REPLACE in this query.

Describe the solution you'd like Introduce a benchmark and improve the performance of REGEXP_REPLACE (with a scalar) in DataFusion.

Describe alternatives you've considered n/a Additional context

There are some more efficient patterns in arrow-rs for writing (string/regex) kernels. This could be used here as well. Optimizations that might be beneficial:

  • reusing the null buffer from the input array (instead of rebuilding in the iterator)
  • special casing scalar type, so no hashmap has to be used for storing / retrieving compiled regexes

Dandandan avatar Sep 16 '22 15:09 Dandandan