datafusion
datafusion copied to clipboard
Improve performance of `regex_replace`
Is your feature request related to a problem or challenge? Please describe what you are trying to do. Currently by far the worst performing query in ClickBench for datafusion is query 28:
SELECT REGEXP_REPLACE("Referer", '^https?://(?:www.)?([^/]+)/.*$', '1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer") FROM hits WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;
It seems this is due too REGEXP_REPLACE in this query.
Describe the solution you'd like
Introduce a benchmark and improve the performance of REGEXP_REPLACE (with a scalar) in DataFusion.
Describe alternatives you've considered n/a Additional context
There are some more efficient patterns in arrow-rs for writing (string/regex) kernels. This could be used here as well. Optimizations that might be beneficial:
- reusing the null buffer from the input array (instead of rebuilding in the iterator)
- special casing scalar type, so no hashmap has to be used for storing / retrieving compiled regexes