datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Add additional regexp functions

Open timsaucer opened this issue 1 year ago • 8 comments

Is your feature request related to a problem or challenge?

I would like to see the following regexp functions implemented. These exist in some, but not all, versions of PostgreSQL.

Describe the solution you'd like

Implement these functions.

Describe alternatives you've considered

These operations can be performed using the existing functions, so I am currently unblocked for my immediate use case but having these functions built in would be convenient.

Additional context

We currently have the following regexp functions implemented. The source is in datafusion/functions/src/regex/mod.rs

regexp_like() regexp_match() regexp_replace()

timsaucer avatar Aug 12 '24 12:08 timsaucer

I could work on this. The only concern is whether we implement the regexp function in this project or in arrow-rs.

Hey @alamb, would you prefer implement function in arrow-rs directly or we put them in datafusion and port later?

xinlifoobar avatar Aug 16 '24 13:08 xinlifoobar

I could work on this. The only concern is whether we implement the regexp function in this project or in arrow-rs.

Hey @alamb, would you prefer implement function in arrow-rs directly or we put them in datafusion and port later?

Thanks @xinlifoobar

I would personally recommend we start implementing them in datafusion as that will avoid the need to wait for coordinated releases of arrow-rs, and then port backupstream to arrow-rs as a follow on step.

alamb avatar Aug 17 '24 10:08 alamb

@xinlifoobar I suspect there will be several other contribtuors interested in helping out and learning during the process. If we have a good example to follow the work would be straightforward to scale I think

One way to do this might be:

  1. You implement one of these functions in a PR, along with good docs, tests, etc
  2. Then we can file additional tickets for the other functions, linking to your first implementation

alamb avatar Aug 17 '24 10:08 alamb

Related to this, substring in Postgres supports regex matching (see https://www.postgresql.org/docs/current/functions-matching.html), would it be reasonable for DataFusion to also support it?

The currently accepted argument types are:

                    Exact(vec![Utf8, Int64]),
                    Exact(vec![LargeUtf8, Int64]),
                    Exact(vec![Utf8, Int64, Int64]),
                    Exact(vec![LargeUtf8, Int64, Int64]),
                    Exact(vec![Utf8View, Int64]),
                    Exact(vec![Utf8View, Int64, Int64]),

Postgres's regex substring takes a string, a pattern, and an escape character, so I don't think there would be a conflict.

nrc avatar Aug 20 '24 09:08 nrc

Related to this, substring in Postgres supports regex matching (see https://www.postgresql.org/docs/current/functions-matching.html), would it be reasonable for DataFusion to also support it? Postgres's regex substring takes a string, a pattern, and an escape character, so I don't think there would be a conflict.

Spark's version of this is https://spark.apache.org/docs/latest/api/sql/#regexp_substr

Omega359 avatar Aug 20 '24 13:08 Omega359

Based on prior conversations it sounds like the group is most interested in making sure we are supporting Postgresql so I think adding this is a very good idea. We can also have regexp_substr as an alias.

timsaucer avatar Aug 20 '24 14:08 timsaucer

It is a good idea to support these functions, especially for regexp_matches, Thank you.

hengfeiyang avatar Oct 18 '24 08:10 hengfeiyang

It is a good idea to support these functions, especially for regexp_matches, Thank you.

Adding regexp_matches would be trivial - it's essentially the current regexp_match but the 'g' (global) flag is always present (whereas with the regexp_match it's disallowed)

Omega359 avatar Oct 18 '24 13:10 Omega359

Related to this, substring in Postgres supports regex matching (see https://www.postgresql.org/docs/current/functions-matching.html), would it be reasonable for DataFusion to also support it?

The currently accepted argument types are:

                    Exact(vec![Utf8, Int64]),
                    Exact(vec![LargeUtf8, Int64]),
                    Exact(vec![Utf8, Int64, Int64]),
                    Exact(vec![LargeUtf8, Int64, Int64]),
                    Exact(vec![Utf8View, Int64]),
                    Exact(vec![Utf8View, Int64, Int64]),

Postgres's regex substring takes a string, a pattern, and an escape character, so I don't think there would be a conflict.

I took a look at the syntax for that pg function, and frankly it's awful. Personally I think that is a function best ignored.

Omega359 avatar Oct 22 '24 14:10 Omega359

I see your point for the postgres - since it essentially does either substring from this character index to that OR regex matching, I think it would probably add more confusion than value to support in exactly the way they do. However I'm far from a SQL expert.

timsaucer avatar Oct 22 '24 14:10 timsaucer

I can implement regexp_substr() can I just create a pull request or do I need to be added to some access list?

osipovartem avatar Jan 23 '25 14:01 osipovartem

I can implement regexp_substr() can I just create a pull request or do I need to be added to some access list?

A PR should be fine. You should have a contributor agreement in place as well.

Omega359 avatar Jan 23 '25 15:01 Omega359

👋 @osipovartem -- thank you!

A PR should be fine. You should have a contributor agreement in place as well.

Note this is not a requirement and we merge many PRs from contributors without one on place

Best practice would be to have one in place, however.

alamb avatar Jan 23 '25 23:01 alamb