Add additional regexp functions
Is your feature request related to a problem or challenge?
I would like to see the following regexp functions implemented. These exist in some, but not all, versions of PostgreSQL.
- [ ] regexp_count()
- [ ] regexp_instr()
- [ ] regexp_matches()
- [ ] regexp_split_to_array()
- [ ] regexp_split_to_table()
- [ ] regexp_substr()
Describe the solution you'd like
Implement these functions.
Describe alternatives you've considered
These operations can be performed using the existing functions, so I am currently unblocked for my immediate use case but having these functions built in would be convenient.
Additional context
We currently have the following regexp functions implemented. The source is in datafusion/functions/src/regex/mod.rs
I could work on this. The only concern is whether we implement the regexp function in this project or in arrow-rs.
Hey @alamb, would you prefer implement function in arrow-rs directly or we put them in datafusion and port later?
I could work on this. The only concern is whether we implement the regexp function in this project or in
arrow-rs.Hey @alamb, would you prefer implement function in
arrow-rsdirectly or we put them indatafusionand port later?
Thanks @xinlifoobar
I would personally recommend we start implementing them in datafusion as that will avoid the need to wait for coordinated releases of arrow-rs, and then port backupstream to arrow-rs as a follow on step.
@xinlifoobar I suspect there will be several other contribtuors interested in helping out and learning during the process. If we have a good example to follow the work would be straightforward to scale I think
One way to do this might be:
- You implement one of these functions in a PR, along with good docs, tests, etc
- Then we can file additional tickets for the other functions, linking to your first implementation
Related to this, substring in Postgres supports regex matching (see https://www.postgresql.org/docs/current/functions-matching.html), would it be reasonable for DataFusion to also support it?
The currently accepted argument types are:
Exact(vec![Utf8, Int64]),
Exact(vec![LargeUtf8, Int64]),
Exact(vec![Utf8, Int64, Int64]),
Exact(vec![LargeUtf8, Int64, Int64]),
Exact(vec![Utf8View, Int64]),
Exact(vec![Utf8View, Int64, Int64]),
Postgres's regex substring takes a string, a pattern, and an escape character, so I don't think there would be a conflict.
Related to this,
substringin Postgres supports regex matching (see https://www.postgresql.org/docs/current/functions-matching.html), would it be reasonable for DataFusion to also support it? Postgres's regex substring takes a string, a pattern, and an escape character, so I don't think there would be a conflict.
Spark's version of this is https://spark.apache.org/docs/latest/api/sql/#regexp_substr
Based on prior conversations it sounds like the group is most interested in making sure we are supporting Postgresql so I think adding this is a very good idea. We can also have regexp_substr as an alias.
It is a good idea to support these functions, especially for regexp_matches, Thank you.
It is a good idea to support these functions, especially for
regexp_matches, Thank you.
Adding regexp_matches would be trivial - it's essentially the current regexp_match but the 'g' (global) flag is always present (whereas with the regexp_match it's disallowed)
Related to this,
substringin Postgres supports regex matching (see https://www.postgresql.org/docs/current/functions-matching.html), would it be reasonable for DataFusion to also support it?The currently accepted argument types are:
Exact(vec![Utf8, Int64]), Exact(vec![LargeUtf8, Int64]), Exact(vec![Utf8, Int64, Int64]), Exact(vec![LargeUtf8, Int64, Int64]), Exact(vec![Utf8View, Int64]), Exact(vec![Utf8View, Int64, Int64]),Postgres's regex substring takes a string, a pattern, and an escape character, so I don't think there would be a conflict.
I took a look at the syntax for that pg function, and frankly it's awful. Personally I think that is a function best ignored.
I see your point for the postgres - since it essentially does either substring from this character index to that OR regex matching, I think it would probably add more confusion than value to support in exactly the way they do. However I'm far from a SQL expert.
I can implement regexp_substr() can I just create a pull request or do I need to be added to some access list?
I can implement regexp_substr() can I just create a pull request or do I need to be added to some access list?
A PR should be fine. You should have a contributor agreement in place as well.
👋 @osipovartem -- thank you!
A PR should be fine. You should have a contributor agreement in place as well.
Note this is not a requirement and we merge many PRs from contributors without one on place
Best practice would be to have one in place, however.