django
django copied to clipboard
Fixed #28805 -- Added regular expression database functions.
This is a work in progress attempt to implement RegexpReplace()
, etc. for ticket-28805.
Backend Support:
PostgreSQL:
-
regexp_count(expr, pattern, position?, flags?)
-- v15+ only -
regexp_instr(expr, pattern, position?, occurrence?, return_opt?, flags?, subexpr?)
-- v15+ only -
regexp_replace(expr, pattern, replacement, position?, occurrence?, flags?)
-
position
andoccurrence
for v15+ only
-
-
regexp_substr(expr, pattern, position?, occurrence?, flags?, subexpr?)
-- v15+ only -
substr(expr from pattern)
-- required for <v15
MySQL:
-
REGEXP_REPLACE(expr, pattern, replacement, position?, occurrence?, flags?)
-
REGEXP_SUBSTR(expr, pattern, position?, occurrence?, flags?)
-
REGEXP_INSTR(expr, pattern, position?, occurrence?, return_opt?, flags?)
MariaDB:
Oracle:
-
REGEXP_REPLACE(expr, pattern, replacement, position?, occurrence?, flags?)
-
REGEXP_SUBSTR(expr, pattern, position?, occurrence?, flags?, subexpr?)
-
REGEXP_INSTR(expr, pattern, position?, occurrence?, return_opt?, flags?, subexpr?)
-
REGEXP_COUNT(expr, pattern, position?, flags?)
SQLite:
Implementation:
-
RegexpReplace(expression, pattern, replacement, flags)
-
RegexpSubstr(expression, pattern, flags)
-
RegexpStrIndex(expression, pattern, flags)
Notes:
-
position
is only supported on MySQL, Oracle, and PostgreSQL 15+ and is used to indicate where matches should begin. Support for this will not be implemented, save to pass in the default value of1
where subsequent function arguments are required.
- By default PostgreSQL replaces the first occurrence only unless
'g'
is passed inflags
to replace all occurrences. MySQL & Oracle replace all occurrences by default (occurrence=0
) or the specified occurrence. SQLite, using Python'sre.sub()
will replace all occurrences by default (count=0
) or up tocount
occurrences in the string. MariaDB only supports replacing all occurrences.
We will pass1
to the underlyingcount
/occurrence
parameter by default and accept'g'
inflags
to pass to PostgreSQL or0
to the underlyingcount
/occurrence
. MariaDB will ignore the'g'
flag as it will always replace everything.
-
return_opt
is only supported on MySQL and Oracle forREGEXP_INSTR
and is used to control which position value is returtned. Support for this will not be implemented, save to pass in the default value of0
where subsequent function arguments are required.
- The case-sensitive (
'c'
) and case-insensitive ('i'
) flags are supported by most backends. It seems that later flags specified take precedence over earlier ones. SQLite, using Python, is always case-sensitive by default and only supports'i'
, but if'c'
is present after'i'
, we can cancel the case-insensitivity.
MariaDB ~~doesn't support being passed~~ only supports inline flags. We can support'c'
and'i'
, by prefixingpattern
with(?-i)
and(?i)
respectively. It also seems that MariaDB is case-insensitive by default (unless we have some weird collation configuration on the Django CI).
MySQL also seems to be case-insensitive by default.
- The multi-line flag (
'm'
) seems to work similarly across all backends. PostgreSQL supports the value'm'
, but the canonical value for the flag is'n'
.
- The dotall flag (
's'
) seems to work similarly across all backends, but is the default on PostgreSQL. We ~~may want to~~ pass the'p'
flag by default for PostgreSQL to get it to behave like other backends -- it is also documented. Oracle and MySQL use the value'n'
for this flag instead.
- The extended flag (
'x'
) is not supported by MySQL. Oracle does not support comments in thepattern
when using'x'
, but all other backends do.
Issues:
- ~MySQL tests are currently skipped as Django CI doesn't have 8.0.4+.~ (No longer a problem - this was started so long ago!)
The main issue for me is that's it's not adjustable for 3rd-party database backends. Ideally, we could add a several hooks and call them in
as_{vendor}
methods instead of checking multiple vendors inas_sql()
.
So I've rejigged this in a way that should address this concern. Feel free to review those changes.
I'm now just waiting for the oracle tests to complete to see if I still get the failures I was experiencing. I suspect that I'll have to follow the advice recently added in 9369f0cebba1f65909a14dec6aa3515ec1eb2557.
I'm now just waiting for the oracle tests to complete to see if I still get the failures I was experiencing. I suspect that I'll have to follow the advice recently added in 9369f0c.
So the failures with Oracle remain:
db_functions.text.test_regexpreplace.RegexpReplaceFlagTests.test_dotall_flag
db_functions.text.test_regexpreplace.RegexpReplaceFlagTests.test_extended_flag
db_functions.text.test_regexpreplace.RegexpReplaceFlagTests.test_global_flag
db_functions.text.test_regexpreplace.RegexpReplaceFlagTests.test_multiline_flag
db_functions.text.test_regexpsubstr.RegexpSubstrFlagTests.test_dotall_flag
db_functions.text.test_regexpsubstr.RegexpSubstrFlagTests.test_extended_flag
db_functions.text.test_regexpsubstr.RegexpSubstrFlagTests.test_multiline_flag
I've noticed that the failing tests all have one thing in common: They pass Article.text
as their first argument - a TextField
- and not Article.title
as the other tests use - a CharField
.
@felixxm I'm wondering whether this is a more general problem with Oracle? I see that the majority of tests for text database functions don't test with TextField
inputs. You mentioned being unable to reproduce the issue with LPad
or Left
, but were you using a CharField
? Do you get the same issue if you pass in a TextField
, e.g. Article.text
, as the first argument? (I don't currently have an Oracle instance to test...) :detective:
buildbot, test on oracle.
@felixxm This has been nearly ready for a long time so I thought I'd try to finish it.
I've tested it out with the change that you suggested in https://github.com/django/django/pull/12438#discussion_r472006730 and all the tests pass. It doesn't appear to affect anything else. Is there any reason why that wouldn't be acceptable?
Thanks for all you efforts +1 and really sorry for the late reply.
It happens!
I'm still a bit skeptical about juggling so many database-specific flags, it's can be really hard to maintain. Personally, I'm -0 about this feature.
After two years of keeping this going with no discouragement, I hope we don't jump to a no too quickly...
We'll have to see if we can break things up further to deal with the flags better.
I left some comments, but we need to make few more things to make it reviewable again:
Have rebased, blackened, re-targeted to 4.1, and addressed the majority of comments. A couple of the comments will require a bit more thought, however.
As an aside, PostgreSQL 15 looks like it is going to learn REGEXP_INSTR
, REGEXP_REPLACE
and REGEXP_SUBSTR
- among others - which will eventually allow some of the hacks for PostgreSQL to be phased out:
- https://www.postgresql.org/docs/devel/functions-string.html
- https://commitfest.postgresql.org/34/3042/
buildbot, test on oracle.