django icon indicating copy to clipboard operation
django copied to clipboard

Fixed #28805 -- Added regular expression database functions.

Open ngnpope opened this issue 4 years ago • 5 comments

This is a work in progress attempt to implement RegexpReplace(), etc. for ticket-28805.

Backend Support:

PostgreSQL:

MySQL:

MariaDB:

Oracle:

SQLite:

Implementation:

  • RegexpReplace(expression, pattern, replacement, flags)
  • RegexpSubstr(expression, pattern, flags)
  • RegexpStrIndex(expression, pattern, flags)

Notes:

  • position is only supported on MySQL, Oracle, and PostgreSQL 15+ and is used to indicate where matches should begin. Support for this will not be implemented, save to pass in the default value of 1 where subsequent function arguments are required.

  • By default PostgreSQL replaces the first occurrence only unless 'g' is passed in flags to replace all occurrences. MySQL & Oracle replace all occurrences by default (occurrence=0) or the specified occurrence. SQLite, using Python's re.sub() will replace all occurrences by default (count=0) or up to count occurrences in the string. MariaDB only supports replacing all occurrences.
    We will pass 1 to the underlying count/occurrence parameter by default and accept 'g' in flags to pass to PostgreSQL or 0 to the underlying count/occurrence. MariaDB will ignore the 'g' flag as it will always replace everything.

  • return_opt is only supported on MySQL and Oracle for REGEXP_INSTR and is used to control which position value is returtned. Support for this will not be implemented, save to pass in the default value of 0 where subsequent function arguments are required.

  • The case-sensitive ('c') and case-insensitive ('i') flags are supported by most backends. It seems that later flags specified take precedence over earlier ones. SQLite, using Python, is always case-sensitive by default and only supports 'i', but if 'c' is present after 'i', we can cancel the case-insensitivity.
    MariaDB ~~doesn't support being passed~~ only supports inline flags. We can support 'c' and 'i', by prefixing pattern with (?-i) and (?i) respectively. It also seems that MariaDB is case-insensitive by default (unless we have some weird collation configuration on the Django CI).
    MySQL also seems to be case-insensitive by default.

  • The multi-line flag ('m') seems to work similarly across all backends. PostgreSQL supports the value 'm', but the canonical value for the flag is 'n'.

  • The dotall flag ('s') seems to work similarly across all backends, but is the default on PostgreSQL. We ~~may want to~~ pass the 'p' flag by default for PostgreSQL to get it to behave like other backends -- it is also documented. Oracle and MySQL use the value 'n' for this flag instead.

  • The extended flag ('x') is not supported by MySQL. Oracle does not support comments in the pattern when using 'x', but all other backends do.

Issues:

  • ~MySQL tests are currently skipped as Django CI doesn't have 8.0.4+.~ (No longer a problem - this was started so long ago!)

ngnpope avatar Feb 09 '20 14:02 ngnpope

The main issue for me is that's it's not adjustable for 3rd-party database backends. Ideally, we could add a several hooks and call them in as_{vendor} methods instead of checking multiple vendors in as_sql().

So I've rejigged this in a way that should address this concern. Feel free to review those changes.

I'm now just waiting for the oracle tests to complete to see if I still get the failures I was experiencing. I suspect that I'll have to follow the advice recently added in 9369f0cebba1f65909a14dec6aa3515ec1eb2557.

ngnpope avatar Sep 17 '20 14:09 ngnpope

I'm now just waiting for the oracle tests to complete to see if I still get the failures I was experiencing. I suspect that I'll have to follow the advice recently added in 9369f0c.

So the failures with Oracle remain:

db_functions.text.test_regexpreplace.RegexpReplaceFlagTests.test_dotall_flag
db_functions.text.test_regexpreplace.RegexpReplaceFlagTests.test_extended_flag
db_functions.text.test_regexpreplace.RegexpReplaceFlagTests.test_global_flag
db_functions.text.test_regexpreplace.RegexpReplaceFlagTests.test_multiline_flag
db_functions.text.test_regexpsubstr.RegexpSubstrFlagTests.test_dotall_flag
db_functions.text.test_regexpsubstr.RegexpSubstrFlagTests.test_extended_flag
db_functions.text.test_regexpsubstr.RegexpSubstrFlagTests.test_multiline_flag

I've noticed that the failing tests all have one thing in common: They pass Article.text as their first argument - a TextField - and not Article.title as the other tests use - a CharField.

@felixxm I'm wondering whether this is a more general problem with Oracle? I see that the majority of tests for text database functions don't test with TextField inputs. You mentioned being unable to reproduce the issue with LPad or Left, but were you using a CharField? Do you get the same issue if you pass in a TextField, e.g. Article.text, as the first argument? (I don't currently have an Oracle instance to test...) :detective:

ngnpope avatar Sep 21 '20 11:09 ngnpope

buildbot, test on oracle.

ngnpope avatar Jul 17 '21 11:07 ngnpope

@felixxm This has been nearly ready for a long time so I thought I'd try to finish it.

I've tested it out with the change that you suggested in https://github.com/django/django/pull/12438#discussion_r472006730 and all the tests pass. It doesn't appear to affect anything else. Is there any reason why that wouldn't be acceptable?

ngnpope avatar Jul 17 '21 11:07 ngnpope

Thanks for all you efforts +1 and really sorry for the late reply.

It happens!

I'm still a bit skeptical about juggling so many database-specific flags, it's can be really hard to maintain. Personally, I'm -0 about this feature.

After two years of keeping this going with no discouragement, I hope we don't jump to a no too quickly...

We'll have to see if we can break things up further to deal with the flags better.

I left some comments, but we need to make few more things to make it reviewable again:

Have rebased, blackened, re-targeted to 4.1, and addressed the majority of comments. A couple of the comments will require a bit more thought, however.


As an aside, PostgreSQL 15 looks like it is going to learn REGEXP_INSTR, REGEXP_REPLACE and REGEXP_SUBSTR - among others - which will eventually allow some of the hacks for PostgreSQL to be phased out:

  • https://www.postgresql.org/docs/devel/functions-string.html
  • https://commitfest.postgresql.org/34/3042/

ngnpope avatar Mar 20 '22 22:03 ngnpope

buildbot, test on oracle.

ngnpope avatar Jul 31 '23 09:07 ngnpope