rivescript-python icon indicating copy to clipboard operation
rivescript-python copied to clipboard

Revisit regular expression substitutions for Python 3.6

Open kirsle opened this issue 7 years ago • 1 comments

Python 3.6 is creating a breaking change to the way re.sub() works, in that any unknown escape sequence in the repl parameter (a backslash followed by any ASCII letter) becomes a runtime exception.

This commit fixes the particular case that was causing the unit tests in Travis CI to fail under Python 3.6.0a2 (nightly build).

In normal (non-UTF8) operation of RiveScript it should be difficult to trigger a similar exception, but with UTF-8 mode enabled and specially crafted RiveScript source files it could become possible to crash the library.

Every instance of re.sub() should be visited to make sure that no escape sequence will ever appear as the repl parameter that will cause a runtime crash of the program. Even if the user disables strict mode.

See also: https://docs.python.org/3.6/library/re.html#re.sub

Changed in version 3.6: Unknown escapes consisting of '\' and an ASCII letter now are errors.

Python 3.6 is set for release on December 12, 2016 and RiveScript should be ready for it before then.

kirsle avatar Jul 08 '16 22:07 kirsle

Possible problematic cases:

In _reply_regexp near line 2299 when handling optionals in triggers:

            regexp = re.sub(r'\s*\[' + re.escape(match) + '\]\s*',
                '(?:' + pipes + r'|(?:\\s|\\b))', regexp)

If the contents of pipes can contain any special characters this could break. Interestingly, the \\b doesn't raise an exception, and according to the new (3.6) docs, \\s should get expanded into a literal space character which probably isn't what we want either (tabs could break it).

Further down near line 2311 when it handles arrays:

regexp = re.sub(r'\@' + re.escape(array) + r'\b', rep, regexp)

To be safe we could replace this with a standard string replacement.

Possible things to test for breakage with optionals in triggers:

  • The user's message should be able to contain tab characters \t instead of spaces on either side of an optional and the message should still match (tests the \s being replaced with a space problem)
  • Do more testing in Python 3.6 in general to see why it likes \b as an escape sequence in the repl parameter. It might not be doing what I expect.
  • Turn off strict mode, turn on UTF-8 mode, and try to break it by putting escaped sequences in the literal optional text of a trigger.

All the other instances of re.sub replace a pattern with an empty string or a hard-coded string that contains no escape sequences.

kirsle avatar Jul 08 '16 22:07 kirsle