pytorch-openai-transformer-lm
Potentially incorrect regex in text_utils.py
Hi, we have some of your regexes in AllenNLP, and Python has been warning us about them for a while.
https://github.com/huggingface/pytorch-openai-transformer-lm/blob/master/text_utils.py#L30
In [38]: '''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)'''
<input>:1: DeprecationWarning: invalid escape sequence \?
<input>:1: DeprecationWarning: invalid escape sequence \?
<input>:1: DeprecationWarning: invalid escape sequence \?
<ipython-input-38-9a7773b0447c>:1: DeprecationWarning: invalid escape sequence \?
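For anyone who wants to reproduce the warning outside IPython, here is a minimal sketch. The helper name `escape_warnings` and the shortened pattern are made up for illustration; they are not from text_utils.py. Depending on the Python version, the warning category may be DeprecationWarning or SyntaxWarning, so the sketch filters on the message text instead.

```python
import warnings


def escape_warnings(source):
    """Compile a snippet of Python source and return the
    'invalid escape sequence' warnings raised while compiling it."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        compile(source, "<demo>", "eval")
    return [w for w in caught if "invalid escape sequence" in str(w.message)]


# Hypothetical, shortened version of the pattern, written as source text.
non_raw_source = "'(\\?+|\\)+)'"   # the source text  '(\?+|\)+)'
raw_source = "r'(\\?+|\\)+)'"      # the source text r'(\?+|\)+)'

print(len(escape_warnings(non_raw_source)) > 0)  # non-raw literal warns
print(len(escape_warnings(raw_source)))          # raw literal does not
```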
While fixing them, I looked at your implementation and noticed that you prefixed the expressions with r so they are raw strings (presumably to fix the same warnings). However, I think this actually changed one of your regexes to something other than what was intended.
# Before
$ '''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)'''
'(-+|~+|!+|"+|;+|\\?+|\\++|,+|\\)+|\\(+|\\+|\\/+|\\*+|\\[+|\\]+|}+|{+|\\|+|_+)'
# After
$ r'''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)'''
'(-+|~+|!+|"+|;+|\\?+|\\++|,+|\\)+|\\(+|\\\\+|\\/+|\\*+|\\[+|\\]+|}+|{+|\\|+|_+)'
$ '''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)''' == r'''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)'''
False
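The switch to raw strings changed the alternative shown in the reprs above as '\\+' to '\\\\+'. Since those are reprs, the actual pattern text went from \+ (which the regex engine reads as a literal plus sign) to \\+ (one or more backslashes). If the goal was to keep the original behavior, I think you actually want the following regex.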
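To make the difference concrete, here is a small sketch that isolates just the alternative in question. The variable names and the sample text are mine, chosen for illustration. Both values are written as raw strings so the example itself triggers no escape warnings.

```python
import re

# Value produced by the non-raw literal '\\+': the two characters \ and +.
before_alt = r"\+"
# Value produced by the raw literal r'\\+': the three characters \, \ and +.
after_alt = r"\\+"

# Sanity check: the non-raw spelling really collapses to the two-character value.
assert '\\+' == before_alt

sample = "a+b\\\\c"  # the text a+b\\c, containing two backslashes

before_matches = re.findall(before_alt, sample)
after_matches = re.findall(after_alt, sample)

print(before_matches)  # ['+']       a literal plus sign
print(after_matches)   # ['\\\\']    one run of two backslashes
```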
r'''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)'''
$ r'''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)''' == '''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)'''
True
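As a rough check of what this means for real input, the sketch below runs both full patterns over a made-up sample string. The names `PROPOSED`, `RAW_PREFIXED`, and `sample` are assumptions for illustration, and `re.findall` stands in for whatever text_utils.py actually does with the pattern.

```python
import re

# Proposed pattern: same value as the original non-raw literal.
PROPOSED = r'''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)'''
# Pattern after simply adding the r prefix to the original literal.
RAW_PREFIXED = r'''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)'''

sample = "(hi?) a\\b--c"  # the text (hi?) a\b--c, containing one backslash

proposed_hits = re.findall(PROPOSED, sample)
raw_hits = re.findall(RAW_PREFIXED, sample)

print(proposed_hits)  # ['(', '?', ')', '--']
print(raw_hits)       # ['(', '?', ')', '\\', '--']
```

Only the raw-prefixed pattern picks up the backslash, so which one is right depends on whether backslashes were meant to be matched in the first place.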