Potentially incorrect regex in text_utils.py

Open schmmd opened this issue 5 years ago • 0 comments

Hi, we use some of your regexes in AllenNLP, and Python has been warning us about them for a while.

https://github.com/huggingface/pytorch-openai-transformer-lm/blob/master/text_utils.py#L30

In [38]: '''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)'''
<input>:1: DeprecationWarning: invalid escape sequence \?
<ipython-input-38-9a7773b0447c>:1: DeprecationWarning: invalid escape sequence \?
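
For context, these warnings come from the compiler rather than from re: as far as I know, an unrecognized escape such as \? in a non-raw literal has raised a DeprecationWarning since Python 3.6 and a SyntaxWarning since 3.12. Below is a minimal sketch for reproducing this outside IPython. The helper name invalid_escape_warnings is made up for illustration and is not part of text_utils.py.

```python
import warnings


def invalid_escape_warnings(source):
    """Compile `source` and return the names of the warning categories raised.

    Hypothetical helper for illustration; not part of text_utils.py.
    """
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure no filter hides the warning
        compile(source, "<example>", "exec")
    return [w.category.__name__ for w in caught]


# The source text is held in raw strings so that this snippet itself stays clean.
print(invalid_escape_warnings(r"pattern = '\?+'"))   # non-raw literal: warns
print(invalid_escape_warnings(r"pattern = r'\?+'"))  # raw literal: no warning
```

The category name printed for the first call depends on the interpreter version, so no fixed output is shown here.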

While fixing them, I looked at your implementation and noticed that you prefixed the expressions with r to make them raw strings (presumably to fix the same warnings). However, I think this changed one of your regexes to something other than what was intended.

# Before
$ '''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)'''
'(-+|~+|!+|"+|;+|\\?+|\\++|,+|\\)+|\\(+|\\+|\\/+|\\*+|\\[+|\\]+|}+|{+|\\|+|_+)'

# After
$ r'''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)'''
'(-+|~+|!+|"+|;+|\\?+|\\++|,+|\\)+|\\(+|\\\\+|\\/+|\\*+|\\[+|\\]+|}+|{+|\\|+|_+)'

$ '''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)''' == r'''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)'''
False
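
To locate where the two literals diverge, one option is to walk both strings up to the first differing character. In this sketch the non-raw value is written with doubled backslashes (copied from the repr output above) so that the snippet itself triggers no warnings; the variable names are only illustrative.

```python
# Value of the original non-raw literal, written with explicit escapes.
before = '(-+|~+|!+|"+|;+|\\?+|\\++|,+|\\)+|\\(+|\\+|\\/+|\\*+|\\[+|\\]+|}+|{+|\\|+|_+)'
# Value of the raw-string literal.
after = r'''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)'''

# Index of the first character at which the two strings differ.
idx = next(i for i, (a, b) in enumerate(zip(before, after)) if a != b)

print(len(after) - len(before))   # 1: the raw version is one character longer
print(before[idx - 1:idx + 1])    # \+
print(after[idx - 1:idx + 2])     # \\+
```

Every other alternative is identical, because an unrecognized escape like \? keeps its backslash in a non-raw literal; only the recognized escape \\ collapses to a single backslash.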

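The switch to raw strings changed the alternative that repr shows as '|\\+' (the pattern \+, a literal plus sign) to '|\\\\+' (the pattern \\+, one or more backslashes). If the goal was to keep the original behavior, I think you actually want the following regex.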

r'''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)'''

$ r'''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)''' == '''(-+|~+|!+|"+|;+|\?+|\++|,+|\)+|\(+|\\+|\/+|\*+|\[+|\]+|}+|{+|\|+|_+)'''
True
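
As a sanity check on behavior rather than on string equality, the affected alternative can be run through re on its own. The sketch below is reduced to that single alternative, and the variable names are mine, not from text_utils.py. It suggests that the original non-raw form and the proposed raw form both match a literal plus sign, while the current raw form matches a run of backslashes.

```python
import re

before = '\\+'    # non-raw literal: the two-character string \+
current = r'\\+'  # raw literal as currently committed: the string \\+
proposed = r'\+'  # raw literal proposed above: the string \+

# The proposed raw literal reproduces the original string exactly.
assert before == proposed

results = {}
for name, pattern in [("before", before), ("current", current), ("proposed", proposed)]:
    matches_plus = re.fullmatch(pattern, '+') is not None
    matches_backslashes = re.fullmatch(pattern, '\\\\') is not None  # two backslashes
    results[name] = (matches_plus, matches_backslashes)
    print(name, matches_plus, matches_backslashes)

# before True False
# current False True
# proposed True False
```

Which of the two behaviors is wanted is for the maintainers to confirm; the proposed literal only restores what the pattern did before the switch.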

schmmd • Jan 09 '19 18:01