dateparser icon indicating copy to clipboard operation
dateparser copied to clipboard

SECURITY: bad regex pattern in 'dateparser/languages/locale.py' will cause 'ReDos' security problem.

Open leveryd opened this issue 3 years ago • 4 comments

Problem description

i found one bad regex pattern in 'dateparser/languages/locale.py'

r'(?:[¡¿]+|[\.!?;…\r\n]+(?:\s|$))+',  # Spanish

those pattern will cause 'ReDos' security problem, proof of code like below

import re
p = re.compile(r'(?:[¡¿]+|[\.!?;…\r\n]+(?:\s|$))+')
re.findall(p, "?"*100000+"x")

run the above code, cpu utilization will be 100% in a very long period.

more detail about 'ReDos' please see owasp.

effect of this security problem

some api will call the pattern,like below

from dateparser.search import search_dates

search_dates("?"*100000+"x")

leveryd avatar Jan 21 '21 02:01 leveryd

Hi @leveryd

I tried running %time search_dates("x"*100000+"x") (note that the first character is an "x") and I got:

CPU times: user 3min 27s, sys: 6.04 s, total: 3min 33s
Wall time: 3 min 38s

The CPU was at 99% :sweat_smile: but it didn't hang :stuck_out_tongue:.

Then I tried with your example, changing the first character to "?": %time search_dates("?"*100000+"x"). And this is what I found:

CPU times: user 9min 23s, sys 9.61 s, total: 9 min 32s
Wall time: 9min 37s

So it seems that it's definitely having some impact, but it's not so critical as I initially thought.

After this I introduced the mentioned regex to this reDoS checker: http://redos-checker.surge.sh/ and it wasn't detected as a vulnerable regex (though it could be wrong).

Before finishing doing my tests, I tried this: %time search_dates("?"*10000+"x") (4 zeroes) and I got:

Wall time: 4.54 s

and then I tried switching from "?" to "x" (%time search_dates("x"*10000+"x")) and I got this:

Wall time: 20 s

That's definitely suspicious, and I'm not sure if it's a vulnerable regex or not...

I would appreciate any help from people with more experience with reDoS.

noviluni avatar Jan 27 '21 08:01 noviluni

regexploit also reports this as exponential:

Redos(starriness=11, prefix_sequence=SEQ{ [27:'] }, redos_sequence=SEQ{ [21:!,2026,0a,0d,2e:.,3b:;,3f:?]{1+}{1+} [27:'] }, repeated_character=[21:!,2026,0a,0d,2e:.,3b:;,3f:?], killer=None)
Worst-case complexity: 11 ⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐ (exponential)
Repeated character: [21:!,2026,0a,0d,2e:.,3b:;,3f:?]
Example: '\'' + '!' * 3456

btw: snyk reports this as vulnerability of this library: https://security.snyk.io/vuln/SNYK-PYTHON-DATEPARSER-1063229

sebix avatar Nov 21 '21 18:11 sebix

Unfortunately my organization's new security policies have bumped into that SNYK vulnerability where we have a couple of projects that use this library. I won't claim to be a regexpert, but it seems like a quick fix might be to limit the quantifiers to a range one could "reasonably" expect as a sentence separator. E.g., instead of:

(?:[¡¿]+|[\.!?;…\r\n]+(?:\s|$))+

Something like:

(?:[¡¿]{1,50}|[\.!?;…\r\n]{1,50}(?:\s|$))+

(I can't imagine anyone using more than 50 punctuation marks to end a sentence but just having that limit seems to cut the processing time significantly)

Would that be a satisfactory resolution to get the SNYK report closed?

Also, regexploit no longer seems to report the original regex as vulnerable, I'm not sure what changed there.

Daveography avatar Jul 06 '22 20:07 Daveography

Any updates on that?

macejiko avatar Aug 31 '22 11:08 macejiko