dateparser
dateparser copied to clipboard
SECURITY: bad regex pattern in 'dateparser/languages/locale.py' will cause 'ReDos' security problem.
Problem description
i found one bad regex pattern in 'dateparser/languages/locale.py'
r'(?:[¡¿]+|[\.!?;…\r\n]+(?:\s|$))+', # Spanish
those pattern will cause 'ReDos' security problem, proof of code like below
import re
p = re.compile(r'(?:[¡¿]+|[\.!?;…\r\n]+(?:\s|$))+')
re.findall(p, "?"*100000+"x")
run the above code, cpu utilization will be 100% in a very long period.
more detail about 'ReDos' please see owasp.
effect of this security problem
some api will call the pattern,like below
from dateparser.search import search_dates
search_dates("?"*100000+"x")
Hi @leveryd
I tried running %time search_dates("x"*100000+"x")
(note that the first character is an "x") and I got:
CPU times: user 3min 27s, sys: 6.04 s, total: 3min 33s
Wall time: 3 min 38s
The CPU was at 99% :sweat_smile: but it didn't hang :stuck_out_tongue:.
Then I tried with your example, changing the first character to "?": %time search_dates("?"*100000+"x")
. And this is what I found:
CPU times: user 9min 23s, sys 9.61 s, total: 9 min 32s
Wall time: 9min 37s
So it seems that it's definitely having some impact, but it's not so critical as I initially thought.
After this I introduced the mentioned regex to this reDoS checker: http://redos-checker.surge.sh/ and it wasn't detected as a vulnerable regex (though it could be wrong).
Before finishing doing my tests, I tried this: %time search_dates("?"*10000+"x")
(4 zeroes) and I got:
Wall time: 4.54 s
and then I tried switching from "?" to "x" (%time search_dates("x"*10000+"x")
) and I got this:
Wall time: 20 s
That's definitely suspicious, and I'm not sure if it's a vulnerable regex or not...
I would appreciate any help from people with more experience with reDoS.
regexploit also reports this as exponential:
Redos(starriness=11, prefix_sequence=SEQ{ [27:'] }, redos_sequence=SEQ{ [21:!,2026,0a,0d,2e:.,3b:;,3f:?]{1+}{1+} [27:'] }, repeated_character=[21:!,2026,0a,0d,2e:.,3b:;,3f:?], killer=None)
Worst-case complexity: 11 ⭐⭐⭐⭐⭐⭐⭐⭐⭐⭐ (exponential)
Repeated character: [21:!,2026,0a,0d,2e:.,3b:;,3f:?]
Example: '\'' + '!' * 3456
btw: snyk reports this as vulnerability of this library: https://security.snyk.io/vuln/SNYK-PYTHON-DATEPARSER-1063229
Unfortunately my organization's new security policies have bumped into that SNYK vulnerability where we have a couple of projects that use this library. I won't claim to be a regexpert, but it seems like a quick fix might be to limit the quantifiers to a range one could "reasonably" expect as a sentence separator. E.g., instead of:
(?:[¡¿]+|[\.!?;…\r\n]+(?:\s|$))+
Something like:
(?:[¡¿]{1,50}|[\.!?;…\r\n]{1,50}(?:\s|$))+
(I can't imagine anyone using more than 50 punctuation marks to end a sentence but just having that limit seems to cut the processing time significantly)
Would that be a satisfactory resolution to get the SNYK report closed?
Also, regexploit no longer seems to report the original regex as vulnerable, I'm not sure what changed there.
Any updates on that?