How can we avoid ReDoS without trust on PCRE limits
PCRE is one of the most popular regex libraries available out there. It is heavily used in ModSecurity although it may be optional on 3.1 where Hyperscan and RE2 are likely to be added to the soup as well.
ModSecurity assumes that the one who is written the rules are not an attacker, rather someone that is trying to protect the application. Unfortunately, some times, not-so-experienced users may have a problem writing the rules which leads to undesired or unexpected behavior.
A clear example of that is while the user accesses a variable that is not yet set on a given phase; Assuming that it was filled and had an empty value. Another example is while the user accesses a key that wasn't set on a collection, leading to always having an empty value. Those will be addressed soon on v3 error handler — blog post on its way.
Other examples were already fixed on v3 as part of the usability up taken:
- SpiderLabs/owasp-modsecurity-crs#995 - Uses MULTIPART_MISSING_SEMICOLON instead of MULTIPART_SEMICOLON_MISSING
- SpiderLabs/owasp-modsecurity-crs#559 - Missing quote at the end of the rule 920200
- SpiderLabs/owasp-modsecurity-crs#558 - Missing quote at the end of the rule 920201
- SpiderLabs/owasp-modsecurity-crs#557 - Missing quote at the end of the rule 920450
- SpiderLabs/owasp-modsecurity-crs#556 - Missing quote at the end of the rule 920202
- SpiderLabs/owasp-modsecurity-crs#555 - Adds missing comma in the action list of rule #912120
Those may lead to unexpected behavior on v2 and silent fail.
On #2071 a subject that was discussed back on 2013 reappear in the surface [#267]. Should ModSecurity apply restrictions on the regex match? The implications are described at #2071.
To illustrate the problem we have: #56, #267, #1176, #1481, and #1290
The main objective of this issue is to incite a discussion on how to handle it well, focused on the regular user.
I would also like to mention that @s0md3v was the one that brought our eyes to this very problem. Thank you @s0md3v and @airween for open the pull request #2071.
There are two popular regex algorithms:
- NFA
- DFA
Key Differences
Handling alternatives
Regex: (b|c|m)at
String: cat
- NFA favors first alternative. It starts with the first one and skips the other alternatives when a match is found and hence it will stop at
cat. An ideal exploit string will have the last alternative to make sure the engine has to go through all the alternatives. - DFA tries to match all of them and then selects the longest match and hence it will try to match
bat,catandmati.e. all possibilities.
Possessiveness
Regex: a(b)?(bc)?
String: abc
- NFA is possessive, it doesn't want to give up what it has already matched. It sees that
amatches, then it moves tobwhich also matches but thenbcfails to match but that's okay because it's optional and hence the result isab. - DFA goes for the longest match possible. After the
amatches, it sees that bothbandbccan be matched and hence it goes for the longest match possible i.e.abc
Exploitation strategies
-
NFA: Write a string that matches everything except the last token so it keeps backtracking.
-
DFA: Give it too many permutations to consider.
Since DFA will consider all the possibilities anyway, a bad regular expression will always eat up memory (not time) irrespective of the input.
NFA on the other hand, performs fine unless a specially crafted string is passed to the engine.
I am not sure what are your takeaways from this but I believe NFA is better, DFA doesn't even support lazy/greedy matching. Mod_Security currently uses a NFA one.
How to spot a bad regular expression?
Nested repetition operators
Regex: (?:/[dflr].*)*
This is a sub-pattern that I found in CRS. Notice how one repetition operator * is being used over another? That's bad.
Intersecting patterns (Remote)
Regex: .*(\d+)
As you might has guessed, it is looking for <anythng>(<more than one digits>).
Now the problem is, .* matches the strings matched by (, \d+ and ).
For a string, function(1234)
NFA will first match function(1234) with .* but (\d+) fails to match so it goes one step back and matches function(1235 but then the remaining ) fails to match (\d+) and keeps going backwards. See the problem?
.* should stop matching once an ( is encountered but it doesn't because NFA is greedy by default.
Intersecting patterns (Adjacent)
Regex: \d+.*
This is similar to the previous one but this time adjacent patterns are intersecting.
Intersecting patterns (Alternative)
Regex: (?:^[\"'`\\\\]*?[^\"'`]+[\"'`])+|(?:^[\"'`\\\\]*?[\d\"'`]+)+
This is another bad regular expression that I found in CRS.
Both alternate patterns start with ^[\"'`\\\\]*? which causes the regex engine to keep looking for both patterns and hence increasing the permutations.
In the second alternate pattern, the tokens [\"'`\\\\]*? and [\d\"'`]+ intersect and match ", ' and `.
On top of that, nested repetition operators are being used.
Conclusion
I do not support automation for checking if ReDOS exists. It's better if the maintainers can see if the rules do not have the properties above. You can also mention me on a PR to take a look, I am hyperactive on Github.
Hi @zimmerle
thanks for reply here too :).
The main objective of this issue is to incite a discussion on how to handle it well, focused on the regular user.
Do you think about for example check the regular expressions when ModSec starts, and notify user if there is a potential wrong syntax? Or what? Could you show some example?
Hi @zimmerle,
I am the Hyperscan maintainer and developer.
Hyperscan delivers high performance for both multiple literal matching and regex matching, so it could naturally replace Aho Corasick algorithm and PCRE to get performance boost. We conduct performance comparisons in our paper at https://www.usenix.org/system/files/nsdi19-wang-xiang.pdf.
There is also a published whitepaper at https://www.hyperscan.io/2020/09/28/optimize-azure-cloud-security-with-intel-hyperscan/ for Microsoft Azure WAF that is now accelerated by Hyperscan.
Any plan or progress of adding Hyperscan as a regex matching option? We would like to collaborate if any help is needed.
Thanks, Xiang
Hi @xiangwang1,
We definitely want to support Hyperscan :fireworks:.
To make the transition smoothly to our users, we want to have a compilation flag that allows the user to change the regular expression engine to whatever they feel comfortable with, including Hyperscan.
On issue #2012, the researcher @WGH- offered an additional implementation for RE2. One of the reasons we hold merging the pull request was that we didn't have this selection mechanism available yet.
The implementation for the support of Hyperscan as a regular expression engine should be considered easy in terms of effort. All regular expression usage inside ModSecurity is limited to the class Regex -
https://github.com/SpiderLabs/ModSecurity/blob/50fc347ed4d0852d54561e15559de07c698fbb16/src/utils/regex.h#L33-L91
Populating those methods with the correct interface for Hyperscan will do the job.
The Aho Corasick replacement could be used as the kernel for the @pm operator. It currently uses this Aho Corasick-ish implementation that we are more than welcome to replace with something more performant and/or use less memory.
@xiangwang1 do you think you can help us to fill the Regex class to use the Hyperscan ? :construction_worker:
Hi @zimmerle,
Since current ModSecurity rules are written for PCRE support only, one problem with replacing PCRE with Hyperscan or RE2 is lack of full syntax support such as zero-width assertion, back reference, etc. Do you have any idea or plan to address this issue?
If we stick to PCRE syntax, then we can't achieve 100% replacement. Here is what we do for PCRE replacement in other systems like Snort and Suricata:
- Use Hyperscan prefiltering mode to transform natively unsupported rules to supported ones with a superset of matches.
- Use Hyperscan as the filter and only trigger PCRE match if there are any matches found by Hyperscan.
There are also good slides about this for Rspamd.
It's easy to replace Aho Corasick with Hyperscan as they share the same match behavior. This is the major performance boost we get in Snort and Suricata. In general, the more literal rules we have, the more performance boost and memory space saving from Hyperscan. We may think about this replacement as the first step. 😊
Just an idea for introduction of Hyperscan (and RE2 too) instead of replacing of PCRE.
It would be better to use a new operator (similar to rxGlobal): if somebody wants to use Hyperscan compatible expressions, then it would be the @rxHyper, for RE2 would be @re2.
Another solution would be that user can decide in compilation time (I mean ./configure --with-regex-engine=pcre|re2|hyperscan) which regex engine he/she wants to use.
The list of regex engines can be expanded with pcre2.
@xiangwang1 : In the prefiltering mode is the construction of the superset of matches automatic, or is this pattern something that a user needs to provide?
If it's automatic, does the following work in practice:
- PCRE specific pattern: prefiltering mode constructs superset, hyperscan is executed, superset match, pcre is invoked, original pattern match is executed, pcre match is being returned.
That would mean one can submit PCRE specific patterns and treat the engine(s) like a black box that returns pcre matches.
This would of course be very welcome to projects like the OWASP ModSecurity Core Rule Set that is being used in the Azure WAF as well, if I am not mistaken.
RE2 is largely compatible with PCRE. My pull request (#2012) is built on idea that the patterns that can't be compiled on RE2 can use PCRE as a fallback.
Also, some people apparently use CRS outside of ModSecurity, and contributed fixing the patterns to make them RE2-compatible, where it's possible.
Hyperscan, however, is a different beast. It shines when you match a set of multiple patterns against the same input. It would require some major changes in ModSecurity code to do that. At the first glance, you would have to group all rules that use regular expression operator against the same transformation/variable, and make a single regular expression "database" (that's what they call it) for it, and basically evaluate all correspondng operators in one go.
(...)
It's easy to replace Aho Corasick with Hyperscan as they share the same match behavior. This is the major performance boost we get in Snort and Suricata. In general, the more literal rules we have, the more performance boost and memory space saving from Hyperscan. We may think about this replacement as the first step.
Sounds like a plan :rocket: :+1:. That seems to be an easy improvement with a real benefit to our users. I think we can start to work on that one as the change is more stragithforward.
As a roadmap I would consider -
- [ ] Having the hyperscan detected on the build system.
- [ ] Make sure that hyperscan info is shown on the configure.
- [ ] Having a definition set, such as: -DWITH_HYPERSCAN conditional to the existence of hyperscan.
- [ ] Having an option --keep-old-acpm-tree to avoid the replacement of the old Aho, this may set: -DUSING_OLD_ACMP_TREES.
- [ ] Having hyperscan as a replacement of Aho if USING_OLD_ACMP_TREES is not set.
- [ ] Make sure this new compilation was added to the QA compilation matrix.
- [ ] Make sure all test from QA are passing.
- [ ] Have it ready for shipment on the selected version (TBD).
- [ ] Blog post and media in coordination with the hyperscan community.
Idea for Configure summary
ModSecurity - v3.0.4-112-gfc75b5c9 for Linux
Mandatory dependencies
+ libInjection ....v3.9.2-46-gbfba51f
+ SecLang tests ....a3d4405
Optional dependencies
+ GeoIP/MaxMind ....found
* (MaxMind) v
/usr/lib//libmaxminddb.so, /usr/include, -DWITH_MAXMIND -I/usr/include
* (GeoIP) v1.6.12
-lGeoIP , -I/usr/include/
+ LibCURL ....found v7.74.0
-lcurl, -DWITH_CURL_SSLVERSION_TLSv1_2 -DWITH_CURL
+ YAJL ....found v2.1.0
-lyajl , -DWITH_YAJL -I/usr/include/yajl
+ LMDB ....disabled
+ LibXML2 ....found v2.9.10
-lxml2 -lz -llzma -licui18n -licuuc -licudata -lm -ldl, -I/usr/include/libxml2 -DWITH_LIBXML2
+ SSDEEP ....found
-lfuzzy -L/usr/lib/, -DWITH_SSDEEP -I/usr/include
+ LUA ....found v503
-llua5.3 -L/usr/lib/, -DWITH_LUA -DWITH_LUA_5_3 -I/usr/include/lua5.3
+ Hyperscan ... found vX.Y.Z
-lhyperscan, -DWITH_HYPERSCAN
Other Options
+ Test Utilities ....enabled
+ SecDebugLog ....enabled
+ afl fuzzer ....disabled
+ library examples ....enabled
+ Building parser ....disabled
+ Treating pm operations as critical section ....disabled
+ Using Hyperscan instead of Aho Corasick (pm) ....enabled
Notice -
+ Hyperscan ... found vX.Y.Z
-lhyperscan, -DWITH_HYPERSCAN
+ Using Hyperscan instead of Aho Corasick (pm) ....enabled
QA Compilation matrix
https://github.com/SpiderLabs/ModSecurity/blob/50fc347ed4d0852d54561e15559de07c698fbb16/.github/workflows/ci.yml#L15-L23
Currently ACMP implementation
https://github.com/SpiderLabs/ModSecurity/blob/50fc347ed4d0852d54561e15559de07c698fbb16/src/utils/acmp.h#L28-L195
Where the ACMP implementation is used?
Initialization
https://github.com/SpiderLabs/ModSecurity/blob/50fc347ed4d0852d54561e15559de07c698fbb16/src/operators/pm.cc#L116-L151
Run time evaluation
https://github.com/SpiderLabs/ModSecurity/blob/50fc347ed4d0852d54561e15559de07c698fbb16/src/operators/pm.cc#L84-L113
Regular expression engine replacement
RE2 is largely compatible with PCRE. My pull request (#2012) is built on idea that the patterns that can't be compiled on RE2 can use PCRE as a fallback.
Although I like @WGH-'s run time fallback option, I most say that it could be confusing while supporting n+2 engines. I would go for the compilation option. One option may include the combination of RE+RE2 as suggested on #2012.
Since current ModSecurity rules are written for PCRE support only, one problem with replacing PCRE with Hyperscan or RE2 is lack of full syntax support such as zero-width assertion, back reference, etc. Do you have any idea or plan to address this issue?
@xiangwang1 Compatibility is with older versions is important, however, we understand that sometimes - in order to evolve - we need to avoid legacy. Specially if there are some technical debts involved.
Having said that, as long as we keep the fallback option (whatever it is: run time and/or compilation time) and consider it as addition to the correct version, we should be fine to go. The main goal should be on performance and usability. To cite other projects approaches on that matter -
- Snort++
- https://readlist.com/lists/lists.sourceforge.net/snort-users/4/22335.html
- https://blog.snort.org/search/label/Hyperscan
- Later on Snort 2.9.x
- https://01.org/node/4298
- Suricata
- https://suricata-ids.org/2016/04/04/suricata-3-0-1-released/
As mentioned by @WGH-
Also, some people apparently use CRS outside of ModSecurity, and contributed fixing the patterns to make them RE2-compatible, where it's possible.
In terms of CRS, Microsoft's waf team may want to share their experience (CC: @allanbomsft, @alonzop !?!?). But, I would be concerned about avoiding road blocks for the users on ModSecurity engine, eventually they update accordingly.
I am considering this change as an addition to v3.1, where significant changes are expected anyway.
The impact on the changes may be clear with the existence o a PoC. Is that something that we can consider @xiangwang1?
The impact on the changes may be clear with the existence o a PoC. Is that something that we can consider @xiangwang1?
Yes, I think we can leverage the existing PoC as starting point. I personally prefer the compilation option to let users can select the regex engine (PCRE, RE2 and Hyperscan) based on their configured rules and requirements (performance and usability v.s. syntax support).
Closing this due to the recent merge of #2736 .
(The conversation in this item has covered some separate matters; individuals should feel free to open separate issues for these as desired.)