codeql
codeql copied to clipboard
Refactorizations of the ReDoS libraries
Many things are happening in this PR, all of which are outlined below.
Commit-by-commit review is highly recommended.
All the tests should pass on every commit.
~~1: Disable parts of QL-for-QL~~
~~The consistency and ql/dead-code
queries are very noisy until we have proper support for parameterized modules.
So I've disabled those.~~
No longer needed now that https://github.com/github/codeql/pull/9575 has been merged.
2: Introduce a ReDoSPruning
parameterized module
The class Config extends string
pattern used previously was ugly.
This is just a straight-forward conversion of that pattern into a parameterized module.
3: Implement a more efficient algorithm for computing strings from chains
Based on @hvitved's work: https://github.com/github/codeql/pull/8589
But moved into a parameterized module in order to share more code between exponential and superlinear ReDoS.
It does the same thing as @hvitved's implementation, but in a slightly different way (to make a pretty signature module).
4: add further normalization of char classses
Inspired by @esbena: https://github.com/github/codeql/pull/4847#discussion_r547244758
Now \d
and [\d]
and [0-9]
will literally be represented with a single value.
(an arbitrary one of them is chosen as the canonical representative).
The extra normalization has a real, but limited, impact.
The largest impact is on the size of the SuperlinearBackTracking::getAThreewayIntersect
predicate, the size of that predicate drops by ~15% on large JS benchmarks.
I also drive-by fixed a bug related to case-normalization.
(The bug didn't affect anything before the changes in this PR).
5: make a parameterized module out of the RegexpMatching implementation
I didn't like the configuration pattern used previously, which was also why I hadn't made a real library out of it.
It can be better with parameterized modules, so I did that.
6: Rename ReDoSUtil
to NfaUtils
, and rename the performance
folder to regexp
.
The ReDoSUtil
file was used for more than just ReDoS: it's a general library for constructing and reasoning about NFAs.
And the performance
folder was also a bad name, so I renamed it to regexp
.
7: Improve performance.
Lots of states in the process()
predicate could be eliminated.
Because we know that if a regexp ever "reaches" the accept-all state, then it can never be rejected, and thus we can skip them early.
I've also ported this exponential-redos fix from Shack to the superlinear library.
Evaluations.
All look good, and mostly show a significant speedup (except for Python, which has neutral performance).
JavaScript, Ruby, Java, Python.
Previous version of this PR, which got filled up with noise: https://github.com/github/codeql/pull/8522
Ping @ginsbach : Parameterised modules in here.
Thanks for pinging. The use of parameterised modules here looks good to me, no concerns!
I've rebased on the latest main
now that QL-for-QL supports parameterized modules: https://github.com/github/codeql/pull/9575
I've also re-enabled the QL-for-QL consistency- and dead-code-query in this PR.
I did a rebase on main
to resolve the conflict with https://github.com/github/codeql/pull/9618.
If I understood correctly, NfaUtils is still an old-style Pyramaterised module. Any plans to change that?
Yes. But doing that requires features that we don't have yet (nested modules, and better support for classes in parameterized modules).
The entire API for the tree-structure of an regular-expressions needs to become a signature.
I didn't review code that appears to have been mostly moved, and for now I've only skimmed the commit which "simplifies recursion" as it wasn't obvious from the description here what's really going on.
It's basically just a port of Toms performance fix in ExponentialBackTracking.qll
: https://github.com/github/codeql/pull/8589
I did a merge with main
to fix a merge conflict.
@github/codeql-python can I get a review?
@yoff I think I've addressed your review comments, and I also addressed some additional documentation issues identified by QL-for-QL.
The remaining QL-for-QL alerts are false positives that will be fixed by https://github.com/github/codeql/pull/9884
I've started new evaluations for JavaScript, Python, Java, and Ruby.
Yep, looks great :-)
I've started new evaluations for JavaScript, Python, Java, and Ruby.
The evaluations are all looking good.