JS: Improve handling of spread arguments and rest parameters [shared data flow branch]
(Targeting js/shared-dataflow-branch)
This greatly improves our ability to handle spread arguments and rest parameters, as well as `.apply()` calls and the `arguments` object.
The basic idea is that each call materialises an array of arguments using a bunch of store edges, which is read back using read edges at the callee. But as usual there are a few complications, which I'll talk about below.
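As a rough sketch of the kind of code this targets (hypothetical `source`/`sink`, not from the actual test suite):

```js
function source() { return "tainted"; }  // hypothetical taint source
function sink(x) { console.log(x); }     // hypothetical sink

function callee(...rest) {
  // Conceptually, `rest` is read back (via read edges) from the argument
  // array that the call sites below materialised with store edges.
  sink(rest[1]);
}

const args = ["safe", source()];
callee(...args);          // spread: contents of `args` are stored into the argument array
callee.apply(null, args); // .apply(): same flow via the dynamic argument array
```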
### Split into static and dynamic argument arrays
The initial pruning phase of data flow does not track contents at all. So if we always pass arguments via contents, then all argument positions will get mixed up during the initial forward pass. This would happen even for call sites and functions that don't actually use spread/rest.
To ensure "simple" call sites and functions are handled precisely in the initial phase, we pass in two argument arrays, one for "static" arguments and one for "dynamic" arguments.
Arguments passed in via spread arguments or .apply() are called "dynamic arguments". Rest parameters and the arguments object are said to be "dynamic parameters".
- At each call site:
  - Arguments passed in via a spread operator, or appearing after a spread operator, are stored in the dynamic argument array.
  - The second argument of `.apply()` flows directly into the dynamic argument array.
  - All other positional arguments are stored in the static argument array.
- At each function:
  - Contents from the dynamic parameter array flow into the plain positional parameters.
  - Contents from the static and dynamic parameter arrays flow into rest parameters and `arguments`.
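For example, with hypothetical placeholder values `x`, `y`, and `more`:

```js
const x = 1, y = 2, more = [3, 4];

function f(a, b, ...rest) {
  // `a` and `b` are populated from the static argument array, plus reads
  // from the dynamic parameter array.
  // `rest` (like `arguments`) is populated from both arrays.
}

f(x, y);              // x, y -> static argument array (indices 0 and 1)
f(x, ...more);        // x -> static array; contents of `more` -> dynamic array
f(...more, y);        // y appears after a spread, so both go to the dynamic array
f.apply(null, more);  // second argument of .apply() flows directly into the dynamic array
```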
### Shifting indices
When a rest parameter or spread argument appears at an index > 0, we need to shift array indices. This currently requires synthesising a bunch of nodes and store/read edges for the first 10 array indices. This is similar to how Ruby does it.
When a rest parameter or spread argument appears at index 0, we just generate a local flow step instead, so we avoid a redundant zero-shift.
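A small sketch of why the shift is needed (hypothetical functions):

```js
function g(first, ...rest) {
  // `rest` starts at parameter index 1, so rest[0] corresponds to argument
  // index 1 at the call site: a read of rest[i] maps to argument index i + 1.
  return rest[0];
}
g("a", "b", "c"); // "b" — argument index 1, rest index 0

function h(...all) {
  // Rest parameter at index 0: indices line up, so no shift is needed and
  // a plain local flow step suffices.
  return all[0];
}
h("a", "b"); // "a"
```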
### Super calls in default constructors
Part of the reason for doing this is to get better support for flow through `super()` calls: since default constructors desugar into code containing a `super(...args)` call, we get spurious flow if we don't also handle spread arguments and rest parameters precisely. To be honest, I was expecting a bigger impact from this; in the end only 3 FPs were fixed, but it's still worth doing now that the work is done.
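For reference, a derived class with no explicit constructor behaves as if it contained `constructor(...args) { super(...args); }`, so precise spread/rest handling is what keeps the argument positions lined up:

```js
class Base {
  constructor(value) {
    this.value = value; // `value` should only receive argument index 0
  }
}

class Derived extends Base {
  // No constructor written; behaves as if it were:
  //   constructor(...args) { super(...args); }
}

const d = new Derived("only-argument-0");
```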
### Removing the implicit taint rule for arrays
With the old data flow library we mostly followed the rule that if any element of an array is tainted, then the whole array is tainted. This rule has been gradually relaxed as we've added support for precise tracking of array contents, but a few holdovers remain, such as array literals being tainted if any of their members are tainted.
Taint tracking just doesn't play nice with rest/spread if we keep that rule, so I'm moving towards removing it in this PR. To avoid breaking everything, however, I've added array element contents to `defaultImplicitTaintRead`, meaning that any sink or configuration-specific additional taint step can implicitly read array elements. Note that only configuration-specific taint steps can use the implicit read; steps from library models do not, so we may have some work to do in updating our models. But that's for a future PR.
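To illustrate with a hypothetical source/sink pair:

```js
function source() { return "tainted"; }  // hypothetical taint source
function sink(x) { console.log(x); }     // hypothetical sink

const arr = ["safe", source()];

// Old rule: `arr` itself was considered tainted because one element is.
// New behaviour: only the element content is tainted, but a sink can still
// implicitly read array elements (via the implicit taint read), so this is
// still expected to be flagged:
sink(arr);

// Explicit reads of the tainted element are unaffected:
sink(arr[1]);
```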
Note that there is a known bug in how implicit reads interact with use-use flow, and we get use-use flow from the capture library in some cases. I'm working on a fix for that in a separate PR. This DCA run shows the effect of merging that PR into this branch; it gives a significant speed-up for vscode.
### Performance
- Evaluation 1 shows neutral performance except for three projects, with the worst being a 200% regression in vscode.
- Evaluation 2, with @hvitved's store/load matching PR merged in on both sides, shows much less severe regression. It's still 10% in the worst case, however. (Note: this was based on an earlier version of both PRs)
- Evaluation 3 shows the effect of merging this PR along with PR: tracking AP length in stage 2, which recovers the performance and then some.
Although the potential performance fixes are still in flux, I think it would make sense to get this PR through review.
### Results
Results from the evaluations above show:
- 32 new TPs
  - 30 new TPs from detecting prototype pollution in various copies of `$.extend()` from a vulnerable version of jQuery.
  - 1 new TP from `__dirname` used in a shell command without escaping.
  - 1 new TP from `js/xss-through-dom`, though not sure if exploitable.
- 3 fixed FPs, due to spurious flow through spread/rest parameters where indices got messed up
- 5 new FPs (since we discover more flow paths in general)
  - 2 where a URL redirection is prevented by `encodeURIComponent`, which is not currently a sanitiser.
  - 1 due to call/return mismatch after flow into `a.href`, where the `a` element is shared in a captured variable, but data can't propagate across calls.
  - 2 due to complex code: a small templating engine, and one flowing on the wrong side of a `goog.isString(x)` check.
- 3 new "low quality" alerts where the data flow path is valid, but the query seems too broad or it goes through a test case