codeql
codeql copied to clipboard
C++: Implement models-as-data
Implement models-as-data for C++. That is, support for CSV formatted flow sources, sinks and summaries that look something like this:
";;false;getc;;;ReturnValue;remote",
The implementation is ported from Swift, and uses the shared MAD library to do the heavy lifting. I've created a range of "synthetic" tests (that is, tests that use models defined in the tests), and also created "real" sources for getc and friends (which have corresponding "real" tests).
This is currently a draft PR, and is missing most of the results it should find in tests. Things I need to do:
- [x] address
TODO: fix this, there's no good reason for it.. - [ ] finish implementation of
encodeContent. - [x] get sinks working
- [ ] and add / convert a query sink to MAD?
- [x] get summaries working
- [ ] and add / convert a summary model to MAD?
- [ ] get indirection working
- syntax TBD
- [ ] address any other missing test cases?
- [ ] check performance on DCA
- [x] there are a couple of “pre-requisites” we expect to need to fix here
- [ ] add change note(s)
@MathiasVP I would appreciate an early review and constructive feedback, when you're available.
Thanks for the early review @MathiasVP . It's going to take me a bit of time to address everything, along with the known issues in this PR.
I’ve fixed TInterpretNode by removing the TDataFlowCall_ case - which is redundant as the Element case can already represent a call, and that ambiguity was causing stuff to get lost. Fixed up asCall and getCalltarget in that class. We get more results now, and we don’t need the bodge in shared/dataflow any more.
We do, however, introduce a new bodge in TReturnKind, which I believe will is a symptom of not having addressed the ‘pre-requisite’ tasks yet, so I’ll be addressing that next.
Just added four commits. Broadly:
- incorporated changes by Mathias which make dataflow
ArgumentPosition/ParameterPositionandReturnKindno longer depend onNode. This was referred to in the PR description above as "pre-requisites", and we expect to need it here soon. - fixed the issue with missing
TReturnKinds in one of the tests. - tried to document a few things better.
@MathiasVP I'd like your opinion on the last commit in particular, which modifies TReturnKind again. I think this is what I want but I'm not entirely confident with the predicates it's built atop, nor whether using Function.getUnspecifiedType() (even via Ssa::Function) is OK in data flow. If it's not, I suppose I could create an SSA wrapper / abstraction for what I need?
@MathiasVP I'd like your opinion on the last commit in particular, which modifies
TReturnKindagain. I think this is what I want but I'm not entirely confident with the predicates it's built atop, nor whether usingFunction.getUnspecifiedType()(even viaSsa::Function) is OK in data flow. If it's not, I suppose I could create an SSA wrapper / abstraction for what I need?
Thanks for pushing forward with this, Geoffrey! I've left some comments now about the last commit. If anything isn't clear please do speak out.
Just pushed 7 commits, implementing summary flow. Only a couple of FlowSummaryNode types work at the moment, as the new test shows, and that's a problem I haven't figured out yet. It's very difficult to debug. It could have something to do with any of the spread of TODOs in the work at the moment now that I look at it again.
Next I'm going to try and address some more of the existing PR comments (above).
Just pushed an implementation of SummaryOutNode, and I'm finally happy with the state of the FlowSummaryNode class / test overall.
The major feature that's left is indirection support, and that will fix a lot (not all) of the missing results in tests. I guess I'll make a couple of proposals for the syntax for that and get a discussion going.
I think all of the test cases involving taint flow to the qualifier ("self") fail as well, let me know if you have any idea what might be missing / broken there. Actually it might just be that I've forgotten to finish one of the things on the TODO list you sent me a few weeks ago ... I'll check through that tomorrow as well.
I've just pushed two commits that add some more test cases, then fix the issue where argument -1 was the qualifier address, rather than the qualifier object, by adding the indirection value. However this has uncovered another issue - summaries have stopped working in some cases where the qualifier is a temporary object. I believe this may be an inconsistency in that we get the qualifier object in these cases rather than the qualifier address (so specifying 1 level of indirection breaks flow instead of fixing it). I'm not sure where that's coming from, my guess is that one of our inputs to the MAD library is slightly off.
I'm not sure where that's coming from, my guess is that one of our inputs to the MAD library is slightly off
It may also be possible that this is a problem in our dataflow node <> expression mapping (and thus unrelated to MaD). I can investigate that tomorrow.
I'm not sure where that's coming from, my guess is that one of our inputs to the MAD library is slightly off
It may also be possible that this is a problem in our dataflow node <> expression mapping (and thus unrelated to MaD). I can investigate that tomorrow.
It was indeed such a problem. I've opened https://github.com/github/codeql/pull/15918 with a fix. Once we merge that PR into main and you merge main into this PR those results will reappear 🎉
I just pushed another batch of changes:
- indirection support (using the
Argument[*0]style syntax in the end). - more test cases (and fixed a couple of mistakes in existing test cases).
I think we're at the point where the features people will commonly use are all working. We're missing:
- indirection for sources + sinks.
- parameters as sources.
- qualifier fields as sources and sinks.
- flow models for global variables.
- function pointers.
I don't think we necessarily need to fix all of those right away, but I want to look into a couple of them a bit more and I'd appreciate any intuitions about what might be wrong. There are also still some unresolved discussions above, I intend to have another look at these this afternoon.
I just an early DCA run on this branch to see how things are looking (another will surely be wanted before we merge):
- new alert for
cpp/uncontrolled-allocation-sizeonvim__vim; the source is a call togetc, one of the new models-as-data sources, and it appears to be a correct result. :+1: - three new alerts or wobble for
cpp/unbounded-writeonnelson; again the sources are calls togetc. :+1: - "String cache size, per source" increases; I'm unsure if this is wobble or a real problem introduced by some part of the PR. I plan to run DCA again on the final version, if this is wobble it shouldn't show up again. ❓
- no significant performance change was observed.
- "String cache size, per source" increases; I'm unsure if this is wobble or a real problem introduced by some part of the PR. I plan to run DCA again on the final version, if this is wobble it shouldn't show up again. ❓
I don't think this is wobble. I think using MaD simply means a lot more string operations (because of all the string splitting done internally by the library). It may also be related to https://github.com/github/codeql/pull/15371/commits/76780d74d96e754a18cdfd98b03962c622f95a99, but I'm not too concerned about this
As far as I can tell, the still failing tests divide up as follows:
- five source/sink tests involving indirection.
- one sink test involving a parameter (rather than an argument).
- four summary field-to-field cases.
- two source/sink tests involving qualifier fields.
- five summary tests involving function pointers.
I'm going to hypothesize that none of the remaining source/sink cases can be fixed due to limitations in the shared library, namely that as I understand it, only very simple access paths are supported for sources / sinks.
As for the function pointers, I'm not sure. The summary nodes and summary calls all seem to be there. I don't know if you have any ideas, but I need someone to discuss this with.
There's been a lot of discussion and iteration on this PR which I think is now approaching completion. For the sake of everyone reading this I want to draw a line here:
I believe everything discussed above this line has been satisfactorily addressed. Please comment if you disagree or think I've missed anything. There are also seven potential follow-up issues (linked from the internal issue for this PR) mostly tracking the remaining MISSING results in the main test.
I'm aware that there is a merge conflict, and a possible future merge conflict with https://github.com/github/codeql/pull/15501 . Change notes need to be written, most likely some autoformatting needs to happen to make CI happy, and I should do a repeat DCA run.
I've just added the change notes and fixed the other minor issues. This PR is now ready for review. I will shortly begin a second DCA run.
The second DCA run:
- was technically a failure because two projects failed to build; caused by problems outside of this PR.
- overall analysis time is 2.8% slower, a little more than 0.8% on the first run; probably a mix of wobble and genuinely doing more work.
- string cache size increase of 37.4% (same as the first run). We agreed on the previous run that it probably comes from the shared library and was not too concerning.
- 4 new query results (same as the first run, all good results).
So there's probably a small performance loss, not unacceptable I'd say, but if there are any leads for possible improvements I would like to follow them. We may well be better off doing general performance work in future rather than pursuing tiny gains here.
I also ran a Swift DCA experiment, which was uneventful, unsurprisingly as the Swift changes in this PR are very minor (they just mirror CPP changes in a few places, mostly comments).
I've fixed the CI issues (hopefully) and the merge, though alert provenance is not properly computed (I've created another follow-up issue to cover that).
Ready for final approval + merge.