codeql icon indicating copy to clipboard operation
codeql copied to clipboard

C++: Implement models-as-data

Open geoffw0 opened this issue 1 year ago • 6 comments

Implement models-as-data for C++. That is, support for CSV formatted flow sources, sinks and summaries that look something like this:

    ";;false;getc;;;ReturnValue;remote",

The implementation is ported from Swift, and uses the shared MAD library to do the heavy lifting. I've created a range of "synthetic" tests (that is, tests that use models defined in the tests), and also created "real" sources for getc and friends (which have corresponding "real" tests).

This is currently a draft PR, and is missing most of the results it should find in tests. Things I need to do:

  • [x] address TODO: fix this, there's no good reason for it..
  • [ ] finish implementation of encodeContent.
  • [x] get sinks working
    • [ ] and add / convert a query sink to MAD?
  • [x] get summaries working
    • [ ] and add / convert a summary model to MAD?
  • [ ] get indirection working
    • syntax TBD
  • [ ] address any other missing test cases?
  • [ ] check performance on DCA
    • [x] there are a couple of “pre-requisites” we expect to need to fix here
  • [ ] add change note(s)

@MathiasVP I would appreciate an early review and constructive feedback, when you're available.

geoffw0 avatar Jan 18 '24 13:01 geoffw0

Thanks for the early review @MathiasVP . It's going to take me a bit of time to address everything, along with the known issues in this PR.

geoffw0 avatar Jan 23 '24 12:01 geoffw0

I’ve fixed TInterpretNode by removing the TDataFlowCall_ case - which is redundant as the Element case can already represent a call, and that ambiguity was causing stuff to get lost. Fixed up asCall and getCalltarget in that class. We get more results now, and we don’t need the bodge in shared/dataflow any more.

We do, however, introduce a new bodge in TReturnKind, which I believe will is a symptom of not having addressed the ‘pre-requisite’ tasks yet, so I’ll be addressing that next.

geoffw0 avatar Jan 24 '24 18:01 geoffw0

Just added four commits. Broadly:

  • incorporated changes by Mathias which make dataflow ArgumentPosition/ParameterPosition and ReturnKind no longer depend on Node. This was referred to in the PR description above as "pre-requisites", and we expect to need it here soon.
  • fixed the issue with missing TReturnKinds in one of the tests.
  • tried to document a few things better.

@MathiasVP I'd like your opinion on the last commit in particular, which modifies TReturnKind again. I think this is what I want but I'm not entirely confident with the predicates it's built atop, nor whether using Function.getUnspecifiedType() (even via Ssa::Function) is OK in data flow. If it's not, I suppose I could create an SSA wrapper / abstraction for what I need?

geoffw0 avatar Feb 01 '24 17:02 geoffw0

@MathiasVP I'd like your opinion on the last commit in particular, which modifies TReturnKind again. I think this is what I want but I'm not entirely confident with the predicates it's built atop, nor whether using Function.getUnspecifiedType() (even via Ssa::Function) is OK in data flow. If it's not, I suppose I could create an SSA wrapper / abstraction for what I need?

Thanks for pushing forward with this, Geoffrey! I've left some comments now about the last commit. If anything isn't clear please do speak out.

MathiasVP avatar Feb 01 '24 18:02 MathiasVP

Just pushed 7 commits, implementing summary flow. Only a couple of FlowSummaryNode types work at the moment, as the new test shows, and that's a problem I haven't figured out yet. It's very difficult to debug. It could have something to do with any of the spread of TODOs in the work at the moment now that I look at it again.

Next I'm going to try and address some more of the existing PR comments (above).

geoffw0 avatar Feb 20 '24 14:02 geoffw0

Just pushed an implementation of SummaryOutNode, and I'm finally happy with the state of the FlowSummaryNode class / test overall.

The major feature that's left is indirection support, and that will fix a lot (not all) of the missing results in tests. I guess I'll make a couple of proposals for the syntax for that and get a discussion going.

I think all of the test cases involving taint flow to the qualifier ("self") fail as well, let me know if you have any idea what might be missing / broken there. Actually it might just be that I've forgotten to finish one of the things on the TODO list you sent me a few weeks ago ... I'll check through that tomorrow as well.

geoffw0 avatar Mar 05 '24 17:03 geoffw0

I've just pushed two commits that add some more test cases, then fix the issue where argument -1 was the qualifier address, rather than the qualifier object, by adding the indirection value. However this has uncovered another issue - summaries have stopped working in some cases where the qualifier is a temporary object. I believe this may be an inconsistency in that we get the qualifier object in these cases rather than the qualifier address (so specifying 1 level of indirection breaks flow instead of fixing it). I'm not sure where that's coming from, my guess is that one of our inputs to the MAD library is slightly off.

geoffw0 avatar Mar 12 '24 18:03 geoffw0

I'm not sure where that's coming from, my guess is that one of our inputs to the MAD library is slightly off

It may also be possible that this is a problem in our dataflow node <> expression mapping (and thus unrelated to MaD). I can investigate that tomorrow.

MathiasVP avatar Mar 12 '24 18:03 MathiasVP

I'm not sure where that's coming from, my guess is that one of our inputs to the MAD library is slightly off

It may also be possible that this is a problem in our dataflow node <> expression mapping (and thus unrelated to MaD). I can investigate that tomorrow.

It was indeed such a problem. I've opened https://github.com/github/codeql/pull/15918 with a fix. Once we merge that PR into main and you merge main into this PR those results will reappear 🎉

MathiasVP avatar Mar 14 '24 09:03 MathiasVP

I just pushed another batch of changes:

  • indirection support (using the Argument[*0] style syntax in the end).
  • more test cases (and fixed a couple of mistakes in existing test cases).

I think we're at the point where the features people will commonly use are all working. We're missing:

  • indirection for sources + sinks.
  • parameters as sources.
  • qualifier fields as sources and sinks.
  • flow models for global variables.
  • function pointers.

I don't think we necessarily need to fix all of those right away, but I want to look into a couple of them a bit more and I'd appreciate any intuitions about what might be wrong. There are also still some unresolved discussions above, I intend to have another look at these this afternoon.

geoffw0 avatar Mar 25 '24 11:03 geoffw0

I just an early DCA run on this branch to see how things are looking (another will surely be wanted before we merge):

  • new alert for cpp/uncontrolled-allocation-size on vim__vim; the source is a call to getc, one of the new models-as-data sources, and it appears to be a correct result. :+1:
  • three new alerts or wobble for cpp/unbounded-write on nelson; again the sources are calls to getc. :+1:
  • "String cache size, per source" increases; I'm unsure if this is wobble or a real problem introduced by some part of the PR. I plan to run DCA again on the final version, if this is wobble it shouldn't show up again. ❓
  • no significant performance change was observed.

geoffw0 avatar Apr 03 '24 13:04 geoffw0

  • "String cache size, per source" increases; I'm unsure if this is wobble or a real problem introduced by some part of the PR. I plan to run DCA again on the final version, if this is wobble it shouldn't show up again. ❓

I don't think this is wobble. I think using MaD simply means a lot more string operations (because of all the string splitting done internally by the library). It may also be related to https://github.com/github/codeql/pull/15371/commits/76780d74d96e754a18cdfd98b03962c622f95a99, but I'm not too concerned about this

MathiasVP avatar Apr 03 '24 14:04 MathiasVP

As far as I can tell, the still failing tests divide up as follows:

  • five source/sink tests involving indirection.
  • one sink test involving a parameter (rather than an argument).
  • four summary field-to-field cases.
  • two source/sink tests involving qualifier fields.
  • five summary tests involving function pointers.

I'm going to hypothesize that none of the remaining source/sink cases can be fixed due to limitations in the shared library, namely that as I understand it, only very simple access paths are supported for sources / sinks.

As for the function pointers, I'm not sure. The summary nodes and summary calls all seem to be there. I don't know if you have any ideas, but I need someone to discuss this with.

geoffw0 avatar Apr 08 '24 17:04 geoffw0

There's been a lot of discussion and iteration on this PR which I think is now approaching completion. For the sake of everyone reading this I want to draw a line here:


I believe everything discussed above this line has been satisfactorily addressed. Please comment if you disagree or think I've missed anything. There are also seven potential follow-up issues (linked from the internal issue for this PR) mostly tracking the remaining MISSING results in the main test.

I'm aware that there is a merge conflict, and a possible future merge conflict with https://github.com/github/codeql/pull/15501 . Change notes need to be written, most likely some autoformatting needs to happen to make CI happy, and I should do a repeat DCA run.

geoffw0 avatar Apr 10 '24 13:04 geoffw0

I've just added the change notes and fixed the other minor issues. This PR is now ready for review. I will shortly begin a second DCA run.

geoffw0 avatar Apr 10 '24 14:04 geoffw0

The second DCA run:

  • was technically a failure because two projects failed to build; caused by problems outside of this PR.
  • overall analysis time is 2.8% slower, a little more than 0.8% on the first run; probably a mix of wobble and genuinely doing more work.
  • string cache size increase of 37.4% (same as the first run). We agreed on the previous run that it probably comes from the shared library and was not too concerning.
  • 4 new query results (same as the first run, all good results).

So there's probably a small performance loss, not unacceptable I'd say, but if there are any leads for possible improvements I would like to follow them. We may well be better off doing general performance work in future rather than pursuing tiny gains here.


I also ran a Swift DCA experiment, which was uneventful, unsurprisingly as the Swift changes in this PR are very minor (they just mirror CPP changes in a few places, mostly comments).

geoffw0 avatar Apr 11 '24 16:04 geoffw0

I've fixed the CI issues (hopefully) and the merge, though alert provenance is not properly computed (I've created another follow-up issue to cover that).

geoffw0 avatar Apr 12 '24 18:04 geoffw0

Ready for final approval + merge.

geoffw0 avatar Apr 15 '24 08:04 geoffw0