codeql
codeql copied to clipboard
C++: Add support for `Element` content
This PR adds support for Element content for C++ like we have for other CodeQL supported languages. Given code such as:
std::vector<int> v;
int x = source();
v.push_back(x);
// ...
int y = v.at(i);
sink(y);
we now track the flow into v by a store step that pushes Element onto the access path and transfers flow to v. Then, once we reach v.at(i) we pop Element off the access path and transfer flow to the return value. This means we now track container flow as data flow instead of taint flow.
This is all very standard stuff for other languages by now (Tom did this for C# almost four years ago), but there are (of course 😂) a few challenges for C++. Some of these I've resolved in this PR, and I'll create follow-up issues for the rest.
Commit-by-commit review recommended.
Signatures for templates
How do we distinguish between these two constructors in a MaD summary (from here):
explicit vector(size_type count);explicit vector(size_type count, const T& value = T(), const Allocator& alloc);
the obvious answer in MaD is to distinguish them by providing a signature. However, the signature for the latter constructor mentions the parameter T whose parameter name is given by a template (i.e., the enclosing class' template list). And we can't be sure that this template is named T across all implementations of the STL.
Likewise, consider this constructor:
template< class InputIt > vector( InputIt first, InputIt last, const Allocator& alloc);
whose parameters are given by the template list of the function itself.
To solve the above problem, I've introduced MaD syntax for introducing template parameters when specifying the class name and the function name. Staying true to the spirit of C++, I've chosen this syntax:
["std", "vector<T,Allocator>", _, "vector<InputIterator>", "(InputIterator,InputIterator,const Allocator &)", "", _, _, _, _]
As you can see, the signature mentions both Allocator (from the class template), and InputIterator (from the function template).
This ensures that this MaD row will match the right constructor no matter what the implementation picked as template parameter names.
Internally, the name (InputIterator,InputIterator,const Allocator &) is normalized to a name that doesn't mention the template parameters before being compared with a normalized version of the parameter list in the database.
Handling indirections
The obvious summary for a function like push_back is: Argument[*0] -> Argument[-1].Element[*].
That is, it reads the value pointed to by the 0'th argument (because push_back takes a parameter by reference), and stores it as an Element content on the qualifier.
However, consider this example:
std::vector<int*> v;
int x = source();
int* p = &x;
v.push_back(p);
...
int* q = v.at(i);
sink(*q);
In this case, we track *p (i.e., the indirection of p) when we perform push_back, which means that the value that actually flows into push_back is the address of *p. That is, the node reresenting **p is actually what ends up flowing to the 0'th argument of push_back.
To solve this, we could introduce another summary for push_back: Argument[**0] -> Argument[-1].Element[**].
However, that introduces quite a burden on the developer when adding MaD rows. For example, if we need to add an additional indirection to every model we need to introduce another MaD row to every function.
To solve this problem I introduced a kind of "template" (no pun intended) for specifying an arbitrary number of indirections. So the summary for push_back actually looks like: Argument[*@0] -> Argument[-1].Element[*@] where @ means (any number of indirections).
When consuming MaD rows we then expand the summaries to:
Argument[*0] -> Argument[-1].Element[*]Argument[**0] -> Argument[-1].Element[**]Argument[***0] -> Argument[-1].Element[***]Argument[****0] -> Argument[-1].Element[****]Argument[*****0] -> Argument[-1].Element[*****]
And I've capped the number to 5. Dataflow performance wise there should be no concerns regarding pushing this further up, but it does mean we produce a lot more strings. We could also consider capping this to
"the maximum number of indirections in the database". I'll let this be future work, though :joy:
I'm not super happy with the syntax I've chosen here, so I'm happy to bikeshed on this once we're happy with everything else in this PR.
Supporting associative containers
Initially, I wanted to also replace all of our taint models for std::map with MaD rows. However, I ran into a performance problem. To see what happens, consider the hypothetical MaD summary for std::map::insert:
std::pair<iterator, bool> insert( const value_type& value );
(ignoring @s for this example since it's not relevant for this discussion): Argument[*0] -> ReturnValue.Field[first].Element[*].
That is, the input is the pointer to the first argument (since insert takes a const reference), and returns a std::pair whose first element is an iterator to the inserted element.
Now: Which exact field does Field[first] refer to? Since std::pair<T, U> is a template there will be many many std::pair<T, U> instantiations in the database, and the above MaD row will create a store step that writes any of them to the access path. On a DB I was testing with there were ~1500 instantiations of std::pair<T, U> (i.e., ~1500 combinations of T and U) which made the flow summary library really unhappy.
To solve this problem we need to extend the MaD language so that we can annotate which first field we're talking about (i.e., it's the first field of the std::pair class whose template parameters are T = the template argument from the enclosing std::map and U = bool). I haven't done this yet, though.
To "solve" this performance problem I've simply chosen not to model any associative containers for now. However, anyone who ventures into doing this will hit the same performance problem. So we need to resolve this at some point (but not necessarily in this PR since this is large enough as-is).
DCA shows nothing spectacular:
- Performance is fine (this was my main worry 🎉)
- One lost result which has to do with iterators still not being fully converted to useFixed by 40fb59dc0bb7ed1d3d07e4047f2c854a1ae37b33Elementcontent. I expect this result to appear once we model iterators fully withElementcontent. - Two new FPs on
cpp/double-free. These appear becauseElementcontent allows us to have dataflow through containers now (instead of simply having it be taintflow), so we need to restrict container flow incpp/double-freesimilarly to how we restrict flow through ArrayExpr in that query.
I don't really understand the motivation for @. If we're not propagating indirect data properly through models, isn't that an issue with data-flow itself (or the indirections feature) - not an issue with the MAD representation?
I don't really understand the motivation for
@. If we're not propagating indirect data properly through models, isn't that an issue with data-flow itself (or the indirections feature) - not an issue with the MAD representation?
In general we don't infer a *n1 -> *n2 step when a model specifies a n1 -> n2 step. We actually used to do this (see this PR), but pulled it out pretty quickly because of your excellent comment on the PR 😂
The TLDR of your comment is that we cannot assume in general that *n1 -> *n2 holds just because a model gives us n1 -> n2. Your char *stringToUpper(char *source) example is one where we cannot assume this.
So the @ notation is a way to do what Jonas suggested here (as a response to your observation).
Thanks, that makes sense. I'll find time to review the code changes at a later point, please ping me if anything is waiting on this.
Sounds good. Thanks! I'm not blocked on getting this merged so feel fee to take all the time you need 😄
I don't agree with the
no-change-note-requiredtag. I think we need to explain the changes to the MAD syntax to users.
That's fair. I wanted to delay the change note until we were sure this was the format we wanted. Note that the syntax is backwards compatible with everything we had before. Once we're happy with this syntax I'll amend our existing documentation and add a proper change note (which may happen in a subsequent PR if that's okay with you).
Initial review. There's a lot going on here, so lots of comments, sorry.
No need to apologise for having many comments 😄
I wanted to delay the change note until we were sure this was the format we wanted.
That sounds sensible. There's also the information at the top of ExternalFlow.qll that needs to be updated with the new syntax (Element and @).
That sounds sensible. There's also the information at the top of
ExternalFlow.qllthat needs to be updated with the new syntax (Elementand@).
I've done (a version of) this in 0b4459db7587c20acf3219dc50f43f4a8968f42a:
- I've added QLDoc to mention how to introduce template names from class names and function names,
- I've added QLDoc to mention
@
I have not added QLDoc for Element content since we don't have QLDoc for Field content, nor do any other language have this. It should probably be done eventually, but I'd prefer to delay this.
Thanks for the updates, including the help and qldoc. I think we still need:
- [x] an answer to: "Shouldn't that be just Argument[-1].Element[@] then (i.e. 0 or more indirections - not 1 or more indirections)?".
- [x] after that I need to review the new models.
- [x] change note for the newly supported MAD syntax.
Then I think we can merge.
Yep, (1) and (3) are coming later today 👍 Thanks for summarizing it!
Yep, (1) and (3) are coming later today 👍 Thanks for summarizing it!
These are now completed in https://github.com/github/codeql/pull/16791/commits/5be948533c22525716758721635fe69095c11492 and https://github.com/github/codeql/pull/16791/commits/d7eac4d56772fad63e698a8cdbe97b9f2ab3764a
I think I've addressed all your 🦅 eyed comments on the models now 🤞