codeql
codeql copied to clipboard
How to get each Node in the dataflow paths with PathGraph
Description of the issue
Use Case
We are trying to use the dataflow analysis to get the explicitly accessed fields in a struct in Golang. To do that, we use the TaintTracking module, with source being the root struct, and sink being any(). The background is that we are trying to do some static analysis for Kubernetes controllers, which define complex nested struct types (e.g. RabbitMQ operator).
The taint analysis works like a charm. However, we are struggling to compute the path for the sink nodes. To get the struct field path, we need to get each DataFlow Node on the dataflow path from source to the sink, for all possible paths.
Example
Given the following Golang program, with a nested definition of MyType and MySubType:
type MyType struct {
subtype MySubType
}
type MySubType struct {
name string
id int
}
func main() {
foo := MyType{subtype: MySubType{name: "myName", id: 1}}
bar := foo.subtype.id
_ = bar
}
our objective is given taint source foo, we want to reach to sink bar (this works well),
and get the struct field path subtype.id.
Is there a clean way to get all field selection nodes on the path given the source and the sink?
Thanks a lot!
Does using SsaWithFields help?
from SsaWithFields ssaf
select ssaf, ssaf.getBaseVariable(), ssaf.getAUse()
This for example tells us that the dataflow node foo.subtype.id is a read of .subtype.id.
The tricky thing about walking flow paths and trying to accumulate the field-accesses seen along them is twofold: first there can be a large number of possible paths for a simple-seeming program, and second that there can be field-accesses that don't match how struct types are laid out. For example, if the program went
foo := MyType{subtype: MySubType{name: "myName", id: 1}}
bar := OtherType(otherField: foo.name.id}
baz := bar.otherField
Then the sequence of field accesses along the path from foo to baz is name, id, otherField, which of course doesn't match how the static struct types are laid out.
Could you perhaps give some other examples of what you'd want your query to return in this sort of case?
Thanks for the prompt reply! @smowton
Then the sequence of field accesses along the path from foo to baz is name, id, otherField, which of course doesn't match how the static struct types are laid out.
Yeah, we would expect .subtype.id in the case you mentioned. Perhaps we can get away by setting field write as the barrier. I think in our use cases such scenarios rarely happen.
Thanks for pointing us the direction!
from SsaWithFields ssaf
select ssaf, ssaf.getBaseVariable(), ssaf.getAUse()
Though one struggle we have is to separate different paths.
For example, in the following query, we want two separate paths: .subtype.id and .subtype.name
foo := MyType{subtype: MySubType{name: "myName", id: 1}}
bar := foo.subtype.id
baz := foo.subtype.name
The path-problem seems to be able to automatically group the nodes on the same path, but we did not find a good way to do that in the query itself.
Any insight would be greatly appreciated!
How about doing something like:
import go
Expr getAStructLiteralValue(StructLit e) {
exists(Expr element | element = e.getAnElement() |
if element instanceof KeyValueExpr
then result = element.(KeyValueExpr).getValue()
else result = element
)
}
string describeStructLiteralValue(Expr e) {
exists(Write w, Field f, DataFlow::Node base, DataFlow::Node value |
w.writesField(base, f, value) and value.asExpr() = e |
result = describeStructLiteralValue(base.asExpr()) + "." + f.getName()
or
result = f.getName()
)
}
string longestStuctLiteralValueDescription(Expr e) {
result = max(string s | s = describeStructLiteralValue(e) | s order by s.length())
}
module Blah implements DataFlow::ConfigSig {
predicate isSource(DataFlow::Node n) {
n.asExpr() = getAStructLiteralValue(_) and not n.asExpr() instanceof StructLit
}
predicate isSink(DataFlow::Node n) { any() }
}
module Flow = DataFlow::Global<Blah>;
from Flow::PathNode source, Flow::PathNode sink
where Flow::flowPath(source, sink)
and source != sink
select source, sink, longestStuctLiteralValueDescription(source.getNode().asExpr())
Note that by making the sources struct-literal members that are not themselves struct literals (i.e., "myName" and 1), and by using DataFlow instead of TaintTracking, we require that to find a result, there must have been a corresponding set of field accesses to extract the value. That is, you would not get a result for baz2 := foo.subtype, because we haven't fully unpacked the source value "myName" or 1 yet.
That means there's no need to look at the flow path explicitly -- we know if dataflow has found a result, the source has been fully extracted from its surrounding struct, so the field accesses required are exactly the fields it is nested within when the struct literal is created.
On the other hand if you want to highlight the places where the fields are accessed, then we need to do something different -- let me know if so.
@smowton Thanks again for the detailed proposed solution. Unfortunately the nested structs we are dealing with are deeply nested, and they could be reused in different structs, e.g.,
livenessProbe *v1.Probe
readinessProbe *v1.Probe
And the structs may be defined by external libraries.
We recently explored a bit more on the edges table produced by the path-problem query, and we believe we may pull this off by reconstructing the graph from all edges from the table.
We are currently figuring out how to get all the edges from query. The default query predicate from the PathGraph module seems to skip many implicit edges, such as implicit dereferences.
And the structs may be defined by external libraries.
Do you mean the type is defined externally, or that the instance of the struct is created externally? In other words, do you always have a struct literal that is the root of the path you want to explore, or can it be an external call to somepkg.GetAStruct() or the like?
We are currently figuring out how to get all the edges from query.
You should be aware here that the edges table, which is derived from PathNode.getASuccessor which in turn is driven by localFlowBigStep, as the name suggests deals in big steps, not the small steps seen by DataFlow::Node.getASuccessor. A big step is a sequence of small steps that ends at (a) a jump-step (one that discards call context, like a global variable write), (b) flow into or out of a call, (c) a store into or read from a struct, array or collection, or (d) a type-cast. The set of big-step terminators can also be augmented by the predicate neverSkip(DataFlow::Node) predicate of a dataflow configuration, though there may be performance consequences for introducing a lot of extra steps. Search for localFlowExit to see the details of how this is done.
Regarding the proper solution here, I have to say I still don't really understand your end goal. It might help to show a slightly more realistic example, and what problem or characteristic of the program you want CodeQL to highlight?
Hi @smowton , thanks for all the help. We have solved our problem by post-processing the SARIF file returned by the codeql, utilizing the path-problem.
To get the full path of the properties, we configured the never_skip predicate in the TaintFlow config to any(), so that we never skip any steps.
We defined the source of the taint analysis to be the root of the struct in our interest, and the sink to be any().
After running the query, we get the SARIF file and parse the paths from the results. By only looking at the selection nodes in the paths, we are able to get the full paths of the sinks.
In case CodeQL is doing field-sensitive taint analysis (where the taint flows into a subfield of another struct, and gets selected later), we handled it by maintaining a stack.
If anyone else encounter a use case like we do, we are happy to share our solution in greater detail.
Closing this as it is resolved now. Thanks again @smowton !!