codeql icon indicating copy to clipboard operation
codeql copied to clipboard

Customizing string representation of data flow nodes in SARIF or CSV results for Taint Tracking

Open saikatG opened this issue 1 year ago • 10 comments

Hi,

Currently, when running a cwe query such as TaintedPath (cwe 22) on a java project, I retrieve the CodeFlow for each result in the SARIF files as shown below. Is there an easy way to customize the string representation of nodes in the output, such as, in the message["text"] part for the node? For instance, for a method call I would like the string to have the format "package:class:methodname" instead of "methodname(...)" -- which is the default. Would I need to override the data flow node for this?

Thanks in advance!

...
 "codeFlows" : [ {
        "threadFlows" : [ {
          "locations" : [ {
            "location" : {
              "physicalLocation" : {
                "artifactLocation" : {
                  "uri" : "one-java-agent/src/main/java/com/alibaba/oneagent/AgentImpl.java",
                  "uriBaseId" : "%SRCROOT%",
                  "index" : 0
                },
                "region" : {
                  "startLine" : 81,
                  "startColumn" : 35,
                  "endColumn" : 69
                }
              },
              "message" : {
                "text" : "getFile(...) : String"
              }
            }
          },
....

saikatG avatar Apr 07 '24 20:04 saikatG

Hi @saikatG 👋

I am not sure there is a super easy way of doing that. Overriding the toString predicate may be possible here to make this work. However, since you have the location information and, presumably, also have the source code, you could retrieve the information you are after directly from the source code with the help of a Java parser. This is probably the easiest.

mbg avatar Apr 08 '24 11:04 mbg

Thanks @mbg. Overriding the toString of DataFlow Node works well!

saikatG avatar Apr 09 '24 14:04 saikatG

Hi CodeQL devs, re-opening this issue because I ran into a problem.

I tried something like this:

bindingset[n]
string getFullName(DataFlow::Node n){
   
  result = (n.asExpr().(Call)).getCallee().getDeclaringType().getQualifiedName() + "." + (n.asExpr().(Call)).getCallee().getName()
  or 
  result = (n.asExpr().(Argument)).getCall().getCallee().getDeclaringType().getAnAncestor().getQualifiedName() + "." +  (n.asExpr().(Argument)).getCall().getCallee().getName() + "::arg::" + (n.asExpr().(Argument)).getPosition()
  or
  result = (n.(DataFlow::ParameterNode).asParameter().getCallable().getDeclaringType().getQualifiedName() + "." + (n.(DataFlow::ParameterNode).asParameter().getCallable().getDeclaringType().getName()) + "::par::" + (n.(DataFlow::ParameterNode).asParameter().getPosition())) 
  or
  result = n.toString()
}
 class MyNode extends DataFlow::Node {
   override string toString() {
     result = getFullName(this)
   }
 }

It works well, but its turning out to be too expensive to compute in some cases. Any suggestions on how to optimize this?

saikatG avatar Apr 15 '24 21:04 saikatG

its turning out to be too expensive to compute in some cases

Could you provide more information? How does this manifest for you (long analysis time / out of memory errors)? How big is the Java project you are running this against?

mbg avatar Apr 16 '24 14:04 mbg

So codeql just takes a very long time to generate the csv/sarif files after the "Interpreting results" stage. I waited for 10-15 mins and then killed it. Ofc, it finishes fast w/o the string conversion code above. The project has 97k lines of code. I am running the XSS query, but that shouldn't be an important factor i think?

Running queries.
Compiling query plan for .../XSS.ql.
[1/1 comp 33.6s] Compiled .../XSS.ql.
Starting evaluation of .../XSS.ql.
[1/1 eval 10s] Evaluation done; writing results to .../XSS.bqrs.
Shutting down query evaluator.
Interpreting results.

saikatG avatar Apr 16 '24 14:04 saikatG

I assume you're running the CodeQL CLI directly on the command-line. What is the command you used and which version of CodeQL are you using?

I tried running the XSS query with your additions on the database for the repository you linked to and this worked fine for me using the VSCode extension (although I note that there were no results).

mbg avatar Apr 16 '24 15:04 mbg

Sorry about the missing details.

Yes, I am using the codeql cli 2.15.3. I use database analyze --rerun [db_dir] --format=sarif-latest --output=results.sarif [query_path]

I am using a custom version of the XSS query that uses my own list of 100 sources and 38 sinks. I think this as well may be causing the difference in execution. So, I have a MySources.qll and MySinks.qll file that looks like this:

predicate isMySourceMethod(Method m)
{
(
    m.getName() = "get" and
    m.getDeclaringType().getSourceDeclaration().hasQualifiedName("java.util", "Map<String,String>")
)
or
(
    m.getName() = "put" and
    m.getDeclaringType().getSourceDeclaration().hasQualifiedName("java.util", "Map<String,String>")
)
or 
...

and I have a MyXSSQuery.qll file that uses it

module MyXssConfig implements DataFlow::ConfigSig {
  predicate isSource(DataFlow::Node source) { 
    isMySourceMethod(source.asExpr().(MethodCall).getMethod()) 
  }
...

(Same for sinks)

I can't share the full files right now. But I hope this gives you an idea. Do you think this may be causing the slowdown? Interestingly, this still works ok when I don't use "toString". Note that I put the "toString" code in this MyXSSQuery.qll file and not the XSS.ql file.

saikatG avatar Apr 16 '24 16:04 saikatG

I am using the codeql cli 2.15.3.

This is a fairly old version. It may be worth upgrading to the latest version to see if that resolves the issue for you. (We release new versions every two weeks.)

I am using a custom version of the XSS query that uses my own list of 100 sources and 38 sinks. I think this as well may be causing the difference in execution.

This is very much possible of course. You could try to run the XSS query without your custom sources and sinks, but still with the overriden toString predicate, and see if that works OK. If that does work, I'd introduce a source/sink that you know will lead to a result being found and try it with just those.

Let me know what the outcomes of those experiments are and we can go from there.

mbg avatar Apr 16 '24 16:04 mbg

Thanks a lot @mbg! I will try it out.

While we are on this, another question for you: Is there an easy way to specify summaries with a custom tag? For instance, i see yml files in ext folder that contain specs for many libraries, but they only have "taint" or "value" as tags. Is it possible to use my own custom tag and only consider them as summaries during execution? I see that it can be done for sources and sinks with isSinkNode(node, tag) and isSourceNode(node, tag) predicates. Anything similar for summaries?

saikatG avatar Apr 16 '24 17:04 saikatG

While we are on this, another question for you: Is there an easy way to specify summaries with a custom tag? For instance, i see yml files in ext folder that contain specs for many libraries, but they only have "taint" or "value" as tags. Is it possible to use my own custom tag and only consider them as summaries during execution? I see that it can be done for sources and sinks with isSinkNode(node, tag) and isSourceNode(node, tag) predicates. Anything similar for summaries?

There is some documentation in the code which comments on this:

  1. The kind column is a tag that can be referenced from QL to determine to which classes the interpreted elements should be added. For example, for sources "remote" indicates a default remote flow source, and for summaries "taint" indicates a default additional taint step and "value" indicates a globally applicable value-preserving step. For neutrals the kind can be summary, source or sink to indicate that the neutral is neutral with respect to flow (no summary), source (is not a source) or sink (is not a sink).

So in other words, taint and value have special meanings that affect how the models are used.

What would you hope to accomplish with a custom tag?

mbg avatar Apr 17 '24 08:04 mbg