1. Untrackable flows in JavaScript; 2. JsSrc2Cpg vs Js2Cpg
I've been exploring jssrc2cpg and wanted to share some observations and questions about the package. (These issues also apply to the interactive CLI shell provided by Joern.)
1. Taint Tracking / Data Flow Issues
a. Nested Member Access: I have noticed that taint tracking works from an object to its direct property (if a is tainted, a.b is treated as tainted), but this doesn't reliably extend to further nested member access (e.g. a.b.c).
This seems unintuitive, since the taint from a is likely to propagate to any of a's method invocations or member accesses.
b. Control Structures within Lambdas (Anonymous Functions): If a control structure (e.g. if, while) is inside a lambda (anonymous) function and it references an identifier from an outer scope that is not passed as a parameter, the flow is interrupted.
// consider this example code snippet
const child_process = require('child_process');
function handle(req, res) {
  const tmpUrl = new URL(req.url, 'http://localhost')
  const params = tmpUrl.searchParams
  const cmd = params.get('cmd')
  setTimeout(() => {
    if(cmd) {
      const hi = cmd
      shellExec(cmd, () => {})
    }
  }, 10)
}
function shellExec(myCmd) {
  child_process.exec(myCmd)
}
Expectation: since req is a taint source, and cmd (derived from it) is eventually passed to the sink child_process.exec(), the flow should be tracked from the source to the sink.
// scala queries to define source and sink
val source = cpg.method("handle").parameter.index(1).l
val sink = cpg.call("exec").argument(1).l
However, the flow tracking breaks!
If I remove the lambda or the if, the flow is correctly recognized. The flow is recognized even when the if's predicate does not contain cmd
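For reference, the two variants described above can be sketched as follows. The names handleNoIf and handleNoLambda are made up for this illustration; everything else mirrors the original snippet. This only shows the shape of the code I tested, it is not a claim about how the engine behaves in general.

```javascript
const child_process = require('child_process');

function shellExec(myCmd) {
  child_process.exec(myCmd);
}

// Variant 1 (hypothetical name): lambda kept, `if` removed.
function handleNoIf(req, res) {
  const tmpUrl = new URL(req.url, 'http://localhost');
  const params = tmpUrl.searchParams;
  const cmd = params.get('cmd');
  setTimeout(() => {
    shellExec(cmd); // cmd is still captured from the outer scope
  }, 10);
}

// Variant 2 (hypothetical name): `if` kept, lambda removed.
function handleNoLambda(req, res) {
  const tmpUrl = new URL(req.url, 'http://localhost');
  const params = tmpUrl.searchParams;
  const cmd = params.get('cmd');
  if (cmd) {
    shellExec(cmd);
  }
}
```

With the same source and sink queries (adjusted for the method name), a flow is reported for each of these variants in my tests; only the combination of a lambda and a control structure inside it loses the flow.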
Are 1.a and 1.b known limitations, and are there recommended workarounds to handle this type of code?
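Since 1.a has no snippet above, here is a minimal sketch of the pattern I mean. The function names and the object shapes are invented for illustration and are not taken from a real project.

```javascript
const child_process = require('child_process');

// Hypothetical input shape: { b: 'ls' }
function directAccess(a) {
  // One level of member access: a -> a.b
  child_process.exec(a.b);
}

// Hypothetical input shape: { b: { c: 'ls' } }
function nestedAccess(a) {
  // Two levels of member access: a -> a.b -> a.b.c
  child_process.exec(a.b.c);
}
```

Taking the parameter a as the source and the first argument of exec as the sink, the directAccess shape is the case I see tracked, while the nestedAccess shape is the case that is not reliably tracked.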
2. jssrc2cpg vs js2cpg
I noticed the following statement in the jssrc2cpg README:
"Warning: This is work in progress. Use https://github.com/ShiftLeftSecurity/js2cpg as a mature alternative."
I've experimented with both jssrc2cpg and js2cpg, and I am aware that they use different parsers internally
However, I'm not clear on any major differences between the two, or the advantages of using js2cpg over jssrc2cpg
Could you please clarify:
What are the main reasons to prefer js2cpg?
Are there feature/support/accuracy gaps between the tools?
Is jssrc2cpg still being actively developed, or should new projects switch to js2cpg?
Thanks for taking the time to read and address these questions.
I am more than happy to elaborate on my testing or provide more code samples if needed. Any guidance, or even pointers to relevant issues/docs, would be much appreciated!
Regards, Arjun
System Information:
OS: Red Hat Linux 8.10 (Linux 4.xx)
Joern version: 4.0.335
I will comment on jssrc2cpg vs js2cpg first: The latter uses babel to transpile any JS or TS code to plain ES6 JS code and parses the result with the GraalJS parser. That leads to some problems, e.g., longer runtime (transpiling needs to download and install dependencies first), multi-module projects may cause trouble, and faulty source map handling may result in line / column info hiccups. But it supports everything that is in ES6. jssrc2cpg, on the other hand, parses all code with the babel parser directly, without any transpiling or dependency download / install. But some more modern constructs may not be handled yet, or may be handled incompletely.
I would say use jssrc2cpg unless you really rely on something in your code it can't handle.
For the dataflow issues: Afair, these two things are exactly points where the open source dataflow engine in Joern is somewhat limited. But I am not the expert there. Maybe @johannescoetzee can say something.
I haven't worked with the open source dataflow engine in quite a while, so can't say anything definitive on this. @DavidBakerEffendi or @fabsx00 would likely have a better idea about what's going on
a. Yes, this is strange; the over-approximation may not be recursive as I think it should be. The engine treats such accesses as calls and does not model the heap at all.
b. Lambdas are not fully supported in the data-flow engine. This is a known limitation, and there is currently some workaround based on selecting sources. This is not on any roadmap currently.
Thank you for the insights about jssrc2cpg vs js2cpg!
For the untrackable flows, what could be the possible workarounds for (a) and (b)? Could this be achieved by invoking the reachableByFlows() method on separate parts of the flow?
I think for both cases, you can try to select more nodes around your source/sink queries. E.g.,
(a) def sink = cpg.call("foo").argument.ast: This will include all AST children of call arguments for example
(b) You may need to make quite a weird/ugly query to consider the parent module and tag all first instances of identifiers that are shared. I believe the engine tries to do this, but this only works in the backwards direction out of the lambda, not forwards.
Thank you for the help!
"these two things are exactly points where the open source dataflow engine in Joern is somewhat limited"
Does this mean that there is a closed source dataflow engine that doesn't have these limitations? If so, are the differences documented somewhere? (Sorry if this is a naive question, but the existence of a different version of the engine is not apparent from the readme, or the docs on joern.io)