
[Java] Dataflow through object

Open KylerKatz opened this issue 9 months ago • 4 comments

Hello,

I am getting false positives with some of my queries. They usually involve a source node flowing into an object, followed by other data from that same object flowing into a sink node.

Here is a simple example:

env.put(Context.SECURITY_CREDENTIALS, password);
LOG.error("connection error [{}], failover connection to [{}]", env.get(Context.PROVIDER_URL), this.ldapURI.toString());

In this case, password is a source and LOG.error is a sink. However, once password flows into env, any use of the object is flagged as a detection, even though this case has nothing to do with password. Is there any way to reduce cases like this?

Thank you

KylerKatz, Feb 05 '25 02:02

This looks like a fairly simple case of storing a value in a map under one key and then getting false-positive flow where a value is read from the map under a different key. This is a known problem, and it's due to the level of precision at which we currently track container content for flow in Java. We have a solution lined up that we've been wanting to implement for some time, but it's simply been a matter of priorities. Thank you for the report, though: signals like this help us prioritise, and this is definitely on our radar. Unfortunately, I cannot estimate when we'll be able to prioritise this, but I've linked this issue from our internal tracking issue, so I'll be able to give an update here once it's done.
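To make the precision issue concrete, here is a minimal Java sketch contrasting two ways of tracking map content: one taint bit for the whole container versus one per key. The classes `KeyInsensitiveMap` and `KeySensitiveMap` and the key strings are hypothetical names invented for this illustration; they are not CodeQL's implementation or API, only a toy model of the behaviour described above.

```java
import java.util.HashSet;
import java.util.Set;

public class MapTaintSketch {
    // Hypothetical model: a single taint bit shared by every entry in the map.
    static final class KeyInsensitiveMap {
        private boolean anyValueTainted = false;

        void put(String key, boolean valueTainted) {
            anyValueTainted |= valueTainted;
        }

        boolean getIsTainted(String key) {
            // The key is ignored, so every read looks tainted once any write was.
            return anyValueTainted;
        }
    }

    // Hypothetical model: taint is remembered per key.
    static final class KeySensitiveMap {
        private final Set<String> taintedKeys = new HashSet<>();

        void put(String key, boolean valueTainted) {
            if (valueTainted) {
                taintedKeys.add(key);
            } else {
                taintedKeys.remove(key);
            }
        }

        boolean getIsTainted(String key) {
            return taintedKeys.contains(key);
        }
    }

    // Mirrors the example: the password is stored under one key,
    // and the provider URL is read back under a different key.
    static boolean insensitiveFlagsProviderUrl() {
        KeyInsensitiveMap env = new KeyInsensitiveMap();
        env.put("credentials", true);   // stand-in for Context.SECURITY_CREDENTIALS
        env.put("providerUrl", false);  // stand-in for Context.PROVIDER_URL
        return env.getIsTainted("providerUrl");
    }

    static boolean sensitiveFlagsProviderUrl() {
        KeySensitiveMap env = new KeySensitiveMap();
        env.put("credentials", true);
        env.put("providerUrl", false);
        return env.getIsTainted("providerUrl");
    }

    public static void main(String[] args) {
        System.out.println("key-insensitive flags read: " + insensitiveFlagsProviderUrl());
        System.out.println("key-sensitive flags read:   " + sensitiveFlagsProviderUrl());
    }
}
```

In this toy model only the key-insensitive variant reports the read of the provider URL as tainted, which corresponds to the false positive in the original report.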

aschackmull, Feb 05 '25 08:02

Hello @aschackmull,

Thank you for the information.

KylerKatz, Feb 05 '25 18:02

Hello @aibaars and @hvitved,

I'm a researcher also using CodeQL to analyze Java programs, and I'm curious about the current precision of CodeQL's interprocedural data flow analysis for Java. After reviewing the documentation and the code related to the Dataflow classes in the open-source libraries, I found limited information on this topic. Specifically, I'd like to clarify a few points:

Virtual Method Resolution in Call Graph Construction: Does CodeQL's call graph construction support analyzing virtual method invocations (e.g., resolving the actual target methods at runtime based on object types)? If so, how precise is this analysis (e.g., does it use class hierarchy analysis, CHA, or more sophisticated techniques)?

Alias Analysis in Data Flow Tracking: Does CodeQL use alias analysis when constructing data flow graphs? If so, is the alias analysis context-sensitive, flow-sensitive, and/or field-sensitive? (From the example above, I suspect that CodeQL's analysis is field-insensitive.)

Handling of Dynamic Features: How does CodeQL handle Java features like reflection (e.g., Method.invoke()) or lambda expressions in data flow analysis? Are there known limitations in these scenarios?

Any insights would be greatly appreciated!

Wang-jiong-han, Jun 10 '25 10:06

> Does CodeQL's call graph construction support analyzing virtual method invocations

For Java and C#, yes. The class hierarchies are used to determine overrides, and all overrides are tried when computing data flow.
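As a rough illustration of what "all overrides are tried" means, the following Java sketch lists every known class that overrides a method declared on an interface. The helper `candidateTargets` and the `Handler` types are hypothetical names made up for this example; the sketch uses reflection to mimic a class-hierarchy lookup and is not how CodeQL computes its call graph.

```java
import java.util.ArrayList;
import java.util.List;

public class ChaSketch {
    interface Handler {
        String handle(String input);
    }

    static class EchoHandler implements Handler {
        public String handle(String input) {
            return input;
        }
    }

    static class ConstantHandler implements Handler {
        public String handle(String input) {
            return "ok";
        }
    }

    // Hypothetical hierarchy-based resolution: every known concrete subtype of
    // declaredType that declares the method is a candidate call target.
    static List<String> candidateTargets(Class<?> declaredType, String methodName,
                                         List<Class<?>> knownClasses) {
        List<String> targets = new ArrayList<>();
        for (Class<?> c : knownClasses) {
            if (c.isInterface() || !declaredType.isAssignableFrom(c)) {
                continue; // not a concrete subtype of the declared receiver type
            }
            try {
                c.getDeclaredMethod(methodName, String.class);
                targets.add(c.getSimpleName() + "." + methodName);
            } catch (NoSuchMethodException e) {
                // This class does not declare the method itself; skip it.
            }
        }
        return targets;
    }

    // For a call `handler.handle(x)` with static type Handler, both overrides
    // are candidates, regardless of which object exists at runtime.
    static List<String> targetsForHandlerCall() {
        List<Class<?>> known = List.of(EchoHandler.class, ConstantHandler.class, String.class);
        return candidateTargets(Handler.class, "handle", known);
    }

    public static void main(String[] args) {
        System.out.println(targetsForHandlerCall());
    }
}
```

Under this kind of resolution, a tainted argument passed to `handle` would be considered to reach `EchoHandler.handle` even when the runtime object is always a `ConstantHandler`.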

> Does CodeQL use alias analysis when constructing data flow graphs?

This is generally something that we do not handle, but some languages may have limited support.
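For readers less familiar with the term, the Java sketch below shows the kind of aliasing the question refers to: a value is written through one reference and read back through another reference to the same object. The `Box` class and method name are hypothetical, and the sketch only shows the runtime behaviour; it makes no claim about whether CodeQL reports flow for this particular shape.

```java
public class AliasSketch {
    // Hypothetical holder type with a single mutable field.
    static final class Box {
        String value;
    }

    static String readThroughAlias(String input) {
        Box a = new Box();
        Box b = a;        // b and a now refer to the same object
        b.value = input;  // write through the alias
        return a.value;   // read through the original reference
    }

    public static void main(String[] args) {
        System.out.println(readThroughAlias("secret"));
    }
}
```

At runtime the input reaches the return value, so an analysis that does not relate `a` and `b` could miss this flow.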

> How does CodeQL handle Java features like reflection (e.g., Method.invoke()) or lambda expressions in data flow analysis?

I don't know about Java, but C# also does some call resolution for reflection calls. Lambda calls are handled in both C# and Java, but in C# (I'm not sure about Java) lambdas are not tracked once they are put on the heap (e.g. assigned to fields).
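The two lambda shapes being distinguished can be sketched in Java as follows. The class and method names are hypothetical, and whether the heap-stored case is tracked in Java is not confirmed in this thread; the sketch only shows what "lambda in a local variable" versus "lambda assigned to a field" looks like.

```java
import java.util.function.Function;

public class LambdaFlowSketch {
    // The lambda lives on the heap once it is assigned to this field.
    private Function<String, String> stored;

    // Lambda kept in a local variable and invoked directly.
    static String viaLocalLambda(String input) {
        Function<String, String> wrap = s -> "[" + s + "]";
        return wrap.apply(input);
    }

    // Same lambda, but stored in a field before being invoked.
    String viaFieldLambda(String input) {
        this.stored = s -> "[" + s + "]";
        return this.stored.apply(input);
    }

    public static void main(String[] args) {
        System.out.println(viaLocalLambda("x"));
        System.out.println(new LambdaFlowSketch().viaFieldLambda("x"));
    }
}
```

Both methods behave identically at runtime; the difference only matters to an analysis that must resolve which lambda body a call to `apply` reaches.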

hvitved, Jun 17 '25 08:06