joern
joern copied to clipboard
reachableByFlows inacurate reports for format string C functions
Description
I've been playing with Joern recently to identify format string vulnerabilities in C code using data flow tracking with reachableByFlows
:
val sink = cpg.call.name("fprintf").argument.order(2)
val src = cpg.call.name("fgets").argument.order(1)
sink.reachableByFlows(src).p
By doing so, I identified that Joern will report the second argument of fprintf
as being tainted/user controlled even when it's not. This only happens when there is at least one level of indirection (the sink
function call is made within another function).
Version Information
- OS: Debian 12 (bookworm)
- Joern version:
1.1.1495
and2.0.128
Expected behavior
With the C example below, Joern will not report the sink
as being reachable by src
. This is the expected behavior.
#include <stdarg.h>
#include <stdlib.h>
#include <stdio.h>
int main() {
char *user_input;
char line[156];
user_input = fgets(line, 156, stdin);
fprintf(stderr,"literal format %s", line);
return 0;
}
Observed behavior
However, with the C example below that introduce a call to call
before calling fprintf
, Joern will report that sink
is reachable by src
.
#include <stdarg.h>
#include <stdlib.h>
#include <stdio.h>
void call(char *format, char* arg)
{
fprintf(stderr, format, arg);
}
int main() {
char *user_input;
char line[156];
user_input = fgets(line, 156, stdin);
call("literal format %s", line);
return 0;
}
Both of these code snippets do exactly the same thing and are reduced test cases.
Observed Paths (reachableByFlows)
If we apply taint tracking manually, this is what we get from Joern:
joern> val src = cpg.method.name("fgets", "__isoc99_fgets").callIn.argument(1)
src: Traversal[Expression] = Traversal
joern> val sink = cpg.method.name("fprintf").callIn.argument(2)
sink: Traversal[Expression] = Traversal
joern> sink.reachableByFlows(src).l
res2: List[Path] = List(
Path(
elements = List(
Identifier(
id = 24L,
argumentIndex = 1,
argumentName = None,
code = "line",
columnNumber = Some(value = 24),
dynamicTypeHintFullName = ArraySeq(),
lineNumber = Some(value = 14),
name = "line",
order = 1,
typeFullName = "char[156]"
),
Identifier(
id = 29L,
argumentIndex = 2,
argumentName = None,
code = "line",
columnNumber = Some(value = 31),
dynamicTypeHintFullName = ArraySeq(),
lineNumber = Some(value = 15),
name = "line",
order = 2,
typeFullName = "char[156]"
),
MethodParameterIn(
id = 9L,
code = "char* arg",
columnNumber = Some(value = 25),
dynamicTypeHintFullName = ArraySeq(),
evaluationStrategy = "BY_VALUE",
index = 2,
isVariadic = false,
lineNumber = Some(value = 5),
name = "arg",
order = 2,
typeFullName = "char"
),
Identifier(
id = 14L,
argumentIndex = 3,
argumentName = None,
code = "arg",
columnNumber = Some(value = 27),
dynamicTypeHintFullName = ArraySeq(),
lineNumber = Some(value = 7),
name = "arg",
order = 3,
typeFullName = "char"
),
Identifier(
id = 13L,
argumentIndex = 2,
argumentName = None,
code = "format",
columnNumber = Some(value = 19),
dynamicTypeHintFullName = ArraySeq(),
lineNumber = Some(value = 7),
name = "format",
order = 2,
typeFullName = "char"
)
)
)
)
Everything goes well, the line then arg values are tracked and suddenly format appears for no apparent reason.
Observed Paths (reachableBy)
Also tried using different approaches (reachableBy
instead of reachableByFlow
):
joern> val sink = cpg.call.name("fprintf").argument.order(2)
sink: Traversal[Expression] = Traversal
joern> val src = cpg.call.name("fgets").argument.order(1)
src: Traversal[Expression] = Traversal
joern> sink.reachableByFlows(src).p
res12: List[String] = List(
"""________________________________________________________________________
| nodeType | tracked | lineNumber| method| file |
|=======================================================================|
| Identifier | fgets(line, 156, ... | 14 | main | main.c |
| Identifier | call("literal for... | 15 | main | main.c |
| MethodParameterIn | call(char *format... | 5 | call | main.c |
| Identifier | fprintf(stderr, f... | 7 | call | main.c |
| Identifier | fprintf(stderr, f... | 7 | call | main.c |
"""
)
joern> val sink = cpg.call.name("fprintf").argument.order(2)
sink: Traversal[Expression] = Traversal
joern> val src = cpg.call.name("fgets").argument.order(1)
src: Traversal[Expression] = Traversal
joern> sink.reachableBy(src).p
res15: List[String] = List(
"(IDENTIFIER,24): ARGUMENT_INDEX: 1, CODE: line, COLUMN_NUMBER: 24, LINE_NUMBER: 14, NAME: line, ORDER: 1, TYPE_FULL_NAME: char[156]"
)
joern> val sink = cpg.call.name("fprintf").argument.order(2)
sink: Traversal[Expression] = Traversal
joern> val src = cpg.call.name("fgets").argument.order(1)
src: Traversal[Expression] = Traversal
joern> sink.reachableBy(src).l
res18: List[Expression] = List(
Identifier(
id = 24L,
argumentIndex = 1,
argumentName = None,
code = "line",
columnNumber = Some(value = 24),
dynamicTypeHintFullName = ArraySeq(),
lineNumber = Some(value = 14),
name = "line",
order = 1,
typeFullName = "char[156]"
)
)
Off the top of my head, it may be because of Joern over-approximating unseen code. You may need to define the semantics of fprintf
or fgets
e.g., https://github.com/joernio/joern/blob/e602413c7cbb143bf60b48390e6fbf7a80a30d38/dataflowengineoss/src/main/scala/io/joern/dataflowengineoss/DefaultSemantics.scala#L88
Still having the same results with a patch like this one:
diff --git a/dataflowengineoss/src/main/scala/io/joern/dataflowengineoss/DefaultSemantics.scala b/dataflowengineoss/src/main/scala/io/joern/dataflowengineoss/DefaultSemantics.scala
index fd8a1df0b..94d3becb9 100644
--- a/dataflowengineoss/src/main/scala/io/joern/dataflowengineoss/DefaultSemantics.scala
+++ b/dataflowengineoss/src/main/scala/io/joern/dataflowengineoss/DefaultSemantics.scala
@@ -112,6 +112,8 @@ object DefaultSemantics {
F("ferror", List((1, 1), (1, -1))),
F("fflush", List((1, 1), (1, -1))),
F("fgetc", List((1, 1), (1, -1))),
+ F("fgets", List((1, 1), (1, -1), (2, -1), (3, -1))),
+ F("fprintf", List((1, 1), (1, -1), (2, -1)),
F("fwrite", List((1, 1), (1, -1), (2, -1), (3, -1), (4, -1))),
F("free", List((1, 1))),
F("getc", List((1, 1))),
Are my definitions wrong ?
So yes, they are a bit wrong, but I think it's time I add a more comprehensive entry for this in the Joern documentation.
Semantics
-1
is the return value, and 0
is the receiver/base of the call (more of an OOP thing). So everything > 0
is the call arguments. The definitions define where flow is propagated (whitelisting) so anything not defined is assumed killed/sanitized. If a definition is not present, then it's over-approximated. This means you also have to be explicit when you define arguments that aren't sanitized.
The two functions don't really return anything, so I wouldn't add flows for -1
. However, we need to define the "identity" flows and also that some arguments do affect others, e.g. gets
reads data from one flow into another.
I was able to reproduce your error, and the outcome is pretty strange. The good news is that I managed to define flows that helped:
F("fgets", List((1, 1), (2, 2), (3, 3), (2, 1), (3, 1))),
F("fprintf", List((1, 1), (2, 2), (3, 3), (2, 1), (3, 1))),
The first three entries indicate that those arguments are not sanitized. The last two show that those arguments taint the first argument. fprintf
is special in that it has varargs and we don't have a symbol to handle that case. Earlier this year I included the PassThroughMapping
which creates a varargs-like mapping of x -> x
and x -> -1
for all x >= 0
. This can be added with PTF
instead of F
:
PTF("fgets", List((2, 1), (3, 1))),
PTF("fprintf", List((2, 1), (3, 1))),
Order versus ArgumentIndex
Another thing that is worth mentioning is that argumentIndex
!= order
. order
is based on the position of the node in relation to its AST siblings. argumentIndex
is similar but refers to the order with respect to AST siblings joined to the AST parent call via an additional ARGUMENT
edge. This is usually > 0 and the receiver (in OOP) has a RECEIVER
edge in addition to the AST edge.
This mostly makes a difference in OOP languages, but for safety, I would change your query to:
val sink = cpg.call.nameExact("fprintf").argument(2)
val src = cpg.call.nameExact("fgets").argument(1)
Another "optimisation" is using nameExact
over name
since the former uses indexing and the latter assumes regex.