joern reachableByFlows inacurate reports for format string C functions

Description

I've been playing with Joern recently to identify format string vulnerabilities in C code using data flow tracking with reachableByFlows:

val sink = cpg.call.name("fprintf").argument.order(2) 
val src = cpg.call.name("fgets").argument.order(1)
sink.reachableByFlows(src).p

By doing so, I identified that Joern will report the second argument of fprintf as being tainted/user controlled even when it's not. This only happens when there is at least one level of indirection (the sink function call is made within another function).

Version Information

OS: Debian 12 (bookworm)
Joern version: 1.1.1495 and 2.0.128

Expected behavior

With the C example below, Joern will not report the sink as being reachable by src. This is the expected behavior.

#include <stdarg.h>
#include <stdlib.h>
#include <stdio.h>

int main() {
    char *user_input;
    char line[156];

    user_input = fgets(line, 156, stdin);
    fprintf(stderr,"literal format %s", line);
    return 0;
}

Observed behavior

However, with the C example below that introduce a call to call before calling fprintf, Joern will report that sink is reachable by src.

#include <stdarg.h>
#include <stdlib.h>
#include <stdio.h>

void call(char *format, char* arg)
{
  fprintf(stderr, format, arg);
}

int main() {
    char *user_input;
    char line[156];

    user_input = fgets(line, 156, stdin);
    call("literal format %s", line);
    return 0;
}

Both of these code snippets do exactly the same thing and are reduced test cases.

Observed Paths (reachableByFlows)

If we apply taint tracking manually, this is what we get from Joern:

joern> val src = cpg.method.name("fgets", "__isoc99_fgets").callIn.argument(1) 
src: Traversal[Expression] = Traversal

joern> val sink = cpg.method.name("fprintf").callIn.argument(2) 
sink: Traversal[Expression] = Traversal

joern> sink.reachableByFlows(src).l 
res2: List[Path] = List(
  Path(
    elements = List(
      Identifier(
        id = 24L,
        argumentIndex = 1,
        argumentName = None,
        code = "line",
        columnNumber = Some(value = 24),
        dynamicTypeHintFullName = ArraySeq(),
        lineNumber = Some(value = 14),
        name = "line",
        order = 1,
        typeFullName = "char[156]"
      ),
      Identifier(
        id = 29L,
        argumentIndex = 2,
        argumentName = None,
        code = "line",
        columnNumber = Some(value = 31),
        dynamicTypeHintFullName = ArraySeq(),
        lineNumber = Some(value = 15),
        name = "line",
        order = 2,
        typeFullName = "char[156]"
      ),
      MethodParameterIn(
        id = 9L,
        code = "char* arg",
        columnNumber = Some(value = 25),
        dynamicTypeHintFullName = ArraySeq(),
        evaluationStrategy = "BY_VALUE",
        index = 2,
        isVariadic = false,
        lineNumber = Some(value = 5),
        name = "arg",
        order = 2,
        typeFullName = "char"
      ),
      Identifier(
        id = 14L,
        argumentIndex = 3,
        argumentName = None,
        code = "arg",
        columnNumber = Some(value = 27),
        dynamicTypeHintFullName = ArraySeq(),
        lineNumber = Some(value = 7),
        name = "arg",
        order = 3,
        typeFullName = "char"
      ),
      Identifier(
        id = 13L,
        argumentIndex = 2,
        argumentName = None,
        code = "format",
        columnNumber = Some(value = 19),
        dynamicTypeHintFullName = ArraySeq(),
        lineNumber = Some(value = 7),
        name = "format",
        order = 2,
        typeFullName = "char"
      )
    )
  )
)

Everything goes well, the line then arg values are tracked and suddenly format appears for no apparent reason.

Observed Paths (reachableBy)

Also tried using different approaches (reachableBy instead of reachableByFlow):

joern> val sink = cpg.call.name("fprintf").argument.order(2) 
sink: Traversal[Expression] = Traversal

joern> val src = cpg.call.name("fgets").argument.order(1) 
src: Traversal[Expression] = Traversal

joern> sink.reachableByFlows(src).p 
res12: List[String] = List(
  """________________________________________________________________________
| nodeType          | tracked              | lineNumber| method| file   |
|=======================================================================|
| Identifier        | fgets(line, 156, ... | 14        | main  | main.c |
| Identifier        | call("literal for... | 15        | main  | main.c |
| MethodParameterIn | call(char *format... | 5         | call  | main.c |
| Identifier        | fprintf(stderr, f... | 7         | call  | main.c |
| Identifier        | fprintf(stderr, f... | 7         | call  | main.c |
"""
)

joern> val sink = cpg.call.name("fprintf").argument.order(2) 
sink: Traversal[Expression] = Traversal

joern> val src = cpg.call.name("fgets").argument.order(1) 
src: Traversal[Expression] = Traversal

joern> sink.reachableBy(src).p 
res15: List[String] = List(
  "(IDENTIFIER,24): ARGUMENT_INDEX: 1, CODE: line, COLUMN_NUMBER: 24, LINE_NUMBER: 14, NAME: line, ORDER: 1, TYPE_FULL_NAME: char[156]"
)

joern> val sink = cpg.call.name("fprintf").argument.order(2) 
sink: Traversal[Expression] = Traversal

joern> val src = cpg.call.name("fgets").argument.order(1) 
src: Traversal[Expression] = Traversal

joern> sink.reachableBy(src).l 
res18: List[Expression] = List(
  Identifier(
    id = 24L,
    argumentIndex = 1,
    argumentName = None,
    code = "line",
    columnNumber = Some(value = 24),
    dynamicTypeHintFullName = ArraySeq(),
    lineNumber = Some(value = 14),
    name = "line",
    order = 1,
    typeFullName = "char[156]"
  )
)

Oct 25 '23 08:10 qkaiser

Off the top of my head, it may be because of Joern over-approximating unseen code. You may need to define the semantics of fprintf or fgets e.g., https://github.com/joernio/joern/blob/e602413c7cbb143bf60b48390e6fbf7a80a30d38/dataflowengineoss/src/main/scala/io/joern/dataflowengineoss/DefaultSemantics.scala#L88

Dec 03 '23 13:12 DavidBakerEffendi

Still having the same results with a patch like this one:

diff --git a/dataflowengineoss/src/main/scala/io/joern/dataflowengineoss/DefaultSemantics.scala b/dataflowengineoss/src/main/scala/io/joern/dataflowengineoss/DefaultSemantics.scala
index fd8a1df0b..94d3becb9 100644
--- a/dataflowengineoss/src/main/scala/io/joern/dataflowengineoss/DefaultSemantics.scala
+++ b/dataflowengineoss/src/main/scala/io/joern/dataflowengineoss/DefaultSemantics.scala
@@ -112,6 +112,8 @@ object DefaultSemantics {
     F("ferror", List((1, 1), (1, -1))),
     F("fflush", List((1, 1), (1, -1))),
     F("fgetc", List((1, 1), (1, -1))),
+    F("fgets", List((1, 1), (1, -1), (2, -1), (3, -1))),
+    F("fprintf", List((1, 1), (1, -1), (2, -1)),
     F("fwrite", List((1, 1), (1, -1), (2, -1), (3, -1), (4, -1))),
     F("free", List((1, 1))),
     F("getc", List((1, 1))),

Are my definitions wrong ?

Dec 12 '23 15:12 qkaiser

So yes, they are a bit wrong, but I think it's time I add a more comprehensive entry for this in the Joern documentation.

Semantics

-1 is the return value, and 0 is the receiver/base of the call (more of an OOP thing). So everything > 0 is the call arguments. The definitions define where flow is propagated (whitelisting) so anything not defined is assumed killed/sanitized. If a definition is not present, then it's over-approximated. This means you also have to be explicit when you define arguments that aren't sanitized.

The two functions don't really return anything, so I wouldn't add flows for -1. However, we need to define the "identity" flows and also that some arguments do affect others, e.g. gets reads data from one flow into another.

I was able to reproduce your error, and the outcome is pretty strange. The good news is that I managed to define flows that helped:

    F("fgets", List((1, 1), (2, 2), (3, 3), (2, 1), (3, 1))),
    F("fprintf", List((1, 1), (2, 2), (3, 3), (2, 1), (3, 1))),

The first three entries indicate that those arguments are not sanitized. The last two show that those arguments taint the first argument. fprintf is special in that it has varargs and we don't have a symbol to handle that case. Earlier this year I included the PassThroughMapping which creates a varargs-like mapping of x -> x and x -> -1 for all x >= 0. This can be added with PTF instead of F:

    PTF("fgets", List((2, 1), (3, 1))),
    PTF("fprintf", List((2, 1), (3, 1))),

Order versus ArgumentIndex

Another thing that is worth mentioning is that argumentIndex != order. order is based on the position of the node in relation to its AST siblings. argumentIndex is similar but refers to the order with respect to AST siblings joined to the AST parent call via an additional ARGUMENT edge. This is usually > 0 and the receiver (in OOP) has a RECEIVER edge in addition to the AST edge.

This mostly makes a difference in OOP languages, but for safety, I would change your query to:

      val sink = cpg.call.nameExact("fprintf").argument(2)
      val src = cpg.call.nameExact("fgets").argument(1)

Another "optimisation" is using nameExact over name since the former uses indexing and the latter assumes regex.

Dec 12 '23 17:12 DavidBakerEffendi

joern joern copied to clipboard

reachableByFlows inacurate reports for format string C functions

Description

Version Information

Expected behavior

Observed behavior

Observed Paths (reachableByFlows)

Observed Paths (reachableBy)

Semantics

Order versus ArgumentIndex

joern
joern copied to clipboard