codeql icon indicating copy to clipboard operation
codeql copied to clipboard

[Python] False Negative: Taint flow fails to penetrate function-local class definitions

Open android1257 opened this issue 1 month ago • 4 comments

Description

I have identified a false negative in Python DataFlow analysis where taint tracking is lost when a class is defined inside a function.

If a tainted variable is passed as an argument to a function, and that argument is subsequently used inside a class defined within that function (a function-local class), CodeQL fails to track the data flow to the class instance's attributes.

However, if a similar logic is applied using a top-level (module-level) class, the data flow is detected correctly. This suggests an issue with how data flow is handled across the scope boundary of locally defined classes.

Reproduction Case (False Negative)

In this example, taint_src is passed to constructor_field_001_T. The class A is defined inside the function and captures taint_src. The flow to os.system is NOT detected.

import os

def constructor_field_001_T(taint_src):  
    # Class defined inside the function scope
    class A:
        def __init__(self):
            # ISSUE: The analyzer fails to track 'taint_src' from the 
            # outer function argument into this local class scope.
            self.data = taint_src       
            self.sani = '_'

    obj = A()
    taint_sink(obj.data)

def taint_sink(o):
    os.system(o)

if __name__ == "__main__":
    taint_src = "taint_src_value"
    constructor_field_001_T(taint_src)

Control Case (Working)

In this example, the class A is defined at the module level. The flow to os.system IS detected correctly.

import os

# Class defined at module level
class A:
    def __init__(self):
        # Accessing taint_src (as a global/captured in this context) works fine
        self.data = taint_src       
        self.sani = '_'

def constructor_field_001_T(taint_src):  
    obj = A()
    taint_sink(obj.data)

def taint_sink(o):
    os.system(o)

if __name__ == "__main__":
    taint_src = "taint_src_value"
    constructor_field_001_T(taint_src)

Additional Control Case (Working: Explicit Argument)

 Significantly, if I keep the class inside the function but pass taint_src as an explicit argument to __init__, the flow IS detected.

import os

def constructor_field_explicit_arg(taint_src):
    # Class defined inside function
    class A:
        # Explicit argument instead of capture
        def __init__(self, val):
            self.data = val

    # Passing taint explicitly
    obj = A(taint_src)
    taint_sink(obj.data)

def taint_sink(o):
    os.system(o)

if __name__ == "__main__":
    taint_src = "taint_src_value"
    constructor_field_explicit_arg(taint_src)

CodeQL Query Used

I am using a standard DataFlow::Global configuration looking for the specific string literal flowing to os.system.

Click to view query
/**
 * @name Python Taint Reproduction
 * @kind path-problem
 * @problem.severity error
 * @id py/taint-reproduction
 */

import python
import semmle.python.dataflow.new.DataFlow
import semmle.python.dataflow.new.TaintTracking

class TaintSource extends DataFlow::Node {
  TaintSource() {
    exists(StrConst str |
      str.getText() = "taint_src_value" and
      this.asExpr() = str
    )
  }
}

class DangerousSink extends DataFlow::Node {
  DangerousSink() {
    exists(Call call |
      (
        call.getFunc().(Attribute).getName() = "system" and
        call.getFunc().(Attribute).getObject().(Name).getId() = "os"
      ) and
      this.asExpr() = call.getAnArg()
    )
  }
}

module TaintConfig implements DataFlow::ConfigSig {
  predicate isSource(DataFlow::Node source) {
    source instanceof TaintSource
  }

  predicate isSink(DataFlow::Node sink) {
    sink instanceof DangerousSink
  }
}

module TaintFlow = TaintTracking::Global<TaintConfig>;
import TaintFlow::PathGraph

from TaintFlow::PathNode source, TaintFlow::PathNode sink
where TaintFlow::flowPath(source, sink)
select sink.getNode(), source, sink, "Taint flow detected"

Expected Behavior

CodeQL should be able to track the taint_src argument into the __init__ method of the locally defined class A, eventually leading to the os.system sink, just as it does for top-level classes.

android1257 avatar Dec 02 '25 09:12 android1257

These test cases can be made a little less confusing by renaming the parameter def constructor_field_001_T(taint_src) to param. This shows that both cases work fine reading a global (taint_src), but the class-within-function case demonstrates failure to capture an outer function's parameter (param).

smowton avatar Dec 02 '25 12:12 smowton

I've also tested a couple of related cases:

Works: reading an outer function's parameter:

def outer(param):
  def inner():
    return param

This will correctly pass taint through.

Doesn't work: reading an outer function's parameter from within a class definition:

def outer(param):
  class C:
    def member(self):
      return param
  return C().member()

smowton avatar Dec 02 '25 12:12 smowton

Thank you, @smowton, for the quick validation and for narrowing down the reproduction case. It is very helpful to know that standard nested functions handle this correctly while local classes do not. Could you shed some light on the potential root cause here? I am curious if this is considered a bug/gap in the current DataFlow library (e.g., missing modeling for variable capture across ClassDef boundaries), or if it is a deliberate design choice or known limitation (perhaps due to complexity or performance trade-offs)? Understanding whether this is intended behavior or an implementation gap would be very valuable for my analysis work.

android1257 avatar Dec 03 '25 08:12 android1257

Not 100% sure yet but likely just a missing case in the dataflow implementation.

smowton avatar Dec 03 '25 11:12 smowton