[Python] False Negative: Taint flow fails to penetrate function-local class definitions
Description
I have identified a false negative in Python DataFlow analysis where taint tracking is lost when a class is defined inside a function.
If a tainted variable is passed as an argument to a function, and that argument is subsequently used inside a class defined within that function (a function-local class), CodeQL fails to track the data flow to the class instance's attributes.
However, if a similar logic is applied using a top-level (module-level) class, the data flow is detected correctly. This suggests an issue with how data flow is handled across the scope boundary of locally defined classes.
Reproduction Case (False Negative)
In this example, taint_src is passed to constructor_field_001_T. The class A is defined inside the function and captures taint_src. The flow to os.system is NOT detected.
import os
def constructor_field_001_T(taint_src):
# Class defined inside the function scope
class A:
def __init__(self):
# ISSUE: The analyzer fails to track 'taint_src' from the
# outer function argument into this local class scope.
self.data = taint_src
self.sani = '_'
obj = A()
taint_sink(obj.data)
def taint_sink(o):
os.system(o)
if __name__ == "__main__":
taint_src = "taint_src_value"
constructor_field_001_T(taint_src)
Control Case (Working)
In this example, the class A is defined at the module level. The flow to os.system IS detected correctly.
import os
# Class defined at module level
class A:
def __init__(self):
# Accessing taint_src (as a global/captured in this context) works fine
self.data = taint_src
self.sani = '_'
def constructor_field_001_T(taint_src):
obj = A()
taint_sink(obj.data)
def taint_sink(o):
os.system(o)
if __name__ == "__main__":
taint_src = "taint_src_value"
constructor_field_001_T(taint_src)
Additional Control Case (Working: Explicit Argument)
Significantly, if I keep the class inside the function but pass taint_src as an explicit argument to __init__, the flow IS detected.
import os
def constructor_field_explicit_arg(taint_src):
# Class defined inside function
class A:
# Explicit argument instead of capture
def __init__(self, val):
self.data = val
# Passing taint explicitly
obj = A(taint_src)
taint_sink(obj.data)
def taint_sink(o):
os.system(o)
if __name__ == "__main__":
taint_src = "taint_src_value"
constructor_field_explicit_arg(taint_src)
CodeQL Query Used
I am using a standard DataFlow::Global configuration looking for the specific string literal flowing to os.system.
Click to view query
/**
* @name Python Taint Reproduction
* @kind path-problem
* @problem.severity error
* @id py/taint-reproduction
*/
import python
import semmle.python.dataflow.new.DataFlow
import semmle.python.dataflow.new.TaintTracking
class TaintSource extends DataFlow::Node {
TaintSource() {
exists(StrConst str |
str.getText() = "taint_src_value" and
this.asExpr() = str
)
}
}
class DangerousSink extends DataFlow::Node {
DangerousSink() {
exists(Call call |
(
call.getFunc().(Attribute).getName() = "system" and
call.getFunc().(Attribute).getObject().(Name).getId() = "os"
) and
this.asExpr() = call.getAnArg()
)
}
}
module TaintConfig implements DataFlow::ConfigSig {
predicate isSource(DataFlow::Node source) {
source instanceof TaintSource
}
predicate isSink(DataFlow::Node sink) {
sink instanceof DangerousSink
}
}
module TaintFlow = TaintTracking::Global<TaintConfig>;
import TaintFlow::PathGraph
from TaintFlow::PathNode source, TaintFlow::PathNode sink
where TaintFlow::flowPath(source, sink)
select sink.getNode(), source, sink, "Taint flow detected"
Expected Behavior
CodeQL should be able to track the taint_src argument into the __init__ method of the locally defined class A, eventually leading to the os.system sink, just as it does for top-level classes.
These test cases can be made a little less confusing by renaming the parameter def constructor_field_001_T(taint_src) to param. This shows that both cases work fine reading a global (taint_src), but the class-within-function case demonstrates failure to capture an outer function's parameter (param).
I've also tested a couple of related cases:
Works: reading an outer function's parameter:
def outer(param):
def inner():
return param
This will correctly pass taint through.
Doesn't work: reading an outer function's parameter from within a class definition:
def outer(param):
class C:
def member(self):
return param
return C().member()
Thank you, @smowton, for the quick validation and for narrowing down the reproduction case. It is very helpful to know that standard nested functions handle this correctly while local classes do not. Could you shed some light on the potential root cause here? I am curious if this is considered a bug/gap in the current DataFlow library (e.g., missing modeling for variable capture across ClassDef boundaries), or if it is a deliberate design choice or known limitation (perhaps due to complexity or performance trade-offs)? Understanding whether this is intended behavior or an implementation gap would be very valuable for my analysis work.
Not 100% sure yet but likely just a missing case in the dataflow implementation.