Adding duplicate tensor data sources during dataset enumeration
I am seeing in the logs multiple SSA variables corresponding to a single Python variable being added as tensor dataset sources. This might be okay; dataset sources aren't really the same as tensor sources not stemming from datasets. In other words, the dataset holds the tensors, whereas in the non-dataset case, the tensors are generated from some API (e.g., tf.ones()).
But I'm unsure. It makes the expected test values look odd (why are there so many tensor variables in a single function?). It may also cause confusion when we start tracking shapes/dtypes for TF2 APIs. I don't think it would hurt to mark only the final SSA variable as a tensor dataset source (though one could question how we are representing dataset sources in general, particularly when it comes to tracking shapes; should they be kept separate?).
One could also ask why, in the SSA, there are multiple variables representing the same Python variable.
Example
This is what I am seeing:
# Test enumerate. The first element of the tuple returned isn't a tensor.
import tensorflow as tf

def f(a):
    pass

def g(a):
    pass

dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])

for step, element in enumerate(dataset, 1):
    f(step)
    g(element)
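As a side note on the test comment above: `enumerate(dataset, 1)` yields `(counter, item)` tuples whose first component is a plain `int`, never a tensor. Since that behavior doesn't depend on the iterable's element type, it can be checked with a list standing in for the dataset:

```python
# enumerate() pairs a plain int counter with each item; only the second
# component of each tuple comes from the underlying iterable. A list
# stands in for the tf.data.Dataset here.
pairs = list(enumerate(["t0", "t1", "t2"], 1))
print(pairs)  # [(1, 't0'), (2, 't1'), (3, 't2')]
assert all(isinstance(step, int) for step, _ in pairs)
```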
That's the input. In the logs, there are two SSA variables representing element, namely, v278 and v282:
106 v265 = invokeFunction < PythonLoader, LCodeBody, do()LRoot; > v3,v249,v258:#1 @106 exception:v266tf2_test_dataset11.py [16:21] -> [16:42] [265=[temp 3]3=[enumerate]249=[dataset]]
BB3
107 v269 = new <PythonLoader,Ltuple>@107 tf2_test_dataset11.py [3:0] -> [18:14]
108 v271 = global:global step tf2_test_dataset11.py [16:4] -> [16:8]
109 fieldref v269.v257:#0 = v271 = v271 tf2_test_dataset11.py [3:0] -> [18:14]
110 v273 = global:global element tf2_test_dataset11.py [16:10] -> [16:17]
111 fieldref v269.v259:#1 = v273 = v273 tf2_test_dataset11.py [3:0] -> [18:14]
112 v274 = a property name of v265 <no information> [265=[temp 3]]
113 v276 = fieldref v274.v257:#0 tf2_test_dataset11.py [16:4] -> [16:8] [276=[step]]
115 v278 = fieldref v274.v259:#1 tf2_test_dataset11.py [16:10] -> [16:17] [278=[element]]
117 v267 = binaryop(ne) v268:#null , v274 tf2_test_dataset11.py [3:0] -> [18:14]
118 conditional branch(eq, to iindex=-1) v267,v257:#0tf2_test_dataset11.py [3:0] -> [18:14]
BB4
119 v280 = new <PythonLoader,Ltuple>@119 tf2_test_dataset11.py [3:0] -> [18:14]
120 fieldref v280.v257:#0 = v276 = v276 tf2_test_dataset11.py [3:0] -> [18:14] [276=[step]]
121 fieldref v280.v259:#1 = v278 = v278 tf2_test_dataset11.py [3:0] -> [18:14] [278=[element]]
122 v279 = fieldref v265.v280 tf2_test_dataset11.py [3:0] -> [18:14] [265=[temp 3]]
123 v281 = fieldref v279.v257:#0 tf2_test_dataset11.py [16:4] -> [16:8] [281=[step]]
125 v282 = fieldref v279.v259:#1 tf2_test_dataset11.py [16:10] -> [16:17] [282=[element]]
Producing the following corresponding logs:
[INFO] Added dataflow source from tensor dataset: [Node: <Code body of function Lscript tf2_test_dataset11.py> Context: CallStringContext: [ com.ibm.wala.FakeRootClass.fakeRootMethod()V@2 ], v278]:[Empty].
[INFO] Added dataflow source from tensor dataset: [Node: <Code body of function Lscript tf2_test_dataset11.py> Context: CallStringContext: [ com.ibm.wala.FakeRootClass.fakeRootMethod()V@2 ], v282]:[Empty].
We really only need the second one, I think, unless there is some particular reason for multiple SSA variables to represent the same Python variable. But since v278 is never referenced again, I would say not.
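Reading the SSA above, one plausible account of the duplication (an interpretation on my part, not confirmed against the actual CAst lowering) is that the loop target tuple gets unpacked twice: once from the property name, i.e., the key (v276/v278), and once from the value looked up with that key (v281/v282), with only the second unpack feeding the loop body. A hypothetical Python model of that lowering:

```python
def lowered_for_each(obj):
    """Hypothetical model of the for-each lowering seen in the SSA above.

    The (step, element) tuple is unpacked twice: once from the property
    name (key) -- cf. v276/v278 -- and once from the value retrieved with
    that key -- cf. v281/v282. Only the second unpack is used afterwards.
    """
    results = []
    for key in obj:                       # v274 = a property name of v265
        step_k, element_k = key           # v276, v278: first unpack, dead below
        value = obj[step_k, element_k]    # v279 = fieldref v265.v280
        step, element = value             # v281, v282: second unpack, actually used
        results.append((step, element))
    return results
```

Under this model, a container mapping each `(step, element)` key to itself reproduces the pattern: the first unpack (v278's analogue) is dead, matching the observation that v278 is never referenced again.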
I think this is happening because we (interprocedurally) process two different kinds of SSA instructions: those corresponding to for-each statements and those corresponding to field reads:
[FINE] Processing instruction: 274 = a property name of 265.
[INFO] Using interprocedural analysis to find potential tensor iterable definition for use: 265 of instruction: 274 = a property name of 265.
[INFO] Added dataflow source from tensor dataset: [Node: <Code body of function Lscript tf2_test_dataset11.py> Context: CallStringContext: [ com.ibm.wala.FakeRootClass.fakeRootMethod()V@2 ], v278]:[Empty].
...
[FINE] Processing instruction: 279 = fieldref 265.280.
[INFO] Using interprocedural analysis to find potential tensor iterable definition for use: 265 of instruction: 279 = fieldref 265.280.
[INFO] Added dataflow source from tensor dataset: [Node: <Code body of function Lscript tf2_test_dataset11.py> Context: CallStringContext: [ com.ibm.wala.FakeRootClass.fakeRootMethod()V@2 ], v282]:[Empty].
I believe we did this to make the analysis more robust to different situations, i.e., there is some other situation where we read tensors from datasets but only one of these instructions appears. And since dataset reads aren't considered "tensor generators," this might be OK. But we do add tensor data sources the same way in both cases. I'm unsure whether they need to be distinguished, but at least for now they don't seem to be causing a problem.
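If the extra sources do become a problem (e.g., once shapes/dtypes are tracked), one option floated above is to keep only the final SSA variable per represented Python variable. A minimal sketch, assuming each candidate source is a hypothetical `(value_number, position)` pair (not the actual WALA representation), where the position is the source range, e.g., [16:10] -> [16:17] for element:

```python
def keep_final_source_per_position(sources):
    """Collapse duplicate tensor-dataset sources, keeping the highest SSA
    value number (i.e., the latest definition) for each source position.
    Hypothetical (value_number, position) pairs stand in for the actual
    WALA source representation.
    """
    final = {}
    for value_number, position in sources:
        prev = final.get(position)
        if prev is None or value_number > prev:
            final[position] = value_number
    return sorted((v, p) for p, v in final.items())

# v278 and v282 both represent `element` at [16:10] -> [16:17];
# only v282, the later definition, survives.
sources = [(278, "[16:10]-[16:17]"), (282, "[16:10]-[16:17]")]
print(keep_final_source_per_position(sources))  # [(282, '[16:10]-[16:17]')]
```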