TPU Execution for Occupancy Flow Tutorial
Hi, I am trying to run the Occupancy Flow Tutorial on a TPU. I modified the original code in a few places:
```
364  with strategy.scope():
365    while step < num_steps_to_train:
366      # Iterate over batches of the dataset.
367      inputs = next(it)
368      # loss_value = train_step(inputs)
369      loss_value = distributed_train_step(inputs)
```
Added strategy.scope() around _make_model and defined a distributed_train_step wrapper:

```
117  with strategy.scope():
118    def _make_model(
         ...

163  @tf.function
164  def distributed_train_step(dataset_inputs):
165    per_replica_losses = strategy.run(train_step, args=(dataset_inputs,))
166    print(per_replica_losses)
167    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses,
168                           axis=None)
```
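(The iterator `it` in the loop above is assumed to come from a distributed dataset. A minimal sketch of that assumption; the tutorial's actual input pipeline may differ:)

```python
# Sketch, not the tutorial's code: per-replica batches for strategy.run
# are normally produced by distributing the dataset first.
dist_dataset = strategy.experimental_distribute_dataset(dataset)
it = iter(dist_dataset)
```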
It fails with:
```
Traceback (most recent call last):
  File "occup3.py", line 369, in <module>
    ...
ValueError: Trying to create optimizer slot variable under the scope for
tf.distribute.Strategy (<tensorflow.python.distribute.tpu_strategy.TPUStrategyV2
object at 0x7f64971acd60>), which is different from the scope used for the
original variable (<tf.Variable 'conv1_conv/kernel:0' shape=(7, 7, 23, 64)
dtype=float32, numpy=
array([[[[ 4.82194126e-03, -2.13766117e-02,  5.23952395e-03, ...,
```
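For what it's worth, this particular error usually means the model variables were created outside the strategy scope, so the optimizer's slot variables end up in a different scope. Note that wrapping the *definition* of _make_model in `with strategy.scope():` is not enough; the model (and the optimizer) must be instantiated while the scope is active. A minimal sketch of that pattern, with hypothetical names where the tutorial's own code is not shown:

```python
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
  # Both the model instance and the optimizer are created inside the
  # scope, so model variables and optimizer slot variables share it.
  model = _make_model()            # hypothetical: the tutorial's model factory
  optimizer = tf.keras.optimizers.Adam()
```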
Does it make a difference if you use the same scope around the earlier operations as well (when the timestep_grids and waypoints are being created)?
In case you have already solved it, can you share what fixed it for you?
No luck so far. I experimented with scopes and @tf.function across the original code.
I don't think it is possible to simply take the original code, add a scope, and decorate it with @tf.function to get properly parallelizable, TPU-executable code. AutoGraph is not yet at the stage where that works.
For example, if I decorate the original train_step function (training should run on the TPU) with @tf.function, it fails with:
```
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py",
    line 566, in __call__
    raise ValueError(
ValueError: Arguments and signature arguments do not match. got: 764, expected: 768
```
That means AutoGraph generates code and then later catches the error in the code it generated. This is probably a question for the AutoGraph team (Dan Moldovan).
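One experiment that sometimes narrows down this class of mismatch is pinning the traced signature explicitly, so a retrace with a differently flattened input structure fails early and visibly. This is only a sketch; the TensorSpec shapes, dtypes, and dict keys below are placeholders, not the tutorial's real input structure:

```python
import tensorflow as tf

# Placeholder specs: shapes, dtypes, and keys are assumptions for
# illustration, not the tutorial's actual inputs.
step_signature = [{
    'grids': tf.TensorSpec(shape=[None, 256, 256, 23], dtype=tf.float32),
}]

@tf.function(input_signature=step_signature)
def train_step(inputs):
  with tf.GradientTape() as tape:
    loss = tf.reduce_mean(inputs['grids'])  # placeholder loss
  # Gradient computation / apply_gradients omitted in this sketch.
  return loss
```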