
TPU Execution for Occupancy Flow Tutorial

mosicr opened this issue 2 years ago • 2 comments

Hi, I am trying to run the Occupancy Flow Tutorial on a TPU. I modified the original code in a few places:

364 with strategy.scope():
365     while step < num_steps_to_train:
366         # Iterate over batches of the dataset.
367         inputs = next(it)
368         # loss_value = train_step(inputs)
369         loss_value = distributed_train_step(inputs)

Added strategy.scope() around _make_model and added a distributed_train_step:

117 with strategy.scope():
118     def _make_model(
        ...
163 @tf.function
164 def distributed_train_step(dataset_inputs):
165     per_replica_losses = strategy.run(train_step, args=(dataset_inputs,))
166     print(per_replica_losses)
167     return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses,
168                            axis=None)
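For reference, the general TF 2.x wiring I based this on looks roughly like the sketch below. It is not the tutorial's code: the per-replica computation is a dummy lambda, and tf.distribute.get_strategy() stands in for the TPUStrategy so the snippet runs anywhere.

    import tensorflow as tf

    # Sketch of the usual strategy.run wiring: distribute the dataset so each
    # replica gets its own shard of every global batch, and drive the step
    # through a tf.function that calls strategy.run and reduces the results.
    strategy = tf.distribute.get_strategy()  # stand-in for the TPUStrategy

    dataset = tf.data.Dataset.from_tensor_slices(tf.range(64)).batch(16)
    dist_dataset = strategy.experimental_distribute_dataset(dataset)
    it = iter(dist_dataset)

    @tf.function
    def distributed_step(dist_inputs):
        # Dummy per-replica computation in place of the real train_step.
        per_replica = strategy.run(lambda x: tf.reduce_sum(x), args=(dist_inputs,))
        return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica, axis=None)

    print(distributed_step(next(it)).numpy())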

It fails with:

457 Traceback (most recent call last):
458   File "occup3.py", line 369, in <module>
459     loss_value = distributed_train_step(inputs)
...
478 ValueError: in user code:
479
480     occup3.py:165 distributed_train_step  *
481         per_replica_losses = strategy.run(train_step,args=(dataset_inputs,))
482     occup3.py:355 train_step  *
483         optimizer.apply_gradients(zip(grads, model.trainable_weights))
493     ValueError: Trying to create optimizer slot variable under the scope for tf.distribute.Strategy (<tensorflow.python.distribute.tpu_strategy.TPUStrategyV2 object at 0x7f64971acd60>), which is different from the scope used for the original variable (<tf.Variable 'conv1_conv/kernel:0' shape=(7, 7, 23, 64) dtype=float32, numpy=
494     array([[[[ 4.82194126e-03, -2.13766117e-02,  5.23952395e-03, ...,
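As far as I can tell, the message complains that the optimizer's slot variables are being created under the TPUStrategy scope while the model variables (e.g. 'conv1_conv/kernel') were created outside of it. A minimal sketch of the layout the error seems to ask for is below; it uses a dummy model and data, and the default strategy stands in for the TPU so it can be run anywhere.

    import tensorflow as tf

    strategy = tf.distribute.get_strategy()  # stand-in for the TPUStrategy
    GLOBAL_BATCH_SIZE = 16

    # Model AND optimizer are created under the SAME strategy scope, so the
    # optimizer's slot variables land in the same scope as the model weights.
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
        optimizer = tf.keras.optimizers.Adam()
        # Per-example losses plus explicit averaging is the usual pattern for
        # custom training loops under tf.distribute.
        loss_fn = tf.keras.losses.MeanSquaredError(
            reduction=tf.keras.losses.Reduction.NONE)

    dataset = tf.data.Dataset.from_tensor_slices(
        (tf.random.normal([64, 8]), tf.random.normal([64, 1]))
    ).batch(GLOBAL_BATCH_SIZE)
    dist_dataset = strategy.experimental_distribute_dataset(dataset)

    def train_step(inputs):
        x, y = inputs
        with tf.GradientTape() as tape:
            per_example_loss = loss_fn(y, model(x, training=True))
            loss = tf.nn.compute_average_loss(
                per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
        grads = tape.gradient(loss, model.trainable_weights)
        optimizer.apply_gradients(zip(grads, model.trainable_weights))
        return loss

    @tf.function
    def distributed_train_step(dist_inputs):
        per_replica_losses = strategy.run(train_step, args=(dist_inputs,))
        return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses,
                               axis=None)

    for batch in dist_dataset:
        print(distributed_train_step(batch).numpy())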

mosicr · Apr 15 '22 12:04

Does it make a difference if you use the same scope around the earlier operations as well (when the timestep_grids and waypoints are being created)?

In case you have already solved it, can you share what fixed it for you?

rezama · May 20 '22 18:05

No luck so far. I experimented with scopes and @tf.function across the original code. I don't think it is possible to simply take the original code, add a strategy scope, decorate with @tf.function, and get properly parallelizable, TPU-executable code. AutoGraph is not yet at the stage where you can do that. For example, if I decorate the original train_step function (training should run on the TPU) with @tf.function, it fails with:

931   File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/function.py", line 566, in __call__
932     raise ValueError(
933 ValueError: Arguments and signature arguments do not match. got: 764, expected: 768

That means AutoGraph generates code and then later catches an error in the code it generated. This is probably a question for the AutoGraph team (Dan Moldovan).
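One thing that might help narrow it down is checking that tf.function plus TPUStrategy work at all in this environment, independent of the tutorial code. Below is a sanity-check sketch with a dummy Keras model; the empty tpu='' argument assumes a Colab-style setup and may need to change for other TPU environments.

    import tensorflow as tf

    # Connect to the TPU and build a TPUStrategy. The tpu='' argument is an
    # assumption for a Colab-style runtime; adjust for your environment.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    strategy = tf.distribute.TPUStrategy(resolver)

    # Dummy model, compiled and fit under the strategy scope.
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation='relu', input_shape=(16,)),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer='adam', loss='mse')

    x = tf.random.normal([1024, 16])
    y = tf.random.normal([1024, 1])
    # If this trains without errors, the TPU/strategy machinery itself is fine
    # and the problem is more likely in how the tutorial's train_step is traced.
    model.fit(x, y, batch_size=128, epochs=1)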

mosicr · May 21 '22 10:05