tf.keras.mixed_precision.LossScaleOptimizer causes Graph execution error when using tfa.optimizers.MultiOptimizer and mixed_precision

Open Farbdose opened this issue 3 years ago • 3 comments

System information.

  • Have I written custom code (as opposed to using a stock example script provided in Keras): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04.4 LTS (4.15.0-191-generic)
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v2.10.0-rc3-6-g359c3cdfc5f 2.10.0
  • Python version: 3.10.6
  • Bazel version (if compiling from source):
  • GPU model and memory: GPU 0: NVIDIA GeForce GTX TITAN X 12GB
  • Exact command to reproduce:

Describe the problem.

I have a very large custom codebase and am trying to train a model. If I enable mixed_precision and use tfa.optimizers.MultiOptimizer at the same time, I get a very strange Graph Execution Error caused by something in the LossScaleOptimizer. If this is something that should be posted to StackOverflow instead, please let me know.

Describe the current behavior.

Execution stops with an InvalidArgumentError: Graph execution error. Also, not using MultiOptimizer seems to prevent the error from occurring.

Describe the expected behavior.

I would expect it to run without errors, or at least to give an error that tells me what is wrong in a way that I can understand. I would appreciate some information on what could even cause an error like this.

Contributing.

  • Do you want to contribute a PR? (yes/no): no (I have no idea what's happening)

Standalone code to reproduce the issue.

This is unfortunately not possible, as I have no idea which part of my code causes this and I can't post my entire master's thesis here...

The error seems to have something to do with a numeric instability in my model. After rounding the output of one of my submodels, the error goes away, which confuses me even more. I have to round to 6 (or fewer) decimal digits for it to go away. Also, switching the dtype of my model to float16 instead of float32 triggers the error again (even with rounding).
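For context on why float16 can make a model numerically fragile, here is a minimal standard-library sketch of the format's narrow range. It does not use TensorFlow and is not taken from the code in this issue. The helper `to_float16` and the example values (`tiny_gradient`, a loss scale of 2**15) are assumptions chosen for illustration only. Whether this effect is what actually happens in the model described above is not established.

```python
import struct


def to_float16(x: float) -> float:
    """Round-trip a Python float through IEEE 754 half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


FLOAT16_MAX = 65504.0  # largest finite float16 value

# A value far below the smallest float16 subnormal (about 6e-8) becomes zero.
tiny_gradient = 1e-8  # illustrative value, not from the issue
underflowed = to_float16(tiny_gradient)

# Multiplying by a loss scale first keeps the same value representable.
loss_scale = 2.0 ** 15  # illustrative value, not from the issue
scaled = to_float16(tiny_gradient * loss_scale)

# A value above the float16 maximum cannot be packed at all; struct raises
# OverflowError where a float16 tensor would typically hold inf instead.
try:
    to_float16(FLOAT16_MAX * 2)
    overflowed = False
except OverflowError:
    overflowed = True

print(underflowed, scaled, overflowed)
```

Loss scaling trades one problem for the other: it lifts small values out of the underflow region, but scaled values can then exceed the float16 maximum, which is why a loss-scaling optimizer has to be able to detect non-finite gradients.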

Source code / logs.


Detected at node 'cond_2/update_0/AssignAddVariableOp' defined at (most recent call last):
    ...
      history = model.fit(
    File "/home/.conda/envs/msc-thesis/lib/python3.10/site-packages/wandb/integration/keras/keras.py", line 174, in new_v2
      return old_v2(*args, **kwargs)
    File "/home/.conda/envs/msc-thesis/lib/python3.10/site-packages/wandb/integration/keras/keras.py", line 174, in new_v2
      return old_v2(*args, **kwargs)
    File "/home/.conda/envs/msc-thesis/lib/python3.10/site-packages/wandb/integration/keras/keras.py", line 174, in new_v2
      return old_v2(*args, **kwargs)
    File "/home/.conda/envs/msc-thesis/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
      return fn(*args, **kwargs)
    File "/home/.conda/envs/msc-thesis/lib/python3.10/site-packages/keras/engine/training.py", line 1564, in fit
      tmp_logs = self.train_function(iterator)
    File "/home/.conda/envs/msc-thesis/lib/python3.10/site-packages/keras/engine/training.py", line 1160, in train_function
      return step_function(self, iterator)
    File "/home/.conda/envs/msc-thesis/lib/python3.10/site-packages/keras/engine/training.py", line 1146, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/.conda/envs/msc-thesis/lib/python3.10/site-packages/keras/mixed_precision/loss_scale_optimizer.py", line 861, in _apply_gradients_cross_replica
      maybe_apply_op = tf.__internal__.smart_cond.smart_cond(
    File "/home/.conda/envs/msc-thesis/lib/python3.10/site-packages/keras/mixed_precision/loss_scale_optimizer.py", line 821, in do_not_apply_fn
      return self._optimizer.iterations.assign_add(1, read_value=False)
Node: 'cond_2/update_0/AssignAddVariableOp'
Cannot update variable with shape [0] using a Tensor with shape [], shapes must be equal.
	 [[{{node cond_2/update_0/AssignAddVariableOp}}]] [Op:__inference_fn_with_cond_79222]
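
One possible reading of this traceback: the failing `assign_add` sits in `do_not_apply_fn`, the branch LossScaleOptimizer takes when it skips a step because the gradients are not finite, and the message says the wrapped optimizer's `iterations` variable has shape [0] while the increment is a scalar. The toy model below is pure Python, not Keras code. `ToyVariable` and `toy_apply_gradients` are hypothetical names, and the apply branch is reduced to a stub. It only illustrates why such a shape mismatch would stay hidden on normal steps and surface on skipped ones, which would be consistent with the link to numeric instability described above but is not a confirmed diagnosis.

```python
import math


class ToyVariable:
    """Hypothetical stand-in for a variable with a fixed shape."""

    def __init__(self, shape):
        self.shape = tuple(shape)
        self.value = 0

    def assign_add(self, delta, delta_shape=()):
        # Mimic the shape check reported in the log above.
        if self.shape != tuple(delta_shape):
            raise ValueError(
                f"Cannot update variable with shape {list(self.shape)} "
                f"using a Tensor with shape {list(delta_shape)}"
            )
        self.value += delta


def toy_apply_gradients(grads, iterations):
    """Apply or skip a step; only the skip branch touches `iterations` here."""
    if all(math.isfinite(g) for g in grads):
        # A real optimizer would update the weights (and its own counter) here.
        return "applied"
    # Skipped step: nothing is applied, only the counter is incremented.
    iterations.assign_add(1)
    return "skipped"


scalar_counter = ToyVariable(shape=())
empty_counter = ToyVariable(shape=(0,))

# Finite gradients: the badly shaped counter is never touched, so no error.
print(toy_apply_gradients([0.1, -0.2], empty_counter))

# Non-finite gradients with a scalar counter: the step is skipped cleanly.
print(toy_apply_gradients([float("inf")], scalar_counter))

# Non-finite gradients with a shape-(0,) counter: the skip branch fails.
try:
    toy_apply_gradients([float("inf")], empty_counter)
except ValueError as err:
    print(err)
```

If this reading is right, a useful check in the real code would be to inspect the shape of the `iterations` variable on the optimizer that ends up wrapped by LossScaleOptimizer.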

Farbdose avatar Nov 03 '22 17:11 Farbdose

@Farbdose, to expedite the troubleshooting process, could you please provide the complete code you are using? Thank you!

tilakrayal avatar Nov 04 '22 12:11 tilakrayal

@tilakrayal I'm really sorry but I'm unable to do that. It's simply too much code spread over too many files to just select the relevant part and post it here. If I were able to create a reproducible sample, I would already have done so 😓

Farbdose avatar Nov 05 '22 18:11 Farbdose

@Farbdose, without reproducible code, it would be difficult for us to debug the issue. To expedite the troubleshooting process, could you please provide a minimal code snippet that reproduces it? Thank you!

tilakrayal avatar Nov 09 '22 08:11 tilakrayal