tf-keras
New optimizers fail to load CUDA installed through conda
System information.
- Have I written custom code (as opposed to using a stock example script provided in Keras): no
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04 (WSL)
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): 2.11
- Python version: 3.9
- Bazel version (if compiling from source): N/A
- GPU model and memory: RTX 2080 Ti
- Exact command to reproduce:
- Create a new environment, following the official installation instructions from here https://www.tensorflow.org/install/pip#linux:
conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/
pip install tensorflow
- Run the beginner MNIST tutorial (or any other tutorial that calls fit) from here https://keras.io/examples/vision/mnist_convnet/
Describe the problem.
An error is raised:
libdevice not found at ./libdevice.10.bc
Note that if you switch to the legacy optimizers, by changing this line
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
to this
model.compile(loss="categorical_crossentropy", optimizer=keras.optimizers.legacy.Adam(), metrics=["accuracy"])
then the example runs successfully.
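The error message refers to a relative path (./libdevice.10.bc), which suggests the file was searched for in the current directory rather than in the conda environment. A quick way to check whether the conda-installed CUDA toolkit ships the file at all is to search the environment prefix for it. The following is only a diagnostic sketch: the helper name find_libdevice is hypothetical, and it assumes the environment is active so that CONDA_PREFIX is set.

```python
import os
from pathlib import Path
from typing import List, Optional


def find_libdevice(prefix: Optional[str] = None) -> List[Path]:
    """Return every libdevice*.bc file found under `prefix`.

    Hypothetical helper for diagnosis only. If `prefix` is omitted,
    the active conda environment ($CONDA_PREFIX) is searched.
    """
    root = prefix or os.environ.get("CONDA_PREFIX")
    if not root:
        # No prefix given and no conda environment active.
        return []
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    # Recursive search; sorted so the result order is stable.
    return sorted(root_path.rglob("libdevice*.bc"))
```

Calling find_libdevice() with no arguments inside the activated environment lists any copies of the file. An empty list would mean the toolkit package did not install it; a non-empty list would mean the file exists but is not where it was looked for.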
Describe the current behavior.
Calling fit with the default (new) optimizer raises the libdevice error shown above.
Describe the expected behavior.
The example should run without error, as it does when using the legacy optimizers.
- Do you want to contribute a PR? (yes/no): no
Standalone code to reproduce the issue.
https://keras.io/examples/vision/mnist_convnet/
Source code / logs.
Full stack trace of the error:
File ".../tmp.py", line 47, in <module>
model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs, validation_split=0.1)
File "/home/drasmuss/mambaforge/envs/tmp2/lib/python3.9/site-packages/keras/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/drasmuss/mambaforge/envs/tmp2/lib/python3.9/site-packages/keras/engine/training.py", line 1650, in fit
tmp_logs = self.train_function(iterator)
File "/home/drasmuss/mambaforge/envs/tmp2/lib/python3.9/site-packages/keras/engine/training.py", line 1249, in train_function
return step_function(self, iterator)
File "/home/drasmuss/mambaforge/envs/tmp2/lib/python3.9/site-packages/keras/engine/training.py", line 1233, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/home/drasmuss/mambaforge/envs/tmp2/lib/python3.9/site-packages/keras/engine/training.py", line 1222, in run_step
outputs = model.train_step(data)
File "/home/drasmuss/mambaforge/envs/tmp2/lib/python3.9/site-packages/keras/engine/training.py", line 1027, in train_step
self.optimizer.minimize(loss, self.trainable_variables, tape=tape)
File "/home/drasmuss/mambaforge/envs/tmp2/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 527, in minimize
self.apply_gradients(grads_and_vars)
File "/home/drasmuss/mambaforge/envs/tmp2/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1140, in apply_gradients
return super().apply_gradients(grads_and_vars, name=name)
File "/home/drasmuss/mambaforge/envs/tmp2/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 634, in apply_gradients
iteration = self._internal_apply_gradients(grads_and_vars)
File "/home/drasmuss/mambaforge/envs/tmp2/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1166, in _internal_apply_gradients
return tf.__internal__.distribute.interim.maybe_merge_call(
File "/home/drasmuss/mambaforge/envs/tmp2/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1216, in _distributed_apply_gradients_fn
distribution.extended.update(
File "/home/drasmuss/mambaforge/envs/tmp2/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py", line 1211, in apply_grad_to_update_var
return self._update_step_xla(grad, var, id(self._var_key(var)))
Node: 'StatefulPartitionedCall_4'
libdevice not found at ./libdevice.10.bc
[[{{node StatefulPartitionedCall_4}}]] [Op:__inference_train_function_1026]
Likely related to:
- https://github.com/tensorflow/tensorflow/issues/56927
- https://github.com/tensorflow/tensorflow/issues/59013
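The last frame of the trace is _update_step_xla, so the failure appears to happen while XLA compiles the optimizer update step. One possible workaround, not verified on this setup, is to tell XLA where the CUDA files live through the XLA_FLAGS environment variable and its --xla_gpu_cuda_data_dir flag. The sketch below only builds and sets that variable; the helper names are hypothetical, and it assumes that XLA expects the file at nvvm/libdevice/libdevice.10.bc under the given directory, so the file may need to be copied or linked into that layout first.

```python
import os

_FLAG = "--xla_gpu_cuda_data_dir="


def with_cuda_data_dir(cuda_dir: str, existing: str = "") -> str:
    """Return an XLA_FLAGS value that points XLA at `cuda_dir`.

    Hypothetical helper. Other flags already present in `existing`
    are preserved; an earlier value of the same flag is replaced.
    """
    kept = [flag for flag in existing.split() if not flag.startswith(_FLAG)]
    return " ".join(kept + [_FLAG + cuda_dir])


def configure_xla(cuda_dir: str) -> None:
    """Set XLA_FLAGS for the current process.

    To be safe, call this before TensorFlow is imported.
    """
    os.environ["XLA_FLAGS"] = with_cuda_data_dir(
        cuda_dir, os.environ.get("XLA_FLAGS", "")
    )
```

For example, configure_xla(os.path.join(os.environ["CONDA_PREFIX"], "lib")) at the top of the script, before importing tensorflow, would apply this to the conda environment used above, assuming the nvvm/libdevice layout exists under that directory.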