tf-keras
Dict-of-tensors custom Metric using MirroredStrategy - tf.identity error?
System information.
- Have I written custom code (as opposed to using a stock example script provided in Keras): I used a stock example script with custom code added (see minimal example)
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): 2.10 (latest)
- Python version: 3.7
- Bazel version (if compiling from source): NA
- GPU model and memory: 2xA100 in practice, NA in Colab
- Exact command to reproduce: run all on Colab
Describe the problem.
A tf.keras.metrics.Metric object whose result() method returns a dict of tensors fails when used inside a MirroredStrategy() context (i.e. on multiple GPUs).
Describe the current behavior.
When result() returns a dict of tensors, an error is raised:
.../keras/utils/metrics_utils.py", line 177, in merge_fn_wrapper **
return tf.identity(result)
**TypeError: Expected any non-tensor type, but got a tensor instead.**
Running the minimal failing example on Colab raises a slightly more descriptive error:
return tf.identity(result)
TypeError: Failed to convert elements of {'some_value': SyncOnReadVariable:{
0: <tf.Variable 'some_value:0' shape=() dtype=float32>
}, 'some_other_value': SyncOnReadVariable:{
0: <tf.Variable 'some_other_value:0' shape=() dtype=float32>
}} to Tensor. Consider casting elements to a supported type. See https://www.tensorflow.org/api_docs/python/tf/dtypes for supported TF dtypes.
From inspecting the relevant source here, it seems like:
- tf.__internal__.distribute.strategy_supports_no_merge_call() is False, so the elif on line 142, which is designed to handle dict outputs, is skipped.
- the alternative block (line 158) calls tf.identity(result), which is not compatible with a dict (a dict cannot be converted to a tensor), raising the errors above.
- there is an outstanding TODO (psv) to check said alternative block under different distribution strategies (line 159).
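To make the suspected control flow concrete, here is a minimal pure-Python sketch of the two branches as described in the list above. It is a simplified model and not the Keras source: `identity` is a stand-in for tf.identity, and `wrap_result` and its `supports_no_merge_call` flag are hypothetical names used only for illustration.

```python
def identity(value):
    """Stand-in for tf.identity: accepts a single value, rejects a dict."""
    if isinstance(value, dict):
        raise TypeError("Failed to convert elements of dict to Tensor.")
    return value


def wrap_result(result, supports_no_merge_call):
    """Simplified model of the two branches described above (not Keras code)."""
    if supports_no_merge_call and isinstance(result, dict):
        # Dict-aware branch: identity is applied to each value separately.
        return {key: identity(value) for key, value in result.items()}
    # Fallback branch: identity is applied to the whole result at once,
    # which cannot work when the result is a dict.
    return identity(result)


if __name__ == "__main__":
    print(wrap_result(1.5, supports_no_merge_call=False))
    print(wrap_result({"some_value": 1.0}, supports_no_merge_call=True))
    try:
        wrap_result({"some_value": 1.0}, supports_no_merge_call=False)
    except TypeError as err:
        print("TypeError:", err)
```

In this model a single value passes through either branch, while a dict only survives when the dict-aware branch is taken, which matches the behavior reported in this issue.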
Identical code:
- Works correctly on one GPU
- Works correctly on multiple GPUs when returning a single tensor (i.e. return some_dict[some_key] works but return some_dict fails)
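Given the observation above that returning a single tensor works, one possible workaround is to register one metric per key, each returning a single tensor. The sketch below is an assumption-laden illustration rather than a confirmed fix: `SingleKeyMetric` and `compute_values` are hypothetical names, the computed quantities are placeholders, and the code has only been reasoned about in eager mode on a single device, not verified under MirroredStrategy().

```python
import tensorflow as tf


def compute_values(y_true, y_pred):
    """Hypothetical helper: builds the dict the original metric would return."""
    diff = tf.cast(y_true, tf.float32) - tf.cast(y_pred, tf.float32)
    return {
        "some_value": tf.reduce_sum(tf.abs(diff)),
        "some_other_value": tf.reduce_sum(tf.square(diff)),
    }


class SingleKeyMetric(tf.keras.metrics.Metric):
    """Hypothetical wrapper that tracks one key and returns a single tensor."""

    def __init__(self, key, **kwargs):
        super().__init__(name=key, **kwargs)
        self.key = key
        # Running total for this key only.
        self.total = self.add_weight(name=f"{key}_total", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        # sample_weight is accepted for API compatibility but ignored here.
        values = compute_values(y_true, y_pred)
        self.total.assign_add(values[self.key])

    def result(self):
        # A single tensor instead of a dict of tensors.
        return tf.identity(self.total)


# One metric per key replaces the single dict-returning metric.
metrics = [SingleKeyMetric("some_value"), SingleKeyMetric("some_other_value")]
```

The list can then be passed as `metrics=metrics` to `model.compile(...)`. The cost of this design is that `compute_values` runs once per key, so it is only a stopgap if the shared computation is expensive.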
The Metric is defined within the MirroredStrategy() context manager, and tf.print shows the dict is well-formed (see also the minimal example in Colab).
Describe the expected behavior.
Ideally, a tf.keras.metrics.Metric object should support a dict-of-tensors result both with and without a MirroredStrategy() context. Failing that, the docs should be updated to state that a dict-of-tensors return value is only supported outside MirroredStrategy() and similar contexts.
- Do you want to contribute a PR? (yes/no): I am happy to PR this but would need advice on interpreting the current code
- If yes, please read this page for instructions
- Briefly describe your candidate solution(if contributing):
Standalone code to reproduce the issue.
Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.
https://colab.research.google.com/drive/1OdpTq6BSiV1JfFvr-Ld5iIWiKX5u_VPI?usp=sharing
Source code / logs.
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached. Try to provide a reproducible test case that is the bare minimum necessary to generate the problem.
Cross-posted on https://discuss.tensorflow.org/t/mirroredstrategy-dict-of-tensors-metric-cryptic-error/12839
Thank you for your time and your work building Keras!
@mwalmsley, I was facing a different error while executing the mentioned code. Kindly find the gist and let us know if you are facing the same error. Thank you!
Hi @tilakrayal ,
Thanks for checking. I'm fairly sure the error you see is because you added pip install tf-nightly to the top of the script, which seems to cause an environment/CUDA issue when running the model (which was copy-pasted exactly from the tutorial).
File "/usr/local/lib/python3.7/dist-packages/keras/backend.py", line 5369, in relu
x = tf.nn.relu(x)
Node: 'sequential/conv2d/Relu'
2 root error(s) found.
(0) UNIMPLEMENTED: DNN library is not found.
[[{{node sequential/conv2d/Relu}}]]
I'm pretty sure x = tf.nn.relu(x) is standard code, and the error relates to an incompatible environment after that install.
If I delete the environment change, your gist gives the error for which I am raising this issue.
Was able to replicate by removing the pip install in this gist (link)
I have been having this problem for quite some time and I can't find a fix, any updates?
Sorry but currently the team doesn't have enough bandwidth for a deeper look into this - we'd appreciate contributions in the meantime.
Also having this issue for quite some time, are there any updates?