
Dict-of-tensors custom Metric using MirroredStrategy - tf.identity error?

Open mwalmsley opened this issue 3 years ago • 6 comments

System information.

  • Have I written custom code (as opposed to using a stock example script provided in Keras): I used a stock example script with custom code added (see minimal example)
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): centOS
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 2.10 (latest)
  • Python version: 3.7
  • Bazel version (if compiling from source): NA
  • GPU model and memory: 2xA100 in practice, NA in Colab
  • Exact command to reproduce: run all on Colab

Describe the problem.

A tf.keras.metrics.Metric subclass whose result() method returns a dict of tensors fails inside a MirroredStrategy() scope (i.e. on multiple GPUs).

Describe the current behavior.

When result() returns a dict of tensors, the following error is raised:

   .../keras/utils/metrics_utils.py", line 177, in merge_fn_wrapper  **
        return tf.identity(result)
   TypeError: Expected any non-tensor type, but got a tensor instead.

Using the minimal failing example on Colab, a slightly more descriptive error is raised:


        return tf.identity(result)

    TypeError: Failed to convert elements of {'some_value': SyncOnReadVariable:{
      0: <tf.Variable 'some_value:0' shape=() dtype=float32>
    }, 'some_other_value': SyncOnReadVariable:{
      0: <tf.Variable 'some_other_value:0' shape=() dtype=float32>
    }} to Tensor. Consider casting elements to a supported type. See https://www.tensorflow.org/api_docs/python/tf/dtypes for supported TF dtypes.

From inspecting the relevant source here, it seems like:

  • tf.internal.distribute.strategy_supports_no_merge_call() is False, so the elif branch on line 142, which is designed to handle dict outputs, is skipped.
  • the alternative block (line 158) calls tf.identity(result), which is not compatible with a dict (a dict cannot be converted to a tensor), raising the errors above.
  • there is an outstanding TODO (psv) on line 159 to check that alternative block under different distribution strategies.
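
The incompatibility described in the second bullet can be seen in isolation, without any distribution strategy. The snippet below is a minimal sketch, not the Keras source: the `result` dict is a hypothetical stand-in for a metric's output (key names borrowed from the error message above), and the exact exception type may differ between eager and graph execution, so both are caught.

```python
import tensorflow as tf

# Hypothetical stand-in for the dict a metric's result() might return.
result = {
    "some_value": tf.constant(1.0),
    "some_other_value": tf.constant(2.0),
}

# tf.identity expects a single tensor-like value, so a dict is rejected.
try:
    tf.identity(result)
    identity_failed = False
except (TypeError, ValueError) as err:
    identity_failed = True
    print(f"tf.identity on a dict failed with {type(err).__name__}")

# Mapping tf.identity over the structure handles each leaf on its own.
copied = tf.nest.map_structure(tf.identity, result)
print({key: float(value) for key, value in copied.items()})
```

Whether applying tf.nest.map_structure inside the merge function would be a correct fix under every distribution strategy is not established here; it only shows why the single-tensor call cannot accept a dict.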

Identical code:

  • Works correctly on one GPU
  • Works correctly on multiple GPUs when returning a single tensor (i.e. return some_dict[some_key] works, but return some_dict fails)

The Metric is defined within the MirroredStrategy() scope, and tf.print shows the dict is well-formed (see also the minimal example in Colab).
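
For readers without access to the Colab, a metric of the kind described might look like the sketch below. This is a hypothetical reconstruction, not the notebook's code: the class name `DictMetric` and the update logic are assumptions, and only the key names come from the error message above. It is run here outside any strategy, which is the case reported to work.

```python
import tensorflow as tf


class DictMetric(tf.keras.metrics.Metric):
    """Hypothetical metric whose result() returns a dict of scalars."""

    def __init__(self, name="dict_metric", **kwargs):
        super().__init__(name=name, **kwargs)
        self.some_value = self.add_weight(
            name="some_value", initializer="zeros")
        self.some_other_value = self.add_weight(
            name="some_other_value", initializer="zeros")

    def update_state(self, y_true, y_pred, sample_weight=None):
        # Assumed update rule: accumulate the sums of labels and predictions.
        self.some_value.assign_add(
            tf.reduce_sum(tf.cast(y_true, tf.float32)))
        self.some_other_value.assign_add(
            tf.reduce_sum(tf.cast(y_pred, tf.float32)))

    def result(self):
        # Returning the dict itself is the pattern the issue is about.
        return {
            "some_value": self.some_value,
            "some_other_value": self.some_other_value,
        }


# No distribution strategy is active here.
metric = DictMetric()
metric.update_state([1.0, 2.0], [0.5, 0.5])
values = {
    key: float(tf.convert_to_tensor(value))
    for key, value in metric.result().items()
}
print(values)
```

According to the report, constructing and using such a metric inside `with tf.distribute.MirroredStrategy().scope():` on TensorFlow 2.10 is what triggers the TypeError, while returning a single entry of the dict does not.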

Describe the expected behavior.

Ideally, a tf.keras.metrics.Metric should support a dict-of-tensors result both with and without a MirroredStrategy() scope. Otherwise, the docs should be updated to state that a dict-of-tensors return value is only supported outside MirroredStrategy() and similar distribution contexts.

Contributing.

  • Do you want to contribute a PR? (yes/no): I am happy to PR this but would need advice on interpreting the current code
  • If yes, please read this page for instructions
  • Briefly describe your candidate solution(if contributing):

Standalone code to reproduce the issue.

Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.

https://colab.research.google.com/drive/1OdpTq6BSiV1JfFvr-Ld5iIWiKX5u_VPI?usp=sharing

Source code / logs.

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached. Try to provide a reproducible test case that is the bare minimum necessary to generate the problem.

Cross-posted on https://discuss.tensorflow.org/t/mirroredstrategy-dict-of-tensors-metric-cryptic-error/12839

Thank you for your time and your work building Keras!

mwalmsley, Nov 05 '22 14:11

@mwalmsley, I was facing a different error while executing the mentioned code. Kindly find the gist and let us know if you are facing the same error. Thank you!

tilakrayal, Nov 07 '22 15:11

Hi @tilakrayal ,

Thanks for checking. I'm fairly sure the error you see is because you added pip install tf-nightly to the top of the script, which seems to cause an environment/CUDA issue when running the model (which was copy-pasted exactly from the tutorial).

    File "/usr/local/lib/python3.7/dist-packages/keras/backend.py", line 5369, in relu
      x = tf.nn.relu(x)
    Node: 'sequential/conv2d/Relu'
    2 root error(s) found.
      (0) UNIMPLEMENTED: DNN library is not found.
         [[{{node sequential/conv2d/Relu}}]]

I'm pretty sure x = tf.nn.relu(x) is standard code, and the error relates to an incompatible environment after that install.

If I delete the environment change, your gist gives the error for which I am raising this issue.

mwalmsley, Nov 07 '22 17:11

Was able to replicate by removing the pip install in this gist (link)

jbischof, Nov 09 '22 21:11

I have been having this problem for quite some time and I can't find a fix. Are there any updates?

YannPourcenoux, Jan 26 '23 12:01

Sorry but currently the team doesn't have enough bandwidth for a deeper look into this - we'd appreciate contributions in the meantime.

rchao, Jan 26 '23 22:01

Also having this issue for quite some time, are there any updates?

michaelvay, Oct 24 '23 11:10