DeepRec

Horovod couldn't get the right shapes for grads when using EmbeddingVariable

Open • fuhailin opened this issue 2 years ago • 1 comment

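Describe the current behavior: Calling compute_gradients on an optimizer wrapped in hvd.DistributedOptimizer fails when the graph uses an EmbeddingVariable. The first error raised is ValueError: Shapes (?, x) and (x,) must have the same rank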

Code to reproduce the issue:

import tensorflow as tf
from tensorflow.python.training import adam
import horovod.tensorflow as hvd

hvd.init()

var = tf.get_embedding_variable("var_0",
                                embedding_dim=3,
                                initializer=tf.ones_initializer(tf.float32),
                                partitioner=tf.fixed_size_partitioner(num_shards=4))


emb = tf.nn.embedding_lookup(var, tf.cast([0, 1, 2, 5, 6, 7], tf.int64))
fun = tf.multiply(emb, 2.0, name='multiply')
loss = tf.reduce_sum(fun, name='reduce_sum')
opt = adam.AdamOptimizer(0.1)
opt = hvd.DistributedOptimizer(opt)


g_v = opt.compute_gradients(loss)

train_op = opt.apply_gradients(g_v)

init = tf.global_variables_initializer()
bcast = hvd.broadcast_global_variables(0)

# Horovod: pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())

with tf.Session(config=config) as sess:
    sess.run([init])
    bcast.run()
    print(sess.run([emb, train_op, loss]))
    print(sess.run([emb, train_op, loss]))
    print(sess.run([emb, train_op, loss]))

Other info / logs:

2022-08-27 23:35:07.879853: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/tensor_shape.py", line 928, in merge_with
    self.assert_same_rank(other)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/tensor_shape.py", line 983, in assert_same_rank
    (self, other))
ValueError: Shapes (?, 3) and (3,) must have the same rank

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/indexed_slices.py", line 154, in _type_spec
    tensor_util.constant_value_as_shape(self._dense_shape))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/tensor_shape.py", line 934, in merge_with
    raise ValueError("Shapes %s and %s are not compatible" % (self, other))
ValueError: Shapes (?, 3) and (3,) are not compatible

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "embedding_variable_horovod.py", line 34, in <module>
    g_v = opt.compute_gradients(loss)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 541, in compute_gradients
    avg_grads = self._allreduce_grads(grads, vars)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 462, in allreduce_grads
    for grad in grads]
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 462, in <listcomp>
    for grad in grads]
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 298, in _allreduce_cond
    allreduce_fn, id_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1224, in cond
    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1077, in BuildCondBranch
    self._BuildCondTensor, original_result, expand_composites=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/nest.py", line 532, in map_structure
    flat_structure = [flatten(s, expand_composites) for s in structure]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/nest.py", line 532, in <listcomp>
    flat_structure = [flatten(s, expand_composites) for s in structure]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/nest.py", line 263, in flatten
    return _pywrap_tensorflow.Flatten(structure, expand_composites)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 2647, in Flatten
    return _pywrap_tensorflow_internal.Flatten(nested, expand_composites)
SystemError: <built-in function Flatten> returned a result with an error set
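
The first ValueError in the log comes from merging two static shapes of different rank: (?, 3) and (3,). Going by the indexed_slices.py frame, this appears to happen while the dense_shape of the gradient's IndexedSlices is checked against the shape of its values, though that reading is an inference from the traceback rather than something confirmed in this thread. The sketch below is a simplified pure-Python model of such a rank check, not TensorFlow's actual implementation. The names merge_shapes and format_shape are hypothetical and exist only for illustration; None stands for an unknown dimension (printed as ?).

```python
def format_shape(shape):
    """Render a shape tuple the way the log does, with None shown as '?'."""
    dims = ["?" if d is None else str(d) for d in shape]
    if len(dims) == 1:
        return "(%s,)" % dims[0]
    return "(%s)" % ", ".join(dims)


def merge_shapes(a, b):
    """Merge two static shapes; hypothetical stand-in for a shape merge.

    Shapes are tuples of ints, with None for an unknown dimension.
    Raises ValueError if the ranks differ or a known dimension conflicts.
    """
    if len(a) != len(b):
        raise ValueError(
            "Shapes %s and %s must have the same rank"
            % (format_shape(a), format_shape(b)))
    merged = []
    for x, y in zip(a, b):
        if x is None:
            merged.append(y)          # take whatever the other side knows
        elif y is None or x == y:
            merged.append(x)
        else:
            raise ValueError(
                "Shapes %s and %s are not compatible"
                % (format_shape(a), format_shape(b)))
    return tuple(merged)


if __name__ == "__main__":
    # Same rank: the unknown dimension is filled in from the other shape.
    print(merge_shapes((None, 3), (5, 3)))
    # Different rank, as in the log: rank 2 against rank 1.
    try:
        merge_shapes((None, 3), (3,))
    except ValueError as err:
        print(err)
```

In this model, the merge only succeeds when both shapes have the same number of dimensions, which is why a rank-1 shape such as (3,) cannot be reconciled with the rank-2 shape (?, 3).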

fuhailin • Aug 27 '22 23:08

SOK + Horovod is now ready. You can use SOK together with Horovod; SOK handles gradient synchronization for EmbeddingVariable.

liutongxuan • Apr 12 '23 05:04