DeepRec
Horovod couldn't get the right shapes for grads when using EmbeddingVariable
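Describe the current behavior
Calling compute_gradients on an hvd.DistributedOptimizer fails when the model uses an EmbeddingVariable: ValueError: Shapes (?, x) and (x,) must have the same rank (x is the embedding_dim, 3 in the example below).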
Code to reproduce the issue
import tensorflow as tf
from tensorflow.python.training import adam
import horovod.tensorflow as hvd
hvd.init()
var = tf.get_embedding_variable("var_0",
                                embedding_dim=3,
                                initializer=tf.ones_initializer(tf.float32),
                                partitioner=tf.fixed_size_partitioner(num_shards=4))
emb = tf.nn.embedding_lookup(var, tf.cast([0, 1, 2, 5, 6, 7], tf.int64))
fun = tf.multiply(emb, 2.0, name='multiply')
loss = tf.reduce_sum(fun, name='reduce_sum')
opt = adam.AdamOptimizer(0.1)
opt = hvd.DistributedOptimizer(opt)
g_v = opt.compute_gradients(loss)
train_op = opt.apply_gradients(g_v)
init = tf.global_variables_initializer()
bcast = hvd.broadcast_global_variables(0)
# Horovod: pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
with tf.Session(config=config) as sess:
    sess.run([init])
    bcast.run()
    print(sess.run([emb, train_op, loss]))
    print(sess.run([emb, train_op, loss]))
    print(sess.run([emb, train_op, loss]))
Other info / logs
2022-08-27 23:35:07.879853: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/tensor_shape.py", line 928, in merge_with
    self.assert_same_rank(other)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/tensor_shape.py", line 983, in assert_same_rank
    (self, other))
ValueError: Shapes (?, 3) and (3,) must have the same rank
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/indexed_slices.py", line 154, in _type_spec
    tensor_util.constant_value_as_shape(self._dense_shape))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/tensor_shape.py", line 934, in merge_with
    raise ValueError("Shapes %s and %s are not compatible" % (self, other))
ValueError: Shapes (?, 3) and (3,) are not compatible
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "embedding_variable_horovod.py", line 34, in <module>
    g_v = opt.compute_gradients(loss)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 541, in compute_gradients
    avg_grads = self._allreduce_grads(grads, vars)
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 462, in allreduce_grads
    for grad in grads]
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 462, in <listcomp>
    for grad in grads]
  File "/usr/local/lib/python3.6/dist-packages/horovod/tensorflow/__init__.py", line 298, in _allreduce_cond
    allreduce_fn, id_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1224, in cond
    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1077, in BuildCondBranch
    self._BuildCondTensor, original_result, expand_composites=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/nest.py", line 532, in map_structure
    flat_structure = [flatten(s, expand_composites) for s in structure]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/nest.py", line 532, in <listcomp>
    flat_structure = [flatten(s, expand_composites) for s in structure]
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/nest.py", line 263, in flatten
    return _pywrap_tensorflow.Flatten(structure, expand_composites)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 2647, in Flatten
    return _pywrap_tensorflow_internal.Flatten(nested, expand_composites)
SystemError: <built-in function Flatten> returned a result with an error set
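For readers following the traceback: the failure happens in _type_spec, where the static shape implied by the gradient values, (?, 3), is merged with the gradient's dense_shape, which here appears to be (3,). One plausible reading (an assumption, not confirmed in this thread) is that the EmbeddingVariable gradient reports only embedding_dim in dense_shape because the variable has no fixed first dimension. The sketch below reproduces that rank check in plain Python so it runs without TensorFlow or Horovod. The helpers merge_shapes and indexed_slices_shape are hypothetical names written for illustration; they only approximate TensorShape.merge_with and are not the library code.

```python
from typing import Optional, Tuple

# None stands for an unknown dimension (printed as "?" by TensorFlow).
Shape = Tuple[Optional[int], ...]


def merge_shapes(a: Shape, b: Shape) -> Shape:
    """Merge two partially known shapes (simplified stand-in for merge_with)."""
    if len(a) != len(b):
        raise ValueError("Shapes %s and %s must have the same rank" % (a, b))
    merged = []
    for x, y in zip(a, b):
        if x is not None and y is not None and x != y:
            raise ValueError("Shapes %s and %s are not compatible" % (a, b))
        merged.append(x if x is not None else y)
    return tuple(merged)


def indexed_slices_shape(values_shape: Shape, dense_shape: Shape) -> Shape:
    """Combine the shape implied by the values with the declared dense_shape."""
    # The values only fix the trailing dimensions; the row count is unknown.
    implied = (None,) + tuple(values_shape[1:])
    return merge_shapes(implied, dense_shape)


# Regular [vocab, 3] variable: dense_shape has rank 2, so the merge succeeds.
ok = indexed_slices_shape((None, 3), (100, 3))

# Assumed EmbeddingVariable case: dense_shape carries only embedding_dim.
try:
    indexed_slices_shape((None, 3), (3,))
    error = None
except ValueError as exc:
    error = str(exc)

print(ok)
print(error)
```

If this reading is right, any code path that flattens the gradient as a composite tensor (as tf.cond does inside Horovod's _allreduce_cond) would hit the same check, independent of the optimizer used.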
SOK + Horovod is now supported, so you can use that combination instead. SOK handles gradient synchronization for EmbeddingVariable.