No OpKernel was registered to support Op 'PreprocessingForward' Error for Multi Machine, Multi GPU

Open wangcaihua opened this issue 1 year ago • 0 comments

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 20.04): Linux Ubuntu 20.04, Official GPU Image 2304
  • DeepRec version or commit id: deeprec2302
  • Python version: 3.8.10
  • Bazel version (if compiling from source): not compiling from source
  • GCC/Compiler version (if compiling from source): gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
  • CUDA/cuDNN version: 11.6

Describe the current behavior

[1,9]:Traceback (most recent call last):
[1,9]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
[1,9]:    return fn(*args)
[1,9]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1348, in _run_fn
[1,9]:    self._extend_graph()
[1,9]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1388, in _extend_graph
[1,9]:    tf_session.ExtendSession(self._session)
[1,9]:tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'PreprocessingForward' used by {{node input_layer/input_layer/group_embedding_lookup/PreprocessingForward/PreprocessingForward}}with these attrs: [rank=9, id_in_local_rank=0, num_ranks=16, num_gpus=16, Toffsets=DT_INT64, Tindices=DT_INT64, num_lookups=26, combiners=["mean", "mean", "mean", "mean", "mean", ..., "mean", "mean", "mean", "mean", "mean"], dimensions=[16, 16, 16, 16, 16, ..., 16, 16, 16, 16, 16], shard=[-1, -1, -1, -1, -1, ..., -1, -1, -1, -1, -1]]
[1,9]:Registered devices: [CPU, XLA_CPU]
[1,9]:Registered kernels:
[1,9]:  device='GPU'; Tindices in [DT_INT32]; Toffsets in [DT_INT32]
[1,9]:  device='GPU'; Tindices in [DT_INT32]; Toffsets in [DT_INT64]
[1,9]:  device='GPU'; Tindices in [DT_INT64]; Toffsets in [DT_INT32]
[1,9]:  device='GPU'; Tindices in [DT_INT64]; Toffsets in [DT_INT64]
[1,9]:
[1,9]:  [[input_layer/input_layer/group_embedding_lookup/PreprocessingForward/PreprocessingForward]]
[1,9]:
[1,9]:During handling of the above exception, another exception occurred:
[1,9]:
[1,9]:Traceback (most recent call last):
[1,9]:  File "train.py", line 887, in <module>
[1,9]:    main()
[1,9]:  File "train.py", line 642, in main
[1,9]:    train(sess_config, hooks, model, train_init_op, train_steps,
[1,9]:  File "train.py", line 505, in train
[1,9]:    with tf.train.MonitoredTrainingSession(
[1,9]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 655, in MonitoredTrainingSession
[1,9]:    return MonitoredSession(
[1,9]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1085, in __init__
[1,9]:    super(MonitoredSession, self).__init__(
[1,9]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 800, in __init__
[1,9]:    self._sess = _RecoverableSession(self._coordinated_creator)
[1,9]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1282, in __init__
[1,9]:    _WrappedSession.__init__(self, self._create_session())
[1,9]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 1287, in _create_session
[1,9]:    return self._sess_creator.create_session()
[1,9]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 953, in create_session
[1,9]:    self.tf_sess = self._session_creator.create_session()
[1,9]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/monitored_session.py", line 713, in create_session
[1,9]:    return self._get_session_manager().prepare_session(
[1,9]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/training/session_manager.py", line 306, in prepare_session
[1,9]:    sess.run(init_op, feed_dict=init_feed_dict)
[1,9]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 955, in run
[1,9]:    result = self._run(None, fetches, feed_dict, options_ptr,
[1,9]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1179, in _run
[1,9]:    results = self._do_run(handle, final_targets, final_fetches,
[1,9]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1358, in _do_run
[1,9]:    return self._do_call(_run_fn, feeds, fetches, targets, options,
[1,9]:  File "/usr/local/lib/python3.8/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
[1,9]:    raise type(e)(node_def, op, message)
[1,9]:tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'PreprocessingForward' used by node input_layer/input_layer/group_embedding_lookup/PreprocessingForward/PreprocessingForward (defined at /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/framework/ops.py:1748) with these attrs: [rank=9, id_in_local_rank=0, num_ranks=16, num_gpus=16, Toffsets=DT_INT64, Tindices=DT_INT64, num_lookups=26, combiners=["mean", "mean", "mean", "mean", "mean", ..., "mean", "mean", "mean", "mean", "mean"], dimensions=[16, 16, 16, 16, 16, ..., 16, 16, 16, 16, 16], shard=[-1, -1, -1, -1, -1, ..., -1, -1, -1, -1, -1]]
[1,9]:Registered devices: [CPU, XLA_CPU]
[1,9]:Registered kernels:
[1,9]:  device='GPU'; Tindices in [DT_INT32]; Toffsets in [DT_INT32]
[1,9]:  device='GPU'; Tindices in [DT_INT32]; Toffsets in [DT_INT64]
[1,9]:  device='GPU'; Tindices in [DT_INT64]; Toffsets in [DT_INT32]
[1,9]:  device='GPU'; Tindices in [DT_INT64]; Toffsets in [DT_INT64]
[1,9]:
[1,9]:  [[input_layer/input_layer/group_embedding_lookup/PreprocessingForward/PreprocessingForward]]
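
In the message above, the registered devices on rank 9 are only CPU and XLA_CPU, while every registered kernel for 'PreprocessingForward' is GPU-only, which suggests (but does not prove) that this process did not see a GPU. As a rough aid for checking the same mismatch on other ranks, here is a minimal sketch that pulls the two lists out of an error string and reports kernel device types that are not registered. The helper name `missing_kernel_devices` and the `sample` string are hypothetical and exist only for illustration; the sketch uses only the Python standard library and does not call any DeepRec or TensorFlow API.

```python
import re


def missing_kernel_devices(error_text):
    """Return kernel device types that do not appear in 'Registered devices'."""
    # Device types the process reported, e.g. "Registered devices: [CPU, XLA_CPU]".
    devices_match = re.search(r"Registered devices: \[([^\]]*)\]", error_text)
    registered = set()
    if devices_match:
        registered = {
            name.strip()
            for name in devices_match.group(1).split(",")
            if name.strip()
        }
    # Device types the op has kernels for, e.g. "device='GPU'; Tindices in [...]".
    kernel_devices = set(re.findall(r"device='([^']+)'", error_text))
    return sorted(kernel_devices - registered)


# Hypothetical sample, shortened from the error text in this report.
sample = (
    "Registered devices: [CPU, XLA_CPU] "
    "Registered kernels: "
    "device='GPU'; Tindices in [DT_INT32]; Toffsets in [DT_INT32] "
    "device='GPU'; Tindices in [DT_INT64]; Toffsets in [DT_INT64]"
)
print(missing_kernel_devices(sample))
```

A non-empty result means the op has kernels only for device types that this process did not register.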

Describe the expected behavior

Code to reproduce the issue

Provide a reproducible test case that is the bare minimum necessary to generate the problem.

Other info / logs

Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.

wangcaihua, Apr 24 '23 02:04