Fail to run demo: movielens-1m-keras-with-horovod

Open W-O-W opened this issue 1 year ago • 2 comments

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Debian GNU/Linux 10 (buster)
  • TensorFlow version and how it was installed (source or binary): 2.15.1, installed by pip
  • TensorFlow-Recommenders-Addons version and how it was installed (source or binary): 0.7.2, installed by pip
  • Python version: Python 3.12.4, OpenMPI 4.1.6, Horovod 0.28.1
  • Is GPU used? (yes/no): no

Describe the bug

The demo movielens-1m-keras-with-horovod fails to run:

[1,1]:2024-09-03 17:36:49.479358: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at xla_ops.cc:791 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
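To make the shape rule in the error message concrete, here is a small sketch in plain Python (no TensorFlow needed) that applies the rule to the values from the log. The helper name `expected_updates_shape` is hypothetical and exists only for illustration; it is not part of TensorFlow or TFRA.

```python
def expected_updates_shape(indices_shape, buffer_shape, num_index_dims, batch_dim):
    """Shape that `updates` must have according to the rule quoted in the error:
    indices.shape[:batch_dim] + buffer_shape[num_index_dims:].
    Hypothetical helper for illustration only."""
    return list(indices_shape[:batch_dim]) + list(buffer_shape[num_index_dims:])


# Values copied from the error message.
indices_shape = [2, 1]
buffer_shape = [1, 32]
updates_shape = [1, 32]

required = expected_updates_shape(
    indices_shape, buffer_shape, num_index_dims=1, batch_dim=1
)

print("required updates.shape:", required)       # [2, 32]
print("actual updates.shape:  ", updates_shape)  # [1, 32]
print("shapes match:", required == updates_shape)
```

Under this rule the op expects two update rows (one per index) but receives only one, which is why the ScatterNd node is rejected during tf2xla conversion.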

Code to reproduce the issue

I removed the LayerNormalization op, then executed this command:

horovodrun -np 2 python movielens-1m-keras-with-horovod.py --mode="train" --model_dir="./model_dir" --export_dir="./export_dir" --steps_per_epoch=${1:-20000} --shuffle=${2:-True}

Other info / logs

[1,1]:2024-09-03 17:36:47.334739: I external/local_xla/xla/service/service.cc:168] XLA service 0x7f830800cb80 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
[1,1]:2024-09-03 17:36:47.334775: I external/local_xla/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
[1,1]:WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
[1,1]:I0000 00:00:1725356207.355120 11294 device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
[1,1]:2024-09-03 17:36:47.355318: E external/local_xla/xla/stream_executor/stream_executor_internal.h:177] SetPriority unimplemented for this stream.
[1,1]:2024-09-03 17:36:47.355394: E external/local_xla/xla/stream_executor/stream_executor_internal.h:177] SetPriority unimplemented for this stream.
[1,0]:2024-09-03 17:36:49.398339: I external/local_xla/xla/service/service.cc:168] XLA service 0x7f803000a400 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
[1,0]:2024-09-03 17:36:49.398366: I external/local_xla/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
[1,0]:WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
[1,0]:I0000 00:00:1725356209.419932 11286 device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
[1,0]:2024-09-03 17:36:49.420144: E external/local_xla/xla/stream_executor/stream_executor_internal.h:177] SetPriority unimplemented for this stream.
[1,0]:2024-09-03 17:36:49.423370: E external/local_xla/xla/stream_executor/stream_executor_internal.h:177] SetPriority unimplemented for this stream.
[1,0]:2024-09-03 17:36:49.475450: E external/local_xla/xla/stream_executor/stream_executor_internal.h:177] SetPriority unimplemented for this stream.
[1,1]:2024-09-03 17:36:49.479112: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at scatter_nd_op.cc:115 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,1]:
[1,1]:Stack trace for op definition:
[1,1]:dummy_file_name:10:dummy_function_name
[1,1]:
[1,0]:2024-09-03 17:36:49.479265: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at scatter_nd_op.cc:115 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,0]:
[1,0]:Stack trace for op definition:
[1,0]:dummy_file_name:10:dummy_function_name
[1,0]:
[1,1]:2024-09-03 17:36:49.479358: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at xla_ops.cc:791 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,1]:
[1,1]:Stack trace for op definition:
[1,1]:dummy_file_name:10:dummy_function_name
[1,1]:
[1,1]: [[{{function_node __forward_call_1818}}{{node movie_DenseUnifiedEmbeddingLayer/movie_DenseUnifiedEmbeddingLayer/ScatterNd}}]]
[1,1]: tf2xla conversion failed while converting cluster_5[_XlaCompiledKernel=true,_XlaHasReferenceVars=false,_XlaNumConstantArgs=4,_XlaNumResourceArgs=0]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
[1,0]:2024-09-03 17:36:49.479510: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at xla_ops.cc:791 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,0]:
[1,0]:Stack trace for op definition:
[1,0]:dummy_file_name:10:dummy_function_name
[1,0]:
[1,0]: [[{{function_node __forward_call_1823}}{{node movie_DenseUnifiedEmbeddingLayer/movie_DenseUnifiedEmbeddingLayer/ScatterNd}}]]
[1,0]: tf2xla conversion failed while converting cluster_5[_XlaCompiledKernel=true,_XlaHasReferenceVars=false,_XlaNumConstantArgs=4,_XlaNumResourceArgs=0]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
[1,1]:2024-09-03 17:36:49.482065: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at scatter_nd_op.cc:115 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,1]:
[1,1]:Stack trace for op definition:
[1,1]:dummy_file_name:10:dummy_function_name
[1,1]:
[1,0]:2024-09-03 17:36:49.482107: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at scatter_nd_op.cc:115 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,0]:
[1,0]:Stack trace for op definition:
[1,0]:dummy_file_name:10:dummy_function_name
[1,0]:
[1,1]:2024-09-03 17:36:49.482270: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at xla_ops.cc:791 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,1]:
[1,1]:Stack trace for op definition:
[1,1]:dummy_file_name:10:dummy_function_name
[1,1]:
[1,1]: [[{{function_node __forward_call_1818}}{{node user_DenseUnifiedEmbeddingLayer/user_DenseUnifiedEmbeddingLayer/ScatterNd}}]]
[1,1]: tf2xla conversion failed while converting cluster_6[_XlaCompiledKernel=true,_XlaHasReferenceVars=false,_XlaNumConstantArgs=4,_XlaNumResourceArgs=0]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
[1,0]:2024-09-03 17:36:49.482339: W tensorflow/core/framework/op_kernel.cc:1839] OP_REQUIRES failed at xla_ops.cc:791 : INVALID_ARGUMENT: Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,0]:
[1,0]:Stack trace for op definition:
[1,0]:dummy_file_name:10:dummy_function_name
[1,0]:
[1,0]: [[{{function_node __forward_call_1823}}{{node user_DenseUnifiedEmbeddingLayer/user_DenseUnifiedEmbeddingLayer/ScatterNd}}]]
[1,0]: tf2xla conversion failed while converting cluster_6[_XlaCompiledKernel=true,_XlaHasReferenceVars=false,_XlaNumConstantArgs=4,_XlaNumResourceArgs=0]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
[1,1]:Traceback (most recent call last):
[1,1]: File "/home/nguser/zhangli28/two_tower/demo.py", line 816, in <module>
[1,1]: app.run(main)
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/absl/app.py", line 308, in run
[1,1]: _run_main(main, args)
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
[1,1]: sys.exit(main(argv))
[1,1]: ^^^^^^^^^^
[1,1]: File "/home/nguser/zhangli28/two_tower/demo.py", line 804, in main
[1,1]: train()
[1,1]: File "/home/nguser/zhangli28/two_tower/demo.py", line 704, in train
[1,1]: model.fit(dataset,
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
[1,1]: raise e.with_traceback(filtered_tb) from None
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
[1,1]: tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
[1,1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[1,1]:tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:
[1,1]:
[1,1]:Detected at node movie_DenseUnifiedEmbeddingLayer/movie_DenseUnifiedEmbeddingLayer/ScatterNd defined at (most recent call last):
[1,1]: File "/home/nguser/zhangli28/two_tower/demo.py", line 816, in <module>
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/absl/app.py", line 308, in run
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
[1,1]:
[1,1]: File "/home/nguser/zhangli28/two_tower/demo.py", line 804, in main
[1,1]:
[1,1]: File "/home/nguser/zhangli28/two_tower/demo.py", line 704, in train
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/engine/training.py", line 1807, in fit
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/engine/training.py", line 1401, in train_function
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/engine/training.py", line 1384, in step_function
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/engine/training.py", line 1373, in run_step
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/engine/training.py", line 1150, in train_step
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/engine/training.py", line 590, in call
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/engine/base_layer.py", line 1149, in call
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler
[1,1]:
[1,1]: File "/home/nguser/zhangli28/two_tower/demo.py", line 450, in call
[1,1]:
[1,1]: File "/home/nguser/zhangli28/two_tower/demo.py", line 318, in call
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/engine/base_layer.py", line 1149, in call
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/keras/src/utils/traceback_utils.py", line 96, in error_handler
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/tensorflow_recommenders_addons/dynamic_embedding/python/keras/layers/embedding.py", line 564, in call
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/tensorflow_recommenders_addons/dynamic_embedding/python/ops/shadow_embedding_ops.py", line 312, in embedding_lookup_unique_base
[1,1]:
[1,1]: File "/home/nguser/miniconda3/envs/dssm/lib/python3.11/site-packages/tensorflow_recommenders_addons/dynamic_embedding/python/ops/shadow_embedding_ops.py", line 441, in alltoall_embedding_lookup
[1,1]:
[1,1]:Must have updates.shape = indices.shape[:batch_dim] + buffer_shape[num_index_dims:], got updates.shape: [1,32], indices.shape: [2,1], buffer_shape: [1,32], num_index_dims: 1, and batch_dim: 1
[1,1]:
[1,1]:Stack trace for op definition:
[1,1]:dummy_file_name:10:dummy_function_name
[1,1]:
[1,1]: [[{{node movie_DenseUnifiedEmbeddingLayer/movie_DenseUnifiedEmbeddingLayer/ScatterNd}}]]
[1,1]: tf2xla conversion failed while converting cluster_5[_XlaCompiledKernel=true,_XlaHasReferenceVars=false,_XlaNumConstantArgs=4,_XlaNumResourceArgs=0]. Run with TF_DUMP_GRAPH_PREFIX=/path/to/dump/dir and --vmodule=xla_compiler=2 to obtain a dump of the compiled functions.
[1,1]: [[cluster_5_1/xla_compile]] [Op:__inference_train_function_5638]

W-O-W, Sep 03 '24 09:09

I tried replacing HvdAllToAllEmbedding with BasicEmbedding. However, when I look up the same mock id in BasicEmbedding and print the result with tf.print, the embeddings differ between workers during training. The Dense kernels I printed are identical across workers. My guess is that the gradients of HvdAllToAllEmbedding are not broadcast by Horovod.

W-O-W, Sep 03 '24 09:09

Setting os.environ['TF_XLA_FLAGS'] = "" fixes it.
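A minimal sketch of that workaround, assuming the variable is cleared at the very top of the training script, before TensorFlow is imported, so that TensorFlow sees the empty value when it starts up. The placement is an assumption; the thread does not say where the line was added.

```python
import os

# Assumption: this runs before `import tensorflow`, so the XLA flags are
# already empty when TensorFlow first reads the environment.
os.environ["TF_XLA_FLAGS"] = ""

# import tensorflow as tf  # import TensorFlow only after the variable is set

print(repr(os.environ["TF_XLA_FLAGS"]))
```

With Horovod, every worker process runs the same script, so each rank would clear the variable for itself.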

W-O-W, Sep 03 '24 10:09

Try XLA JIT level 1 for now. TFRA with XLA support will be available soon.
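One possible reading of this advice, sketched below, is to set TensorFlow's auto-clustering flag `--tf_xla_auto_jit` to 1 through `TF_XLA_FLAGS`. That interpretation is an assumption, since the comment does not name a flag. The helper `with_auto_jit` is hypothetical and only rewrites the flag string; as with the previous workaround, it would need to run before TensorFlow is imported.

```python
import os


def with_auto_jit(flags: str, level: int) -> str:
    """Return `flags` with any existing --tf_xla_auto_jit setting replaced by
    `level`, keeping all other flags. Hypothetical helper for illustration."""
    kept = [f for f in flags.split() if not f.startswith("--tf_xla_auto_jit")]
    return " ".join(kept + [f"--tf_xla_auto_jit={level}"])


# Assumption: executed before `import tensorflow` in the training script.
os.environ["TF_XLA_FLAGS"] = with_auto_jit(os.environ.get("TF_XLA_FLAGS", ""), 1)
print(os.environ["TF_XLA_FLAGS"])
```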

MoFHeka, Nov 15 '24 23:11