TensorFlow 2.13 distributed training fails
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
Yes
Source
binary
TensorFlow version
2.13.0
Custom code
No
OS platform and distribution
Linux Ubuntu 20.04.3
Mobile device
Linux Ubuntu 20.04.3
Python version
3.8.10
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
CUDA 11.7, cuDNN 8.6
GPU model and memory
3x NVIDIA GeForce RTX 3090
Current behavior?
When trying to run multiple distributed trainings one after another, one of them fails with a Collective ops is aborted by: ... error.
The reproducer attached to this issue produces the following error:
Collective ops is aborted by: Device /job:localhost/replica:0/task:0/device:GPU:1 is joining a group with size2, but that group has size 3 (group_key=1)
The error could be from a previous operation. Restart your program to reset.
[[{{node CollectiveReduceV2}}]] [Op:__inference_train_function_5585]
When run with TF 2.12 there is no such error.
The original code where I have encountered this problem results in
E Collective ops is aborted by: Shape mismatch in the collective instance 100. Op at device /job:localhost/replica:0/task:0/device:GPU:1 expected shape [517169] but another member in the group expected shape [516734]. This is likely due to different input shapes at different members of the collective op.
E The error could be from a previous operation. Restart your program to reset.
E [[{{node CollectiveReduceV2}}]] [Op:__inference_train_function_49105]
but I wasn't able to reproduce this with a small code snippet.
Standalone code to reproduce the issue
import pytest
import tensorflow as tf
import tensorflow_datasets as tfds


@pytest.mark.parametrize("devices", [1, 3, 2])
def test_distributed_fit(devices):
    datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
    mnist_train, mnist_test = datasets['train'], datasets['test']

    if devices == 1:
        strategy = tf.distribute.OneDeviceStrategy("/gpu:0")
    else:
        strategy = tf.distribute.MirroredStrategy([f"/gpu:{i}" for i in range(devices)])

    batch_size = 64 * strategy.num_replicas_in_sync
    train_dataset = mnist_test.cache().shuffle(10000).batch(batch_size)

    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(10)
        ])
        model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                      optimizer=tf.keras.optimizers.Adam(),
                      metrics=['accuracy'])

    model.fit(train_dataset, epochs=1)


if __name__ == '__main__':
    test_distributed_fit(1)
    test_distributed_fit(3)
    test_distributed_fit(2)
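For reference, a minimal sketch of a possible workaround, not a verified fix: the error text says "Restart your program to reset.", so one option is to run each device configuration in a fresh Python process so collective-op state from a previous strategy cannot leak into the next run. This assumes the snippet above is saved as reproducer.py.

import subprocess
import sys

# Hypothetical workaround sketch: each configuration runs in its own process,
# so the collective group registered by one MirroredStrategy does not persist
# into the next training. Assumes the reproducer above is saved as reproducer.py.
for devices in (1, 3, 2):
    subprocess.run(
        [sys.executable, "-c",
         f"from reproducer import test_distributed_fit; test_distributed_fit({devices})"],
        check=True,
    )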
Relevant log output
/home/nsavel/venvs/nncf_tf_213/bin/python /home/nsavel/workspace/nncf_tf_213/reproducer.py
2023-07-18 16:47:21.693862: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-07-18 16:47:21.722428: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:7630] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-07-18 16:47:21.722456: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-07-18 16:47:21.722481: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-07-18 16:47:21.728124: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-18 16:47:22.211027: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING:tensorflow:From /home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/tensorflow/python/ops/distributions/distribution.py:259: ReparameterizationType.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From /home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/tensorflow/python/ops/distributions/bernoulli.py:165: RegisterKL.__init__ (from tensorflow.python.ops.distributions.kullback_leibler) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
2023-07-18 16:47:24.321508: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1833] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22292 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:17:00.0, compute capability: 8.6
2023-07-18 16:47:24.322042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1833] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 22292 MB memory: -> device: 1, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:65:00.0, compute capability: 8.6
2023-07-18 16:47:24.322425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1833] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 22292 MB memory: -> device: 2, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:b3:00.0, compute capability: 8.6
2023-07-18 16:47:24.602273: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:552] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
2023-07-18 16:47:25.946425: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fcf358b4470 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-07-18 16:47:25.946450: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA GeForce RTX 3090, Compute Capability 8.6
2023-07-18 16:47:25.946455: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): NVIDIA GeForce RTX 3090, Compute Capability 8.6
2023-07-18 16:47:25.946458: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (2): NVIDIA GeForce RTX 3090, Compute Capability 8.6
2023-07-18 16:47:25.950178: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-07-18 16:47:26.074588: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:434] Loaded cuDNN version 8600
2023-07-18 16:47:26.171621: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
157/157 [==============================] - 2s 5ms/step - loss: 25.9054 - accuracy: 0.6873
2023-07-18 16:47:27.474184: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:552] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
2023-07-18 16:47:30.690312: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:434] Loaded cuDNN version 8600
2023-07-18 16:47:30.822607: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:434] Loaded cuDNN version 8600
53/53 [==============================] - 3s 7ms/step - loss: 43.9234 - accuracy: 0.5655
2023-07-18 16:47:31.372876: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:552] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
2023-07-18 16:47:32.398894: E tensorflow/core/common_runtime/base_collective_executor.cc:249] BaseCollectiveExecutor::StartAbort INTERNAL: Device /job:localhost/replica:0/task:0/device:GPU:1 is joining a group with size2, but that group has size 3 (group_key=1)
2023-07-18 16:47:32.398950: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 7416489994643074752
2023-07-18 16:47:32.399024: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 1224112818691547746
2023-07-18 16:47:32.399044: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 10338356286700713842
2023-07-18 16:47:32.399063: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 6809993284794892577
2023-07-18 16:47:32.399081: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 12460047264292639245
2023-07-18 16:47:32.399097: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 8051515006773529005
Traceback (most recent call last):
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 35, in <module>
test_distributed_fit(2)
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 29, in test_distributed_fit
model.fit(train_dataset, epochs=1)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:
Detected at node CollectiveReduceV2 defined at (most recent call last):
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 35, in <module>
test_distributed_fit(2)
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 35, in <module>
test_distributed_fit(2)
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 29, in test_distributed_fit
model.fit(train_dataset, epochs=1)
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 35, in <module>
test_distributed_fit(2)
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 29, in test_distributed_fit
model.fit(train_dataset, epochs=1)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 35, in <module>
test_distributed_fit(2)
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 29, in test_distributed_fit
model.fit(train_dataset, epochs=1)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1782, in fit
tmp_logs = self.train_function(iterator)
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 35, in <module>
test_distributed_fit(2)
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 29, in test_distributed_fit
model.fit(train_dataset, epochs=1)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1782, in fit
tmp_logs = self.train_function(iterator)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1376, in train_function
return step_function(self, iterator)
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 35, in <module>
test_distributed_fit(2)
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 29, in test_distributed_fit
model.fit(train_dataset, epochs=1)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1782, in fit
tmp_logs = self.train_function(iterator)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1376, in train_function
return step_function(self, iterator)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1359, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 35, in <module>
test_distributed_fit(2)
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 29, in test_distributed_fit
model.fit(train_dataset, epochs=1)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1782, in fit
tmp_logs = self.train_function(iterator)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1376, in train_function
return step_function(self, iterator)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1359, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/optimizers/utils.py", line 175, in _all_reduce_sum_fn
return distribution.extended.batch_reduce_to(
Collective ops is aborted by: Device /job:localhost/replica:0/task:0/device:GPU:1 is joining a group with size2, but that group has size 3 (group_key=1)
The error could be from a previous operation. Restart your program to reset.
[[{{node CollectiveReduceV2}}]] [Op:__inference_train_function_5585]
Process finished with exit code 1
I have the same issue. Source: binary; TensorFlow 2.13.0; Python 3.9.1; CUDA 12.1.1-1; cuDNN 8.9.1.23; GPU: 3x NVIDIA V100.
Hi @nikita-savelyevv,
In your attached code snippet I have changed this line:
train_dataset = mnist_test.cache().shuffle(10000).batch(batch_size)
to:
train_dataset = mnist_train.cache().shuffle(10000).batch(batch_size)
and then executed the code on a GCP VM with 4 GPUs. The logs are attached below.
(bazel) suryanarayanay@surya-ubuntu20:~$ python 61314_r3.py
2023-07-19 08:52:48.833519: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:8893] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-07-19 08:52:48.833729: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-07-19 08:52:48.837974: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-07-19 08:52:49.192706: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-19 08:52:50.862937: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING:tensorflow:From /home/suryanarayanay/miniconda3/envs/bazel/lib/python3.9/site-packages/tensorflow/python/ops/distributions/distribution.py:259: ReparameterizationType.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From /home/suryanarayanay/miniconda3/envs/bazel/lib/python3.9/site-packages/tensorflow/python/ops/distributions/bernoulli.py:165: RegisterKL.__init__ (from tensorflow.python.ops.distributions.kullback_leibler) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
2023-07-19 08:53:01.959096: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:01.961067: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:01.962833: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:01.964631: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:02.295809: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:02.297743: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:02.299515: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:02.301298: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:02.303146: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:02.304774: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:02.306338: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:02.307986: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:03.493607: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:03.495717: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:03.497548: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:03.499358: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:03.501174: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:03.502817: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:03.504433: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:03.506064: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:03.507816: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:03.509473: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:03.511157: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:03.512805: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.737342: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.739414: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.741220: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.743084: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.744986: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.746639: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.748217: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.749874: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.751576: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.753158: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1831] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13623 MB memory: -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
2023-07-19 08:53:06.753544: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.755104: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1831] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 13623 MB memory: -> device: 1, name: Tesla T4, pci bus id: 0000:00:05.0, compute capability: 7.5
2023-07-19 08:53:06.755482: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.757076: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1831] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 13623 MB memory: -> device: 2, name: Tesla T4, pci bus id: 0000:00:06.0, compute capability: 7.5
2023-07-19 08:53:06.757437: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.759072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1831] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 13623 MB memory: -> device: 3, name: Tesla T4, pci bus id: 0000:00:07.0, compute capability: 7.5
Devices: 1
2023-07-19 08:53:07.299239: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:552] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
2023-07-19 08:53:10.842723: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3a569dd0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-07-19 08:53:10.842919: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2023-07-19 08:53:10.842983: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): Tesla T4, Compute Capability 7.5
2023-07-19 08:53:10.843030: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (2): Tesla T4, Compute Capability 7.5
2023-07-19 08:53:10.843085: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (3): Tesla T4, Compute Capability 7.5
2023-07-19 08:53:10.944601: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-07-19 08:53:11.479846: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:440] Loaded cuDNN version 8600
2023-07-19 08:53:11.729654: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-07-19 08:53:11.981977: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
938/938 [==============================] - 10s 6ms/step - loss: 10.1223 - accuracy: 0.8286
Devices: 2
2023-07-19 08:53:19.515161: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:552] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
2023-07-19 08:53:47.328426: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 2 of 10000
2023-07-19 08:53:52.679706: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 3 of 10000
2023-07-19 08:54:08.411572: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 5 of 10000
2023-07-19 08:54:20.882652: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 6 of 10000
2023-07-19 08:54:34.713750: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 7 of 10000
2023-07-19 08:54:54.214257: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 8 of 10000
2023-07-19 08:55:09.061208: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 9 of 10000
2023-07-19 08:55:14.850822: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 10 of 10000
2023-07-19 08:55:20.150592: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 11 of 10000
2023-07-19 08:55:25.772313: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 12 of 10000
2023-07-19 08:55:31.710363: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 13 of 10000
2023-07-19 08:55:38.318281: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 14 of 10000
2023-07-19 08:55:44.011665: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 15 of 10000
2023-07-19 08:55:54.076764: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 17 of 10000
2023-07-19 08:56:07.181624: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 19 of 10000
2023-07-19 08:56:14.434589: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 20 of 10000
2023-07-19 08:56:29.020860: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 22 of 10000
2023-07-19 08:56:35.686768: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 23 of 10000
2023-07-19 08:56:47.350548: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 25 of 10000
2023-07-19 08:56:53.206492: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 26 of 10000
2023-07-19 08:57:03.189459: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 28 of 10000
2023-07-19 08:57:13.485880: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 30 of 10000
2023-07-19 08:57:24.014117: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 32 of 10000
2023-07-19 08:57:38.654104: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 34 of 10000
2023-07-19 08:57:45.836962: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 35 of 10000
2023-07-19 08:57:53.215866: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 36 of 10000
2023-07-19 08:58:03.464771: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 38 of 10000
2023-07-19 08:58:18.851690: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 39 of 10000
2023-07-19 08:58:37.214024: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 40 of 10000
2023-07-19 08:58:55.288479: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 41 of 10000
2023-07-19 08:59:06.636724: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 42 of 10000
2023-07-19 08:59:21.521155: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 43 of 10000
2023-07-19 08:59:34.989589: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 44 of 10000
2023-07-19 08:59:48.707579: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 45 of 10000
2023-07-19 09:00:05.466870: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 46 of 10000
2023-07-19 09:00:22.107982: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 47 of 10000
2023-07-19 09:00:37.668419: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 48 of 10000
2023-07-19 09:00:51.022583: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 49 of 10000
2023-07-19 09:01:04.442588: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 50 of 10000
2023-07-19 09:01:41.646110: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 51 of 10000
2023-07-19 09:02:27.555847: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 52 of 10000
2023-07-19 09:03:03.864518: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 53 of 10000
When devices=1, the code executes fine for me. When devices>=2, filling the shuffle buffer takes a very long time, which should not be the case. There seems to be a performance problem, but I am not sure whether your reported behaviour is reproducible, since I stopped the run because it was taking too long. The code used is attached as a gist here.
I have also tested the same code snippet attached by @nikita-savelyevv and found that the program hangs when the devices=2 case starts executing. Logs are attached below.
(bazel) suryanarayanay@surya-ubuntu20:~$ python 61314_r2.py
2023-07-19 09:15:25.741329: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:8893] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-07-19 09:15:25.741552: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-07-19 09:15:25.746410: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-07-19 09:15:26.157582: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-19 09:15:27.903770: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING:tensorflow:From /home/suryanarayanay/miniconda3/envs/bazel/lib/python3.9/site-packages/tensorflow/python/ops/distributions/distribution.py:259: ReparameterizationType.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From /home/suryanarayanay/miniconda3/envs/bazel/lib/python3.9/site-packages/tensorflow/python/ops/distributions/bernoulli.py:165: RegisterKL.__init__ (from tensorflow.python.ops.distributions.kullback_leibler) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
2023-07-19 09:15:38.960161: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:38.962045: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:38.963767: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:38.965520: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:39.264719: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:39.266613: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:39.268353: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:39.270109: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:39.271916: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:39.273529: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:39.275073: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:39.276628: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:40.428372: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:40.430244: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:40.432022: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:40.433705: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:40.435508: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:40.437049: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:40.438592: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:40.440161: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:40.441733: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:40.443276: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:40.444815: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:40.446383: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.793004: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.794994: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.796831: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.798663: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.800458: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.802054: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.803632: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.805221: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.806810: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.808346: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1831] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13623 MB memory: -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
2023-07-19 09:15:43.808735: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.810249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1831] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 13623 MB memory: -> device: 1, name: Tesla T4, pci bus id: 0000:00:05.0, compute capability: 7.5
2023-07-19 09:15:43.810625: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.812202: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1831] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 13623 MB memory: -> device: 2, name: Tesla T4, pci bus id: 0000:00:06.0, compute capability: 7.5
2023-07-19 09:15:43.812576: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.814108: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1831] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 13623 MB memory: -> device: 3, name: Tesla T4, pci bus id: 0000:00:07.0, compute capability: 7.5
Devices: 1
2023-07-19 09:15:44.326961: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:552] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
2023-07-19 09:15:47.663646: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3a7b33e0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-07-19 09:15:47.663778: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2023-07-19 09:15:47.663798: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): Tesla T4, Compute Capability 7.5
2023-07-19 09:15:47.663810: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (2): Tesla T4, Compute Capability 7.5
2023-07-19 09:15:47.663821: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (3): Tesla T4, Compute Capability 7.5
2023-07-19 09:15:47.778929: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-07-19 09:15:48.326253: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:440] Loaded cuDNN version 8600
2023-07-19 09:15:48.567179: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-07-19 09:15:48.771324: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
157/157 [==============================] - 5s 2ms/step - loss: 26.1179 - accuracy: 0.6900
Devices: 2
2023-07-19 09:15:50.748191: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:552] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
##Hangs here
The tested code is attached here as a Colab gist.
Thanks!
@SuryanarayanaY Thanks for reaching out! I used mnist_test intentionally to slightly speed up the reproduction.
I agree with your results. For me, when the order of devices is set to 1, 2, 3, the devices=2 case also hangs as you describe. For the order 1, 3, 2, the devices=2 case produces the error I've attached in the ticket.
Since the machine you ran the code on has 4 GPUs, I would expect that setting the order to something like 1, 4, 3, 2 would also lead to the error I attached.
Anyway, I would assume that these two problems (hanging and throwing an error) are related and may have the same cause.
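To make this order-dependence easy to re-check, here is a hedged pytest sketch; reproducer.py is the hypothetical module name taken from the paths in the logs above, and only the parametrize order differs between the two tests. Each test should be run in its own pytest invocation (e.g. with -k), since both failure modes depend on the process's history.

import pytest

# Import under a non-test name so pytest does not collect the original
# parametrized test a second time from this module.
from reproducer import test_distributed_fit as run_distributed_fit


@pytest.mark.parametrize("devices", [1, 2, 3])
def test_order_1_2_3(devices):
    # With this order, the devices=2 case is reported to hang.
    run_distributed_fit(devices)


@pytest.mark.parametrize("devices", [1, 3, 2])
def test_order_1_3_2(devices):
    # With this order, the devices=2 case is reported to fail with the
    # "is joining a group with size 2, but that group has size 3" error.
    run_distributed_fit(devices)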
Adding to this: distributed training also hangs with tensorflow==2.13.1.
A small Fashion-MNIST example to reproduce it; the jit-compiled model fails to train and hangs:
import tensorflow as tf
from keras import Model
from keras.layers import Dense, Dropout, Flatten, Input
from keras.utils import set_random_seed


def get_model() -> Model:
    set_random_seed(42)

    inp = Input(shape=(28, 28))
    inp = tf.expand_dims(inp, axis=-1)
    flt = Flatten()(inp)
    hdn = Dense(32, activation="relu")(flt)
    drp = Dropout(0.2)(hdn)
    out = Dense(10)(drp)

    model = Model(inputs=inp, outputs=out, name="mnist_model")
    print(model.summary())
    return model


def main():
    strategy = tf.distribute.MirroredStrategy()
    print("Number of devices: {}".format(strategy.num_replicas_in_sync))
    assert (
        strategy.num_replicas_in_sync > 1
    ), "strategy.num_replicas_in_sync must be greater than 1 or else problem will not be shown"

    with strategy.scope():
        model = get_model()
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=["accuracy"],
            jit_compile=True,  # FIXME: the jit-compiled model fails to train and hangs during fit
        )

    fashion_mnist = tf.keras.datasets.fashion_mnist
    (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
    train_images = train_images / 255.0
    test_images = test_images / 255.0

    model.fit(train_images, train_labels, epochs=5, batch_size=1024)


if __name__ == "__main__":
    main()
2023-07-25 09:31:22.920814: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-25 09:31:22.922395: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13576 MB memory: -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
2023-07-25 09:31:22.923073: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-25 09:31:22.924679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 13576 MB memory: -> device: 1, name: Tesla T4, pci bus id: 0000:00:05.0, compute capability: 7.5
Number of devices: 2
Model: "mnist_model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) [(None, 28, 28, 1)] 0
flatten (Flatten) (None, 784) 0
dense (Dense) (None, 32) 25120
dropout (Dropout) (None, 32) 0
dense_1 (Dense) (None, 10) 330
=================================================================
Total params: 25450 (99.41 KB)
Trainable params: 25450 (99.41 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
Epoch 1/5
2023-07-25 09:31:25.593085: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x296a66b0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-07-25 09:31:25.593171: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2023-07-25 09:31:25.593209: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): Tesla T4, Compute Capability 7.5
2023-07-25 09:31:25.641984: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:255] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-07-25 09:31:25.662430: W tensorflow/compiler/tf2xla/kernels/random_ops.cc:57] Warning: Using tf.random.uniform with XLA compilation will ignore seeds; consider using tf.random.stateless_uniform instead if reproducible behavior is desired. mnist_model/dropout/dropout/random_uniform/RandomUniform
2023-07-25 09:31:26.958144: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:432] Loaded cuDNN version 8600
2023-07-25 09:31:26.984048: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-07-25 09:31:27.345626: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:432] Loaded cuDNN version 8600
2023-07-25 09:31:28.252749: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
...... hangs here; no more log lines are produced.
Same issue here! Using 4 GPUs for distributed training on Ubuntu 22.04 with TensorFlow 2.13 hangs at the "Compiled cluster using XLA" line. Solved by downgrading to 2.12.
Same for me with an RTX 8000 and A6000 setup in MirroredStrategy with NCCL and hierarchical cross-device ops (a sketch of this setup follows below). I get a huge block of tensorflow/core/framework/local_rendezvous.cc:405 Local rendezvous recv item cancelled.
before the first epoch, but it doesn't hang for me and I can successfully train the model. I was a bit too excited about 2.13 fixing the "placeholder tensor" warning from 2.12.
Would be nice to get some feedback from the team on whether this is reproducible.
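A minimal sketch of such a setup, assuming the commenter is using the public cross_device_ops options of MirroredStrategy (this is not their exact code):

import tensorflow as tf

# Sketch only: MirroredStrategy with an explicitly chosen all-reduce implementation.
nccl_strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.NcclAllReduce()
)
hierarchical_strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()
)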
Device /job:localhost/replica:0/task:0/device:GPU:1 is joining a group with size 2, but that group has size 3 (group_key=1)
means that when you are running the function with 2 GPUs, the collective op from the previous function call with 3 GPUs might still be pending.
Generally it's not a good idea to create multiple tf.distribute.Strategy instances in sequence in a production job, as they share the same collective keys, which is very likely to cause arbitrary collisions between multiple all-reduces. For this case, try resetting the context at the beginning of each test case. Example: https://github.com/tensorflow/tensorflow/blob/2a7efd891d3b16ef82b462d76fd9e61d111bf901/tensorflow/python/distribute/mirrored_strategy_test.py#L355
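A minimal sketch of what "reset the context at the beginning of each test case" could look like in a pytest suite, assuming the private helper tensorflow.python.eager.context._reset_context() (an internal API that may change between releases; the linked test uses TensorFlow's own test utilities instead):

import pytest
from tensorflow.python.eager import context  # private TensorFlow module


@pytest.fixture(autouse=True)
def reset_tf_context():
    # Drop the existing eager context (and any pending collective state)
    # before each test so every strategy starts from a clean slate.
    context._reset_context()
    yield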
I am facing a similar issue to the one described by @xinyu-dev: Ubuntu 22.04 and TensorFlow 2.13.0, but running from a docker image using gcr.io/deeplearning-platform-release/tf2-gpu.2-13.py310:m111 as a base, on Vertex AI, with 4 x T4 GPUs. I trained with MirroredStrategy, which defaults to NcclAllReduce. The training hangs with 100% GPU and memory utilization. I turned on NCCL_DEBUG=INFO, and here is what I have in my logs:
INFO 2023-09-13T15:46:26.702940522Z [resource.labels.taskName: workerpool0-0] NCCL INFO Bootstrap : Using eth0:10.128.0.57<0>
INFO 2023-09-13T15:46:26.702990425Z [resource.labels.taskName: workerpool0-0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
INFO 2023-09-13T15:46:26.703005755Z [resource.labels.taskName: workerpool0-0] NCCL INFO NET/Plugin: Loaded net plugin FastSocket (v4)
INFO 2023-09-13T15:46:26.703019842Z [resource.labels.taskName: workerpool0-0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
INFO 2023-09-13T15:46:26.703028102Z [resource.labels.taskName: workerpool0-0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
INFO 2023-09-13T15:46:26.703034293Z [resource.labels.taskName: workerpool0-0] NCCL INFO cudaDriverVersion 11040
INFO 2023-09-13T15:46:26.703040462Z [resource.labels.taskName: workerpool0-0] NCCL version 2.13.4+cudaCUDA_MAJOR.CUDA_MINOR
INFO 2023-09-13T15:46:26.893955808Z [resource.labels.taskName: workerpool0-0] NCCL INFO NET/FastSocket : Tx CPU start: -2
INFO 2023-09-13T15:46:26.894000335Z [resource.labels.taskName: workerpool0-0] NCCL INFO NET/FastSocket : Rx CPU start: -2
INFO 2023-09-13T15:46:26.894009348Z [resource.labels.taskName: workerpool0-0] NCCL INFO NET/FastSocket : Flow placement enabled.
INFO 2023-09-13T15:46:26.894015649Z [resource.labels.taskName: workerpool0-0] NCCL INFO NET/FastSocket : queue skip: 0
INFO 2023-09-13T15:46:26.894021171Z [resource.labels.taskName: workerpool0-0] NCCL INFO NET/FastSocket : Using [0]eth0:10.128.0.57<0> [1]veth46fef7a:fe80::ec4e:4ff:fe68:e6aa%veth46fef7a<0>
INFO 2023-09-13T15:46:26.894045883Z [resource.labels.taskName: workerpool0-0] NCCL INFO NET/FastSocket plugin initialized
INFO 2023-09-13T15:46:26.894052051Z [resource.labels.taskName: workerpool0-0] NCCL INFO Using network FastSocket
INFO 2023-09-13T15:46:26.894058304Z [resource.labels.taskName: workerpool0-0] NCCL INFO Using network FastSocket
INFO 2023-09-13T15:46:26.894064036Z [resource.labels.taskName: workerpool0-0] NCCL INFO Using network FastSocket
INFO 2023-09-13T15:46:26.894069715Z [resource.labels.taskName: workerpool0-0] NCCL INFO Using network FastSocket
INFO 2023-09-13T15:46:26.894075386Z [resource.labels.taskName: workerpool0-0] NCCL INFO PXN Disabled as plugin is v4
INFO 2023-09-13T15:46:26.894079843Z [resource.labels.taskName: workerpool0-0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
INFO 2023-09-13T15:46:26.894084842Z [resource.labels.taskName: workerpool0-0] NCCL INFO P2P Chunksize set to 131072
INFO 2023-09-13T15:46:26.894089473Z [resource.labels.taskName: workerpool0-0] NCCL INFO Channel 00/02 : 0 1 2 3
INFO 2023-09-13T15:46:26.894094426Z [resource.labels.taskName: workerpool0-0] NCCL INFO Channel 01/02 : 0 1 2 3
INFO 2023-09-13T15:46:26.894101796Z [resource.labels.taskName: workerpool0-0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
INFO 2023-09-13T15:46:26.894108081Z [resource.labels.taskName: workerpool0-0] NCCL INFO P2P Chunksize set to 131072
INFO 2023-09-13T15:46:26.894112965Z [resource.labels.taskName: workerpool0-0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
INFO 2023-09-13T15:46:26.894118387Z [resource.labels.taskName: workerpool0-0] NCCL INFO P2P Chunksize set to 131072
INFO 2023-09-13T15:46:26.894124008Z [resource.labels.taskName: workerpool0-0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
INFO 2023-09-13T15:46:26.894129976Z [resource.labels.taskName: workerpool0-0] NCCL INFO P2P Chunksize set to 131072
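For reference, NCCL debug output like the block above can be enabled by setting NCCL_DEBUG before NCCL is initialized; a minimal sketch in Python (the same can be done by exporting the variable in the shell):

import os

# Must be set before TensorFlow initializes NCCL (i.e. before the first collective op runs).
os.environ["NCCL_DEBUG"] = "INFO"

import tensorflow as tf  # imported after setting the variable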
Same issue here! Using 4 GPUs for distributed training on Ubuntu 22.04 with TensorFlow 2.13 hangs at the "Compiled cluster using XLA" line. Solved by downgrading to 2.12.
You may refer to my issue in #62234.
Currently, using RING instead of NCCL is a temporary workaround (https://github.com/edwardyehuang/iSeg/blob/master/utils/distribution_utils.py); see the sketch after this comment.
Another workaround (2.13 only at the moment): use conda install -c conda-forge tensorflow-gpu instead of docker or pip.
Besides, if anyone has a tf-nightly GPU wheel from Mar 17 or April 27, please share it with me so I can test whether the pull request https://github.com/tensorflow/tensorflow/pull/60001 or https://github.com/tensorflow/tensorflow/pull/59424 caused this issue.
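A minimal sketch of the RING workaround, assuming it is done through the public communication options of MultiWorkerMirroredStrategy (the linked distribution_utils.py may implement it differently):

import tensorflow as tf

# Select the RING collective implementation instead of NCCL.
options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.RING
)
strategy = tf.distribute.MultiWorkerMirroredStrategy(communication_options=options)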
Another thing worth attention: why does the third-party (conda-forge) conda build avoid this issue? (The TensorFlow in the docker image is installed directly from pip.)
Same issue here. Has anyone found a solution?
Upgrading the NVIDIA driver to >= 545 should address the issue.
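A quick way to check the currently installed driver version (a sketch that assumes nvidia-smi is on PATH; it is not part of any fix discussed here):

import subprocess

# Print the NVIDIA driver version reported by nvidia-smi.
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout.strip())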
I got a similar error on an NVIDIA GPU while using the tensorflow-federated and nest_asyncio packages. The error appeared when I updated the tensorflow-federated package from version 0.38.0 to 0.73.0. I tried updating nest_asyncio, but it didn't help, so I just muted the message.
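# Hides INFO-level C++ log messages; must be set before TensorFlow is imported
# ("1" filters INFO, "2" also WARNING, "3" also ERROR).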
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"
Check for details.
Similar error. Fixed it by removing the steps_per_epoch argument from model.fit() and model.evaluate():
import sys
import numpy as np
import tensorflow as tf
from matplotlib import pyplot
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dense, Flatten
from keras.optimizers import SGD
from tensorflow.keras.preprocessing.image import ImageDataGenerator

physical_devices = tf.config.list_physical_devices('GPU')
try:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
except:
    pass  # Invalid device or cannot modify virtual devices once initialized.

# define cnn model
def define_model():
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same', input_shape=(200, 200, 3)))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(1, activation='sigmoid'))
    # compile model
    opt = SGD(learning_rate=0.001, momentum=0.9)
    model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
    return model

# create data generator
datagen = ImageDataGenerator(rescale=1.0/255.0)
model = define_model()

# prepare iterators
train_it = datagen.flow_from_directory('/workspace/workspace/cats_and_dogs_data/dogs-vs-cats/train/', class_mode='binary', batch_size=64, target_size=(200, 200))
test_it = datagen.flow_from_directory('/workspace/workspace/cats_and_dogs_data/dogs-vs-cats/test1/', class_mode='binary', batch_size=64, target_size=(200, 200))

# fit model
history = model.fit(train_it, validation_data=test_it, epochs=20, verbose=1)

# evaluate model
_, acc = model.evaluate(test_it, verbose=1)
print('> %.3f' % (acc * 100.0))