TensorFlow 2.13 distributed training fails
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
Yes
Source
binary
TensorFlow version
2.13.0
Custom code
No
OS platform and distribution
Linux Ubuntu 20.04.3
Mobile device
Linux Ubuntu 20.04.3
Python version
3.8.10
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
CUDA 11.7, cuDNN 8.6
GPU model and memory
3x NVIDIA GeForce RTX 3090
Current behavior?
When trying to run multiple distributed trainings one after another, one of them fails with a Collective ops is aborted by: ... error.
The reproducer attached to this issue produces the following error:
Collective ops is aborted by: Device /job:localhost/replica:0/task:0/device:GPU:1 is joining a group with size2, but that group has size 3 (group_key=1)
The error could be from a previous operation. Restart your program to reset.
[[{{node CollectiveReduceV2}}]] [Op:__inference_train_function_5585]
When run with TF 2.12 there is no such error.
The original code where I have encountered this problem results in
E Collective ops is aborted by: Shape mismatch in the collective instance 100. Op at device /job:localhost/replica:0/task:0/device:GPU:1 expected shape [517169] but another member in the group expected shape [516734]. This is likely due to different input shapes at different members of the collective op.
E The error could be from a previous operation. Restart your program to reset.
E [[{{node CollectiveReduceV2}}]] [Op:__inference_train_function_49105]
but I wasn't able to reproduce this with a small code snippet.
Standalone code to reproduce the issue
import pytest
import tensorflow as tf
import tensorflow_datasets as tfds


@pytest.mark.parametrize("devices", [1, 3, 2])
def test_distributed_fit(devices):
    datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
    mnist_train, mnist_test = datasets['train'], datasets['test']

    if devices == 1:
        strategy = tf.distribute.OneDeviceStrategy("/gpu:0")
    else:
        strategy = tf.distribute.MirroredStrategy([f"/gpu:{i}" for i in range(devices)])

    batch_size = 64 * strategy.num_replicas_in_sync
    train_dataset = mnist_test.cache().shuffle(10000).batch(batch_size)

    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(10)
        ])
        model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                      optimizer=tf.keras.optimizers.Adam(),
                      metrics=['accuracy'])

    model.fit(train_dataset, epochs=1)


if __name__ == '__main__':
    test_distributed_fit(1)
    test_distributed_fit(3)
    test_distributed_fit(2)
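For reference, a minimal sketch of a possible workaround, not a verified fix: the error text says "Restart your program to reset.", so one option is to run each device configuration in a fresh Python process so collective-op state from a previous strategy cannot leak into the next run. This assumes the snippet above is saved as reproducer.py.

import subprocess
import sys

# Hypothetical workaround sketch: each configuration runs in its own process,
# so the collective group registered by one MirroredStrategy does not persist
# into the next training. Assumes the reproducer above is saved as reproducer.py.
for devices in (1, 3, 2):
    subprocess.run(
        [sys.executable, "-c",
         f"from reproducer import test_distributed_fit; test_distributed_fit({devices})"],
        check=True,
    )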
Relevant log output
/home/nsavel/venvs/nncf_tf_213/bin/python /home/nsavel/workspace/nncf_tf_213/reproducer.py
2023-07-18 16:47:21.693862: I tensorflow/core/util/port.cc:110] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-07-18 16:47:21.722428: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:7630] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-07-18 16:47:21.722456: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-07-18 16:47:21.722481: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-07-18 16:47:21.728124: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-18 16:47:22.211027: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING:tensorflow:From /home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/tensorflow/python/ops/distributions/distribution.py:259: ReparameterizationType.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From /home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/tensorflow/python/ops/distributions/bernoulli.py:165: RegisterKL.__init__ (from tensorflow.python.ops.distributions.kullback_leibler) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
2023-07-18 16:47:24.321508: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1833] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22292 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:17:00.0, compute capability: 8.6
2023-07-18 16:47:24.322042: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1833] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 22292 MB memory: -> device: 1, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:65:00.0, compute capability: 8.6
2023-07-18 16:47:24.322425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1833] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 22292 MB memory: -> device: 2, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:b3:00.0, compute capability: 8.6
2023-07-18 16:47:24.602273: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:552] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
2023-07-18 16:47:25.946425: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fcf358b4470 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-07-18 16:47:25.946450: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA GeForce RTX 3090, Compute Capability 8.6
2023-07-18 16:47:25.946455: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): NVIDIA GeForce RTX 3090, Compute Capability 8.6
2023-07-18 16:47:25.946458: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (2): NVIDIA GeForce RTX 3090, Compute Capability 8.6
2023-07-18 16:47:25.950178: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-07-18 16:47:26.074588: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:434] Loaded cuDNN version 8600
2023-07-18 16:47:26.171621: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
157/157 [==============================] - 2s 5ms/step - loss: 25.9054 - accuracy: 0.6873
2023-07-18 16:47:27.474184: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:552] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
2023-07-18 16:47:30.690312: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:434] Loaded cuDNN version 8600
2023-07-18 16:47:30.822607: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:434] Loaded cuDNN version 8600
53/53 [==============================] - 3s 7ms/step - loss: 43.9234 - accuracy: 0.5655
2023-07-18 16:47:31.372876: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:552] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
2023-07-18 16:47:32.398894: E tensorflow/core/common_runtime/base_collective_executor.cc:249] BaseCollectiveExecutor::StartAbort INTERNAL: Device /job:localhost/replica:0/task:0/device:GPU:1 is joining a group with size2, but that group has size 3 (group_key=1)
2023-07-18 16:47:32.398950: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 7416489994643074752
2023-07-18 16:47:32.399024: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 1224112818691547746
2023-07-18 16:47:32.399044: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 10338356286700713842
2023-07-18 16:47:32.399063: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 6809993284794892577
2023-07-18 16:47:32.399081: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 12460047264292639245
2023-07-18 16:47:32.399097: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 8051515006773529005
Traceback (most recent call last):
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 35, in <module>
test_distributed_fit(2)
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 29, in test_distributed_fit
model.fit(train_dataset, epochs=1)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 70, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 53, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:
Detected at node CollectiveReduceV2 defined at (most recent call last):
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 35, in <module>
test_distributed_fit(2)
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 35, in <module>
test_distributed_fit(2)
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 29, in test_distributed_fit
model.fit(train_dataset, epochs=1)
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 35, in <module>
test_distributed_fit(2)
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 29, in test_distributed_fit
model.fit(train_dataset, epochs=1)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 35, in <module>
test_distributed_fit(2)
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 29, in test_distributed_fit
model.fit(train_dataset, epochs=1)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1782, in fit
tmp_logs = self.train_function(iterator)
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 35, in <module>
test_distributed_fit(2)
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 29, in test_distributed_fit
model.fit(train_dataset, epochs=1)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1782, in fit
tmp_logs = self.train_function(iterator)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1376, in train_function
return step_function(self, iterator)
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 35, in <module>
test_distributed_fit(2)
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 29, in test_distributed_fit
model.fit(train_dataset, epochs=1)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1782, in fit
tmp_logs = self.train_function(iterator)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1376, in train_function
return step_function(self, iterator)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1359, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 35, in <module>
test_distributed_fit(2)
File "/home/nsavel/workspace/nncf_tf_213/reproducer.py", line 29, in test_distributed_fit
model.fit(train_dataset, epochs=1)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/utils/traceback_utils.py", line 65, in error_handler
return fn(*args, **kwargs)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1782, in fit
tmp_logs = self.train_function(iterator)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1376, in train_function
return step_function(self, iterator)
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/engine/training.py", line 1359, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/home/nsavel/venvs/nncf_tf_213/lib/python3.8/site-packages/keras/src/optimizers/utils.py", line 175, in _all_reduce_sum_fn
return distribution.extended.batch_reduce_to(
Collective ops is aborted by: Device /job:localhost/replica:0/task:0/device:GPU:1 is joining a group with size2, but that group has size 3 (group_key=1)
The error could be from a previous operation. Restart your program to reset.
[[{{node CollectiveReduceV2}}]] [Op:__inference_train_function_5585]
Process finished with exit code 1
I have the same issue. Source: binary; TensorFlow 2.13.0; Python 3.9.1; CUDA 12.1.1-1; cuDNN 8.9.1.23; GPU: 3x NVIDIA V100.
Hi @nikita-savelyevv,
In your attached code snippet I have changed this line:
train_dataset = mnist_test.cache().shuffle(10000).batch(batch_size)
to:
train_dataset = mnist_train.cache().shuffle(10000).batch(batch_size)
and then executed the code on a GCP VM with 4 GPUs. The logs are attached below.
(bazel) suryanarayanay@surya-ubuntu20:~$ python 61314_r3.py
2023-07-19 08:52:48.833519: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:8893] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-07-19 08:52:48.833729: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-07-19 08:52:48.837974: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-07-19 08:52:49.192706: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-19 08:52:50.862937: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING:tensorflow:From /home/suryanarayanay/miniconda3/envs/bazel/lib/python3.9/site-packages/tensorflow/python/ops/distributions/distribution.py:259: ReparameterizationType.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From /home/suryanarayanay/miniconda3/envs/bazel/lib/python3.9/site-packages/tensorflow/python/ops/distributions/bernoulli.py:165: RegisterKL.__init__ (from tensorflow.python.ops.distributions.kullback_leibler) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
2023-07-19 08:53:01.959096: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:01.961067: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:01.962833: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:01.964631: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:02.295809: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:02.297743: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:02.299515: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:02.301298: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:02.303146: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:02.304774: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:02.306338: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:02.307986: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:03.493607: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:03.495717: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:03.497548: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:03.499358: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:03.501174: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:03.502817: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:03.504433: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:03.506064: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:03.507816: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:03.509473: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:03.511157: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:03.512805: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.737342: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.739414: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.741220: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.743084: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.744986: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.746639: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.748217: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.749874: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.751576: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.753158: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1831] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13623 MB memory: -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
2023-07-19 08:53:06.753544: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.755104: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1831] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 13623 MB memory: -> device: 1, name: Tesla T4, pci bus id: 0000:00:05.0, compute capability: 7.5
2023-07-19 08:53:06.755482: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.757076: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1831] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 13623 MB memory: -> device: 2, name: Tesla T4, pci bus id: 0000:00:06.0, compute capability: 7.5
2023-07-19 08:53:06.757437: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 08:53:06.759072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1831] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 13623 MB memory: -> device: 3, name: Tesla T4, pci bus id: 0000:00:07.0, compute capability: 7.5
Devices: 1
2023-07-19 08:53:07.299239: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:552] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
2023-07-19 08:53:10.842723: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3a569dd0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-07-19 08:53:10.842919: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2023-07-19 08:53:10.842983: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): Tesla T4, Compute Capability 7.5
2023-07-19 08:53:10.843030: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (2): Tesla T4, Compute Capability 7.5
2023-07-19 08:53:10.843085: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (3): Tesla T4, Compute Capability 7.5
2023-07-19 08:53:10.944601: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-07-19 08:53:11.479846: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:440] Loaded cuDNN version 8600
2023-07-19 08:53:11.729654: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-07-19 08:53:11.981977: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
938/938 [==============================] - 10s 6ms/step - loss: 10.1223 - accuracy: 0.8286
Devices: 2
2023-07-19 08:53:19.515161: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:552] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
2023-07-19 08:53:47.328426: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 2 of 10000
2023-07-19 08:53:52.679706: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 3 of 10000
2023-07-19 08:54:08.411572: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 5 of 10000
2023-07-19 08:54:20.882652: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 6 of 10000
2023-07-19 08:54:34.713750: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 7 of 10000
2023-07-19 08:54:54.214257: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 8 of 10000
2023-07-19 08:55:09.061208: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 9 of 10000
2023-07-19 08:55:14.850822: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 10 of 10000
2023-07-19 08:55:20.150592: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 11 of 10000
2023-07-19 08:55:25.772313: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 12 of 10000
2023-07-19 08:55:31.710363: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 13 of 10000
2023-07-19 08:55:38.318281: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 14 of 10000
2023-07-19 08:55:44.011665: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 15 of 10000
2023-07-19 08:55:54.076764: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 17 of 10000
2023-07-19 08:56:07.181624: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 19 of 10000
2023-07-19 08:56:14.434589: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 20 of 10000
2023-07-19 08:56:29.020860: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 22 of 10000
2023-07-19 08:56:35.686768: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 23 of 10000
2023-07-19 08:56:47.350548: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 25 of 10000
2023-07-19 08:56:53.206492: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 26 of 10000
2023-07-19 08:57:03.189459: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 28 of 10000
2023-07-19 08:57:13.485880: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 30 of 10000
2023-07-19 08:57:24.014117: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 32 of 10000
2023-07-19 08:57:38.654104: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 34 of 10000
2023-07-19 08:57:45.836962: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 35 of 10000
2023-07-19 08:57:53.215866: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 36 of 10000
2023-07-19 08:58:03.464771: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 38 of 10000
2023-07-19 08:58:18.851690: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 39 of 10000
2023-07-19 08:58:37.214024: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 40 of 10000
2023-07-19 08:58:55.288479: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 41 of 10000
2023-07-19 08:59:06.636724: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 42 of 10000
2023-07-19 08:59:21.521155: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 43 of 10000
2023-07-19 08:59:34.989589: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 44 of 10000
2023-07-19 08:59:48.707579: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 45 of 10000
2023-07-19 09:00:05.466870: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 46 of 10000
2023-07-19 09:00:22.107982: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 47 of 10000
2023-07-19 09:00:37.668419: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 48 of 10000
2023-07-19 09:00:51.022583: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 49 of 10000
2023-07-19 09:01:04.442588: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 50 of 10000
2023-07-19 09:01:41.646110: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 51 of 10000
2023-07-19 09:02:27.555847: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 52 of 10000
2023-07-19 09:03:03.864518: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:422] Filling up shuffle buffer (this may take a while): 53 of 10000
When devices=1, the code executes fine for me. When devices>=2, filling the shuffle buffer takes a very long time, which should not be the case. There seems to be a performance problem, but I am not sure whether your reported behaviour is reproducible, since I stopped the run because it was taking too long. The code used is attached as a gist here.
I have also tested the same code snippet attached by @nikita-savelyevv and found that the program hangs when the devices=2 case starts executing. Logs are attached below.
(bazel) suryanarayanay@surya-ubuntu20:~$ python 61314_r2.py
2023-07-19 09:15:25.741329: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:8893] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-07-19 09:15:25.741552: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-07-19 09:15:25.746410: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-07-19 09:15:26.157582: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-19 09:15:27.903770: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING:tensorflow:From /home/suryanarayanay/miniconda3/envs/bazel/lib/python3.9/site-packages/tensorflow/python/ops/distributions/distribution.py:259: ReparameterizationType.__init__ (from tensorflow.python.ops.distributions.distribution) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
WARNING:tensorflow:From /home/suryanarayanay/miniconda3/envs/bazel/lib/python3.9/site-packages/tensorflow/python/ops/distributions/bernoulli.py:165: RegisterKL.__init__ (from tensorflow.python.ops.distributions.kullback_leibler) is deprecated and will be removed after 2019-01-01.
Instructions for updating:
The TensorFlow Distributions library has moved to TensorFlow Probability (https://github.com/tensorflow/probability). You should update all references to use `tfp.distributions` instead of `tf.distributions`.
2023-07-19 09:15:38.960161: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:38.962045: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:38.963767: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:38.965520: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:39.264719: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:39.266613: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:39.268353: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:39.270109: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:39.271916: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:39.273529: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:39.275073: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:39.276628: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:40.428372: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:40.430244: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:40.432022: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:40.433705: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:40.435508: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:40.437049: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:40.438592: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:40.440161: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:40.441733: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:40.443276: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:40.444815: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:40.446383: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.793004: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.794994: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.796831: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.798663: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.800458: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.802054: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.803632: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.805221: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.806810: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.808346: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1831] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13623 MB memory: -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
2023-07-19 09:15:43.808735: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.810249: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1831] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 13623 MB memory: -> device: 1, name: Tesla T4, pci bus id: 0000:00:05.0, compute capability: 7.5
2023-07-19 09:15:43.810625: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.812202: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1831] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 13623 MB memory: -> device: 2, name: Tesla T4, pci bus id: 0000:00:06.0, compute capability: 7.5
2023-07-19 09:15:43.812576: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-19 09:15:43.814108: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1831] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 13623 MB memory: -> device: 3, name: Tesla T4, pci bus id: 0000:00:07.0, compute capability: 7.5
Devices: 1
2023-07-19 09:15:44.326961: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:552] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
2023-07-19 09:15:47.663646: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3a7b33e0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-07-19 09:15:47.663778: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2023-07-19 09:15:47.663798: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): Tesla T4, Compute Capability 7.5
2023-07-19 09:15:47.663810: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (2): Tesla T4, Compute Capability 7.5
2023-07-19 09:15:47.663821: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (3): Tesla T4, Compute Capability 7.5
2023-07-19 09:15:47.778929: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-07-19 09:15:48.326253: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:440] Loaded cuDNN version 8600
2023-07-19 09:15:48.567179: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-07-19 09:15:48.771324: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
157/157 [==============================] - 5s 2ms/step - loss: 26.1179 - accuracy: 0.6900
Devices: 2
2023-07-19 09:15:50.748191: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:552] The `assert_cardinality` transformation is currently not handled by the auto-shard rewrite and will be removed.
##Hangs here
The tested code is attached here as a Colab gist.
Thanks!
@SuryanarayanaY Thanks for reaching out! I used mnist_test intentionally to slightly speed up the reproduction.
I agree with your results. For me, when the order of devices is set to 1, 2, 3, the devices=2 case also hangs as you describe. For the order 1, 3, 2, the devices=2 case produces the error I've attached in the ticket.
Since the machine you ran the code on has 4 GPUs, I would expect that setting the order to something like 1, 4, 3, 2 would also lead to the error I attached.
Anyway, I would assume that these two problems (hanging and throwing an error) are related and may have the same cause.
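To make this order-dependence easy to re-check, here is a hedged pytest sketch; reproducer.py is the hypothetical module name taken from the paths in the logs above, and only the parametrize order differs between the two tests. Each test should be run in its own pytest invocation (e.g. with -k), since both failure modes depend on the process's history.

import pytest

# Import under a non-test name so pytest does not collect the original
# parametrized test a second time from this module.
from reproducer import test_distributed_fit as run_distributed_fit


@pytest.mark.parametrize("devices", [1, 2, 3])
def test_order_1_2_3(devices):
    # With this order, the devices=2 case is reported to hang.
    run_distributed_fit(devices)


@pytest.mark.parametrize("devices", [1, 3, 2])
def test_order_1_3_2(devices):
    # With this order, the devices=2 case is reported to fail with the
    # "is joining a group with size 2, but that group has size 3" error.
    run_distributed_fit(devices)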
Adding to this: distributed training also hangs with tensorflow==2.13.1.
A small Fashion-MNIST example to reproduce it; the jit-compiled model fails to train and hangs:
import tensorflow as tf
from keras import Model
from keras.layers import Dense, Dropout, Flatten, Input
from keras.utils import set_random_seed


def get_model() -> Model:
    set_random_seed(42)

    inp = Input(shape=(28, 28))
    inp = tf.expand_dims(inp, axis=-1)
    flt = Flatten()(inp)
    hdn = Dense(32, activation="relu")(flt)
    drp = Dropout(0.2)(hdn)
    out = Dense(10)(drp)

    model = Model(inputs=inp, outputs=out, name="mnist_model")
    print(model.summary())
    return model


def main():
    strategy = tf.distribute.MirroredStrategy()
    print("Number of devices: {}".format(strategy.num_replicas_in_sync))
    assert (
        strategy.num_replicas_in_sync > 1
    ), "strategy.num_replicas_in_sync must be greater than 1 or else problem will not be shown"

    with strategy.scope():
        model = get_model()
        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=["accuracy"],
            jit_compile=True,  # FIXME: the jit-compiled model fails to train and hangs during fit
        )

    fashion_mnist = tf.keras.datasets.fashion_mnist
    (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
    train_images = train_images / 255.0
    test_images = test_images / 255.0

    model.fit(train_images, train_labels, epochs=5, batch_size=1024)


if __name__ == "__main__":
    main()
2023-07-25 09:31:22.920814: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-25 09:31:22.922395: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13576 MB memory: -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
2023-07-25 09:31:22.923073: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:995] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-07-25 09:31:22.924679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 13576 MB memory: -> device: 1, name: Tesla T4, pci bus id: 0000:00:05.0, compute capability: 7.5
Number of devices: 2
Model: "mnist_model"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) [(None, 28, 28, 1)] 0
flatten (Flatten) (None, 784) 0
dense (Dense) (None, 32) 25120
dropout (Dropout) (None, 32) 0
dense_1 (Dense) (None, 10) 330
=================================================================
Total params: 25450 (99.41 KB)
Trainable params: 25450 (99.41 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
None
Epoch 1/5
2023-07-25 09:31:25.593085: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x296a66b0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-07-25 09:31:25.593171: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2023-07-25 09:31:25.593209: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): Tesla T4, Compute Capability 7.5
2023-07-25 09:31:25.641984: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:255] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2023-07-25 09:31:25.662430: W tensorflow/compiler/tf2xla/kernels/random_ops.cc:57] Warning: Using tf.random.uniform with XLA compilation will ignore seeds; consider using tf.random.stateless_uniform instead if reproducible behavior is desired. mnist_model/dropout/dropout/random_uniform/RandomUniform
2023-07-25 09:31:26.958144: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:432] Loaded cuDNN version 8600
2023-07-25 09:31:26.984048: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-07-25 09:31:27.345626: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:432] Loaded cuDNN version 8600
2023-07-25 09:31:28.252749: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
...... hangs here; no more log lines are produced.
Same issue here! Using 4 GPUs for distributed training on Ubuntu 22.04 with TensorFlow 2.13 hangs at the "Compiled cluster using XLA" line. Solved by downgrading to 2.12.
Same for me with an RTX 8000 and A6000 setup in MirroredStrategy with NCCL and hierarchical cross-device ops (a sketch of this setup follows below). I get a huge block of tensorflow/core/framework/local_rendezvous.cc:405 Local rendezvous recv item cancelled.
before the first epoch, but it doesn't hang for me and I can successfully train the model. I was a bit too excited about 2.13 fixing the "placeholder tensor" warning from 2.12.
Would be nice to get some feedback from the team on whether this is reproducible.
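A minimal sketch of such a setup, assuming the commenter is using the public cross_device_ops options of MirroredStrategy (this is not their exact code):

import tensorflow as tf

# Sketch only: MirroredStrategy with an explicitly chosen all-reduce implementation.
nccl_strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.NcclAllReduce()
)
hierarchical_strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()
)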
Device /job:localhost/replica:0/task:0/device:GPU:1 is joining a group with size 2, but that group has size 3 (group_key=1)
means that when you are running the function with 2 GPUs, the collective op from the previous function call with 3 GPUs might still be pending.
Generally it's not a good idea to create multiple tf.distribute.Strategy instances in sequence in a production job, as they share the same collective keys, which is very likely to cause arbitrary collisions between multiple all-reduces. For this case, try resetting the context at the beginning of each test case. Example: https://github.com/tensorflow/tensorflow/blob/2a7efd891d3b16ef82b462d76fd9e61d111bf901/tensorflow/python/distribute/mirrored_strategy_test.py#L355
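A minimal sketch of what "reset the context at the beginning of each test case" could look like in a pytest suite, assuming the private helper tensorflow.python.eager.context._reset_context() (an internal API that may change between releases; the linked test uses TensorFlow's own test utilities instead):

import pytest
from tensorflow.python.eager import context  # private TensorFlow module


@pytest.fixture(autouse=True)
def reset_tf_context():
    # Drop the existing eager context (and any pending collective state)
    # before each test so every strategy starts from a clean slate.
    context._reset_context()
    yield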
I am facing a similar issue to the one described by @xinyu-dev: Ubuntu 22.04 and TensorFlow 2.13.0, but running from a docker image using gcr.io/deeplearning-platform-release/tf2-gpu.2-13.py310:m111 as a base, on Vertex AI, with 4 x T4 GPUs. I trained with MirroredStrategy, which defaults to NcclAllReduce. The training hangs with 100% GPU and memory utilization. I turned on NCCL_DEBUG=INFO, and here is what I have in my logs:
INFO 2023-09-13T15:46:26.702940522Z [resource.labels.taskName: workerpool0-0] NCCL INFO Bootstrap : Using eth0:10.128.0.57<0>
INFO 2023-09-13T15:46:26.702990425Z [resource.labels.taskName: workerpool0-0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
INFO 2023-09-13T15:46:26.703005755Z [resource.labels.taskName: workerpool0-0] NCCL INFO NET/Plugin: Loaded net plugin FastSocket (v4)
INFO 2023-09-13T15:46:26.703019842Z [resource.labels.taskName: workerpool0-0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
INFO 2023-09-13T15:46:26.703028102Z [resource.labels.taskName: workerpool0-0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin symbol (v4 or v5).
INFO 2023-09-13T15:46:26.703034293Z [resource.labels.taskName: workerpool0-0] NCCL INFO cudaDriverVersion 11040
INFO 2023-09-13T15:46:26.703040462Z [resource.labels.taskName: workerpool0-0] NCCL version 2.13.4+cudaCUDA_MAJOR.CUDA_MINOR
INFO 2023-09-13T15:46:26.893955808Z [resource.labels.taskName: workerpool0-0] NCCL INFO NET/FastSocket : Tx CPU start: -2
INFO 2023-09-13T15:46:26.894000335Z [resource.labels.taskName: workerpool0-0] NCCL INFO NET/FastSocket : Rx CPU start: -2
INFO 2023-09-13T15:46:26.894009348Z [resource.labels.taskName: workerpool0-0] NCCL INFO NET/FastSocket : Flow placement enabled.
INFO 2023-09-13T15:46:26.894015649Z [resource.labels.taskName: workerpool0-0] NCCL INFO NET/FastSocket : queue skip: 0
INFO 2023-09-13T15:46:26.894021171Z [resource.labels.taskName: workerpool0-0] NCCL INFO NET/FastSocket : Using [0]eth0:10.128.0.57<0> [1]veth46fef7a:fe80::ec4e:4ff:fe68:e6aa%veth46fef7a<0>
INFO 2023-09-13T15:46:26.894045883Z [resource.labels.taskName: workerpool0-0] NCCL INFO NET/FastSocket plugin initialized
INFO 2023-09-13T15:46:26.894052051Z [resource.labels.taskName: workerpool0-0] NCCL INFO Using network FastSocket
INFO 2023-09-13T15:46:26.894058304Z [resource.labels.taskName: workerpool0-0] NCCL INFO Using network FastSocket
INFO 2023-09-13T15:46:26.894064036Z [resource.labels.taskName: workerpool0-0] NCCL INFO Using network FastSocket
INFO 2023-09-13T15:46:26.894069715Z [resource.labels.taskName: workerpool0-0] NCCL INFO Using network FastSocket
INFO 2023-09-13T15:46:26.894075386Z [resource.labels.taskName: workerpool0-0] NCCL INFO PXN Disabled as plugin is v4
INFO 2023-09-13T15:46:26.894079843Z [resource.labels.taskName: workerpool0-0] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0
INFO 2023-09-13T15:46:26.894084842Z [resource.labels.taskName: workerpool0-0] NCCL INFO P2P Chunksize set to 131072
INFO 2023-09-13T15:46:26.894089473Z [resource.labels.taskName: workerpool0-0] NCCL INFO Channel 00/02 : 0 1 2 3
INFO 2023-09-13T15:46:26.894094426Z [resource.labels.taskName: workerpool0-0] NCCL INFO Channel 01/02 : 0 1 2 3
INFO 2023-09-13T15:46:26.894101796Z [resource.labels.taskName: workerpool0-0] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1
INFO 2023-09-13T15:46:26.894108081Z [resource.labels.taskName: workerpool0-0] NCCL INFO P2P Chunksize set to 131072
INFO 2023-09-13T15:46:26.894112965Z [resource.labels.taskName: workerpool0-0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
INFO 2023-09-13T15:46:26.894118387Z [resource.labels.taskName: workerpool0-0] NCCL INFO P2P Chunksize set to 131072
INFO 2023-09-13T15:46:26.894124008Z [resource.labels.taskName: workerpool0-0] NCCL INFO Trees [0] -1/-1/-1->3->2 [1] -1/-1/-1->3->2
INFO 2023-09-13T15:46:26.894129976Z [resource.labels.taskName: workerpool0-0] NCCL INFO P2P Chunksize set to 131072
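For reference, NCCL debug output like the block above can be enabled by setting NCCL_DEBUG before NCCL is initialized; a minimal sketch in Python (the same can be done by exporting the variable in the shell):

import os

# Must be set before TensorFlow initializes NCCL (i.e. before the first collective op runs).
os.environ["NCCL_DEBUG"] = "INFO"

import tensorflow as tf  # imported after setting the variable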
Same issue here! Using 4 GPUs for distributed training on Ubuntu 22.04 with TensorFlow 2.13 hangs at the "Compiled cluster using XLA" line. Solved by downgrading to 2.12.
You may refer to my issue in #62234.
Currently, using RING instead of NCCL is a temporary workaround (https://github.com/edwardyehuang/iSeg/blob/master/utils/distribution_utils.py); see the sketch after this comment.
Another workaround (2.13 only at the moment): use conda install -c conda-forge tensorflow-gpu instead of docker or pip.
Besides, if anyone has a tf-nightly GPU wheel from Mar 17 or April 27, please share it with me so I can test whether the pull request https://github.com/tensorflow/tensorflow/pull/60001 or https://github.com/tensorflow/tensorflow/pull/59424 caused this issue.
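A minimal sketch of the RING workaround, assuming it is done through the public communication options of MultiWorkerMirroredStrategy (the linked distribution_utils.py may implement it differently):

import tensorflow as tf

# Select the RING collective implementation instead of NCCL.
options = tf.distribute.experimental.CommunicationOptions(
    implementation=tf.distribute.experimental.CommunicationImplementation.RING
)
strategy = tf.distribute.MultiWorkerMirroredStrategy(communication_options=options)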
Another thing worth attention: why does the third-party (conda-forge) conda build avoid this issue? (The TensorFlow in the docker image is installed directly from pip.)
Same issue here. Has anyone found a solution?
Upgrading the NVIDIA driver to >= 545 should address the issue.
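A quick way to check the currently installed driver version (a sketch that assumes nvidia-smi is on PATH; it is not part of any fix discussed here):

import subprocess

# Print the NVIDIA driver version reported by nvidia-smi.
print(subprocess.run(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
).stdout.strip())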
I got a similar error on an NVIDIA GPU while using the tensorflow-federated and nest_asyncio packages. The error appeared when I updated the tensorflow-federated package from version 0.38.0 to 0.73.0. I tried updating nest_asyncio, but it didn't help, so I just muted the message.
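# Hides INFO-level C++ log messages; must be set before TensorFlow is imported
# ("1" filters INFO, "2" also WARNING, "3" also ERROR).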
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "1"
Check for details.
Similar error. Fixed it by removing the steps_per_epoch argument from model.fit() and model.evaluate():
import sys
import numpy as np
import tensorflow as tf
from matplotlib import pyplot
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Dense, Flatten
from keras.optimizers import SGD
from tensorflow.keras.preprocessing.image import ImageDataGenerator

physical_devices = tf.config.list_physical_devices('GPU')
try:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)
except:
    pass  # Invalid device or cannot modify virtual devices once initialized.

# define cnn model
def define_model():
    model = Sequential()
    model.add(Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same', input_shape=(200, 200, 3)))
    model.add(MaxPooling2D((2, 2)))
    model.add(Flatten())
    model.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(1, activation='sigmoid'))
    # compile model
    opt = SGD(learning_rate=0.001, momentum=0.9)
    model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
    return model

# create data generator
datagen = ImageDataGenerator(rescale=1.0/255.0)
model = define_model()

# prepare iterators
train_it = datagen.flow_from_directory('/workspace/workspace/cats_and_dogs_data/dogs-vs-cats/train/', class_mode='binary', batch_size=64, target_size=(200, 200))
test_it = datagen.flow_from_directory('/workspace/workspace/cats_and_dogs_data/dogs-vs-cats/test1/', class_mode='binary', batch_size=64, target_size=(200, 200))

# fit model
history = model.fit(train_it, validation_data=test_it, epochs=20, verbose=1)

# evaluate model
_, acc = model.evaluate(test_it, verbose=1)
print('> %.3f' % (acc * 100.0))