
OSError: [Errno 9] Bad file descriptor raised on program exit

Open charliermarsh opened this issue 4 years ago • 30 comments

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v2.5.0-rc3-213-ga4dfb8d1a71 2.5.0
  • Python version: Python 3.8.5
  • CUDA/cuDNN version: 11.2 / 8.1.0.77-1
  • GPU model and memory: P100

Describe the current behavior

When using MirroredStrategy as a context manager, Python raises an ignored exception on program exit:

Exception ignored in: <function Pool.__del__ at 0x7f21f942e4c0>
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/root/miniconda3/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/root/miniconda3/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/root/miniconda3/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/root/miniconda3/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

Describe the expected behavior

Python exits without the aforementioned exception. (In my testing, there is no such exception raised on TensorFlow 2.4.0, so this seems new in TensorFlow 2.5.0.)

Contributing

  • Do you want to contribute a PR? (yes/no): No

Standalone code to reproduce the issue

import tensorflow


def f():
    strategy = tensorflow.distribute.MirroredStrategy()
    with strategy.scope():
        tensorflow.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same")(
            tensorflow.keras.layers.Input(shape=(88, 88, 3))
        )


f()

Removing the strategy.scope() causes the program to exit without the ignored exception, as does removing the function definition (i.e., getting rid of def f() and f(), and invoking at the top level).
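
For reference, this is the top-level variant that exits cleanly in my testing (a minimal sketch of the same layers, just without the wrapping function):

import tensorflow

# Same repro as above, but invoked at module level instead of inside f();
# this variant exits without the ignored exception.
strategy = tensorflow.distribute.MirroredStrategy()
with strategy.scope():
    tensorflow.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same")(
        tensorflow.keras.layers.Input(shape=(88, 88, 3))
    )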

charliermarsh avatar Jun 28 '21 15:06 charliermarsh

Can confirm this in TF 2.5.0 from PyPI.

SysuJayce avatar Jun 29 '21 06:06 SysuJayce

@crm416 ,

Can you please try to execute the code in TF v2.5 and let us know if you are facing the same issue? Thanks!

tilakrayal avatar Jun 29 '21 09:06 tilakrayal

@tilakrayal - Yes, this only occurs for me in tf v2.5 (and not in tf v2.3 or tf v2.4).

charliermarsh avatar Jun 29 '21 13:06 charliermarsh

Same issue in tf v2.6. OSError on program exit if strategy.scope() is called within a function.

The following code causes OSError on exit.

import tensorflow as tf

def main():
  strategy = tf.distribute.MirroredStrategy()
  print(f'\nNumber of devices: {strategy.num_replicas_in_sync}\n')
  with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(
      loss=tf.keras.losses.MSE,
      optimizer=tf.keras.optimizers.Adam(),
      metrics=['accuracy']
    )

  print('\nDONE\n')

if __name__ == '__main__':
  main()

with the following output:

2021-08-27 12:00:25.516889: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-27 12:00:32.832857: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-08-27 12:00:32.832944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 9659 MB memory:  -> device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:1c:00.0, compute capability: 7.5
2021-08-27 12:00:32.834864: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-08-27 12:00:32.834898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 9659 MB memory:  -> device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:1d:00.0, compute capability: 7.5

Number of devices: 2

DONE

Exception ignored in: <function Pool.__del__ at 0x7fbecd304040>
Traceback (most recent call last):
  File "/miniconda3/envs/test/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/miniconda3/envs/test/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/miniconda3/envs/test/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/miniconda3/envs/test/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/miniconda3/envs/test/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

Whereas the one below is fine

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print(f'\nNumber of devices: {strategy.num_replicas_in_sync}\n')
with strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
  model.compile(
    loss=tf.keras.losses.MSE,
    optimizer=tf.keras.optimizers.Adam(),
    metrics=['accuracy']
  )

print('\nDONE\n')

Also tested the same code snippet with tf v2.4 and it ran fine in both cases.

bryanlimy avatar Aug 27 '21 11:08 bryanlimy

I see a similar error when running the recommendation model from the models repo on TF 2.5.0 and later, on Python 3.8: https://github.com/tensorflow/models/tree/v2.5.1/official/recommendation

python ncf_keras_main.py --data_dir=./data --dataset=ml-1m
I0830 13:25:38.313174 140736018925024 ncf_keras_main.py:331] Keras evaluation is done.
I0830 13:25:38.313945 140736018925024 ncf_keras_main.py:555] Result is {'loss': 0.3801446557044983, 'eval_loss': 0.0, 'eval_hit_rate': 0.09089403396520465, 'step_timestamp_log': ['BatchTimestamp<batch_index: 0, timestamp: 1630344333.780229>', 'BatchTimestamp<batch_index: 100, timestamp: 1630344337.5320396>'], 'train_finish_time': 1630344337.9485285, 'avg_exp_per_second': 2638725.9874249455}
Exception ignored in: <function Pool.__del__ at 0x7fffa3c7cf70>
Traceback (most recent call last):
  File "/tmp/furmanek/miniconda3/envs/opence-conda-env-py3.8-cuda-openmpi-11.2/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/tmp/furmanek/miniconda3/envs/opence-conda-env-py3.8-cuda-openmpi-11.2/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/tmp/furmanek/miniconda3/envs/opence-conda-env-py3.8-cuda-openmpi-11.2/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/tmp/furmanek/miniconda3/envs/opence-conda-env-py3.8-cuda-openmpi-11.2/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/tmp/furmanek/miniconda3/envs/opence-conda-env-py3.8-cuda-openmpi-11.2/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

jayfurmanek avatar Aug 30 '21 17:08 jayfurmanek

The other interesting thing is that this only happens (for me at least) on Python 3.8 and 3.9. It runs just fine on Python 3.7, so maybe this is a Python bug. Perhaps this one? https://bugs.python.org/issue39995

jayfurmanek avatar Aug 30 '21 20:08 jayfurmanek

I tried changing MirroredStrategy to OneDeviceStrategy and the exception went away. So I'm not sure whether it's an issue caused by a combination of Python and TF problems.
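
A minimal sketch of that swap (the device string "/gpu:0" is just an example, not something from the original report):

import tensorflow as tf

# Pin everything to a single device instead of mirroring across GPUs.
strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])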

npanpaliya avatar Aug 31 '21 12:08 npanpaliya

This happens in TF 2.7 too, with Python 3.9.

I think it's because MirroredStrategy creates a multiprocessing ThreadPool, but doesn't close it before the program ends, so its resources aren't properly cleaned up and it errors on shutdown.

You can explicitly close the pool on exit using:

import atexit

....

strategy = tf.distribute.MirroredStrategy()

atexit.register(strategy._extended._collective_ops._pool.close) # type: ignore

This should prevent the error for now (until there is a fix).
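
For completeness, a self-contained sketch combining a simple repro with this workaround (it reaches into private TF internals, so treat it as a best guess that may break across versions):

import atexit

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Close the strategy's ThreadPool before interpreter shutdown so that
# Pool.__del__ doesn't run after its file descriptors have been torn down.
# The attribute chain below is a TF internal and may differ between versions.
atexit.register(strategy._extended._collective_ops._pool.close)  # type: ignore

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])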

tekumara avatar Dec 19 '21 00:12 tekumara

[Quoting @tekumara's atexit workaround above.]

This works for me, thank you!

Tingbopku avatar Jan 14 '22 05:01 Tingbopku

For me in TF 2.5.0 the problem is hardware-dependent. It is present with a V100, but not with a 2080 Ti.

negvet avatar Feb 04 '22 14:02 negvet

[Quoting @tekumara's atexit workaround above.]

The same issue occurs with the MultiWorkerMirroredStrategy (when using it on one machine as recommended here), on Python 3.9.10 and TF 2.7.

The fix is basically the same as this one, but you have to close two pools:

strategy = tf.distribute.MultiWorkerMirroredStrategy()

atexit.register(strategy._extended._cross_device_ops._pool.close) # type: ignore
atexit.register(strategy._extended._host_cross_device_ops._pool.close) #type: ignore
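
Since the private pool attribute differs between strategies (and some cross-device ops may not expose a pool at all), a more defensive variant, purely a sketch built on the workarounds above and not an official API, would be to register whichever pools actually exist:

import atexit

def close_pools_at_exit(strategy):
    # Best-effort cleanup: register close() for any private ThreadPool the
    # strategy's extended object happens to expose. The attribute names below
    # are guesses taken from this thread; they are TF internals and may not
    # exist in your version.
    extended = strategy.extended
    for attr in ("_collective_ops", "_cross_device_ops", "_host_cross_device_ops"):
        ops = getattr(extended, attr, None)
        pool = getattr(ops, "_pool", None)
        if pool is not None:
            atexit.register(pool.close)

Usage would just be close_pools_at_exit(strategy) right after creating the strategy; the helper name is hypothetical and nothing here is part of the public TF API.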

niklaspechan avatar Mar 17 '22 21:03 niklaspechan

[Quoting @tekumara's and @niklaspechan's atexit workarounds above.]

I use Python 3.8 and TF 2.8, and this problem happens too. So I tried to close the pools, but it doesn't work.

My code:

    from tensorflow.python.distribute.cross_device_ops import AllReduceCrossDeviceOps
    ......
    dist_strategy = tf.distribute.MirroredStrategy(
            devices=["GPU:" + str(x) for x in range(FLAGS.n_gpus)],
            cross_device_ops=AllReduceCrossDeviceOps('nccl', num_packs=FLAGS.n_gpus))

If I use atexit.register(dist_strategy._extended._collective_ops._pool.close), it doesn't work; if I use atexit.register(dist_strategy._extended._cross_device_ops._pool.close), it raises the error: 'AllReduceCrossDeviceOps' object has no attribute '_pool'.

what else can I do...

FengYue95 avatar Mar 31 '22 05:03 FengYue95

I can report this happens with TensorFlow 2.7.0 / Python 3.8 on PowerPC.

@tekumara's solution worked for me as well!

krafczyk avatar Apr 08 '22 21:04 krafczyk

[Quoting @bryanlimy's earlier comment above.]

Thank you, it worked

shivarajkarki avatar Apr 21 '22 14:04 shivarajkarki

Confirmed getting the same with TF 2.9.1 and Keras Tuner, where the strategy is passed into the tuner: self.tuner = kt.tuners.Hyperband( ... distribution_strategy=tf.distribute.MirroredStrategy(), ... )

winginitau avatar Jul 05 '22 21:07 winginitau

I tried changing MirroredStrategy to OneDeviceStrategy and the exception went away. So I'm not sure whether it's an issue caused by a combination of Python and TF problems.

Hi @npanpaliya, can you share where you modified the code to change MirroredStrategy to OneDeviceStrategy? Thanks.

suchunxie avatar Jul 06 '22 02:07 suchunxie

In my case, I had to specify --distribution_strategy=one_device here in my tests: https://github.com/open-ce/tensorflow-feedstock/blob/main/tests/open-ce-tests.yaml#L22

npanpaliya avatar Jul 06 '22 05:07 npanpaliya

In my case, I had to specify --distribution_strategy=one_device here in my tests: https://github.com/open-ce/tensorflow-feedstock/blob/main/tests/open-ce-tests.yaml#L22

@npanpaliya I'm using the TensorFlow Model Garden, and I tried your way of adding the strategy, but that parameter is not allowed in my case.

suchunxie avatar Jul 06 '22 11:07 suchunxie

[Quoting @tekumara's atexit workaround above.]

@tekumara could you please tell me in which script I should add this code? I tried it in the pool.py file but it didn't work.

suchunxie avatar Jul 06 '22 11:07 suchunxie

Which scripts of tensorflow/models are you trying?

npanpaliya avatar Jul 06 '22 14:07 npanpaliya

@npanpaliya I'm training BERT using run_pretraining.py (https://github.com/suchunxie/models/blob/master/official/nlp/bert/run_pretraining.py) and got the Bad file descriptor error. Then, following your post, I changed the python3.8/multiprocessing/pool.py file where the error is raised (see the picture below). [image: error.png] (My environment is Ubuntu + Docker + the NVIDIA TensorFlow container.)

suchunxie avatar Jul 06 '22 14:07 suchunxie

@suchunxie - You can specify the strategy here: https://github.com/suchunxie/models/blob/master/official/nlp/bert/run_pretraining.py#L207. "one_device" is supported: https://github.com/suchunxie/models/blob/65e571fdc903873362e59abe0aeec5c8018da750/official/common/distribute_utils.py#L158.

npanpaliya avatar Jul 06 '22 15:07 npanpaliya

Hi @npanpaliya, it works! I tried this way before and it didn't work, but after you pointed it out I checked again and found that a backslash was missing before I passed --distribution_strategy. Stupid me > <. Thanks greatly for your help!

suchunxie avatar Jul 07 '22 02:07 suchunxie

Hi @suchunxie, This is great! Glad to hear this! :)

npanpaliya avatar Jul 07 '22 03:07 npanpaliya

It seems that a fix has been submitted (https://github.com/tensorflow/tensorflow/issues/56279#issuecomment-1151621844) and users need to wait for the TF 2.10 release.

QuantHao avatar Jul 19 '22 05:07 QuantHao

Hi @charliermarsh, it looks like the issue is resolved with the stable TensorFlow 2.9 release:

>>> import tensorflow
>>> def f():
...    strategy = tensorflow.distribute.MirroredStrategy()
...    with strategy.scope():
...       tensorflow.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same")(
...             tensorflow.keras.layers.Input(shape=(88, 88, 3))
...         )
... 
>>> f()
2022-07-29 10:47:18.249928: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.250980: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.377905: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.378958: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.379854: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.380701: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.384793: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-29 10:47:18.841990: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.842974: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.843762: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.844505: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.845247: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.846005: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.846699: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.847686: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.848526: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.849336: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.850099: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.850823: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13725 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
2022-07-29 10:47:20.854556: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.855362: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 13791 MB memory:  -> device: 1, name: Tesla T4, pci bus id: 0000:00:05.0, compute capability: 7.5
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')

gadagashwini avatar Jul 29 '22 10:07 gadagashwini

Not for me. Using TensorFlow 2.9.1, the exception still shows up when exiting the interpreter:

In [1]: import tensorflow
   ...: def f():
   ...:    strategy = tensorflow.distribute.MirroredStrategy()
   ...:    with strategy.scope():
   ...:       tensorflow.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same")(
   ...:             tensorflow.keras.layers.Input(shape=(88, 88, 3))
   ...:         )
   ...: f()
2022-07-29 12:54:45.169943: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-29 12:54:47.305006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 429 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:17:00.0, compute capability: 7.5
2022-07-29 12:54:47.305948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 9651 MB memory:  -> device: 1, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:18:00.0, compute capability: 7.5
2022-07-29 12:54:47.306459: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 427 MB memory:  -> device: 2, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:65:00.0, compute capability: 7.5
2022-07-29 12:54:47.306939: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 429 MB memory:  -> device: 3, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:b4:00.0, compute capability: 7.5
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')

In [2]:
Do you really want to exit ([y]/n)?
Exception ignored in: <function Pool.__del__ at 0x7ff160d75c10>
Traceback (most recent call last):
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

ZJaume avatar Jul 29 '22 12:07 ZJaume

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

google-ml-butler[bot] avatar Aug 05 '22 13:08 google-ml-butler[bot]

Hi @ZJaume, could you share the system configuration? I am not able to replicate the issue. Thank you!

gadagashwini avatar Aug 09 '22 11:08 gadagashwini

Hi, sorry for the inconvenience, but I've now tried with a fresh virtual environment and the error just disappeared, so I think the issue can be closed. The virtual environment that throws the exception has had many different TensorFlow versions installed, from 2.3 to 2.9; maybe some outdated dependency is causing the error.

In case you want to reproduce it, my versions are: TensorFlow 2.9.1, Python 3.8.13, Ubuntu 18.04.

And the output of pip freeze:

absl-py==1.1.0
aiohttp==3.8.1
aiosignal==1.2.0
antlr4-python3-runtime==4.8
astunparse==1.6.3
async-timeout==4.0.1
atomicwrites==1.4.0
attrs==21.2.0
backcall==0.2.0
bitarray==2.3.7
blessed==1.19.0
cachetools==4.2.4
certifi==2021.10.8
cffi==1.15.0
charset-normalizer==2.0.7
clang==5.0
click==8.0.3
colorama==0.4.4
Cython==0.29.24
dataclasses==0.6
datasets==1.16.1
decorator==5.1.0
dill==0.3.4
enlighten==1.10.1
fairseq==0.10.2
fastspell==0.1.5
fasttext==0.9.2
filelock==3.3.2
flatbuffers==1.12
frozenlist==1.2.0
fsspec==2021.11.1
ftfy==6.1.1
fuzzywuzzy==0.18.0
gast==0.4.0
gensim==4.1.2
google-auth==1.35.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
grpcio==1.41.1
h5py==3.1.0
hanzidentifier==1.0.2
huggingface-hub==0.1.0
hunspell==0.5.5
hydra-core==1.1.1
idna==3.3
importlib-resources==5.4.0
ipython==7.29.0
jedi==0.18.0
joblib==0.14.1
keras==2.9.0
Keras-Preprocessing==1.1.2
latexcodec==2.0.1
libclang==13.0.0
Markdown==3.3.4
matplotlib-inline==0.1.3
monocleaner==1.0
more-itertools==8.10.0
mtdata==0.3.1
multidict==5.2.0
multiprocess==0.70.12.2
nltk==3.6.5
numpy==1.23.0
oauthlib==3.1.1
omegaconf==2.1.1
opt-einsum==3.3.0
packaging==21.2
pandas==1.3.5
parso==0.8.2
pexpect==4.8.0
pickleshare==0.7.5
Pillow==8.4.0
pluggy==0.13.1
portalocker==2.3.0
prefixed==0.3.2
prompt-toolkit==3.0.22
protobuf==3.19.1
psutil==5.8.0
ptyprocess==0.7.0
py==1.10.0
pyarrow==6.0.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.8.1
pybtex==0.24.0
pycld2==0.31
pycparser==2.21
Pygments==2.10.0
pyparsing==2.4.7
pypinyin==0.46.0
pytest==5.1.2
python-dateutil==2.8.2
python-Levenshtein==0.12.2
pytz==2021.3
PyYAML==5.4.1
regex==2022.3.2
requests==2.26.0
requests-oauthlib==1.3.0
rsa==4.7.2
ruamel.yaml==0.17.17
ruamel.yaml.clib==0.2.6
sacrebleu==2.1.0
sacremoses==0.0.43
scikit-learn==0.22.1
scipy==1.4.1
sentence-transformers==2.1.0
sentencepiece==0.1.94
six==1.15.0
smart-open==5.2.1
tabulate==0.8.9
tensorboard==2.9.1
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.0
tensorflow==2.9.1
tensorflow-estimator==2.9.0
tensorflow-io-gcs-filesystem==0.24.0
termcolor==1.1.0
tf-estimator-nightly==2.8.0.dev2021122109
threadpoolctl==3.0.0
tokenizers==0.12.1
toolwrapper==0.4.1
torch==1.10.1
torch-train==0.0.3
torchsummary==1.5.1
torchvision==0.11.2
tqdm==4.62.3
traitlets==5.1.1
transformers==4.20.1
typing-extensions==3.7.4.3
Unidecode==1.2.0
urllib3==1.26.7
wcwidth==0.2.5
Werkzeug==2.0.2
wrapt==1.12.1
xxhash==2.0.2
yarl==1.7.2
zhon==1.1.5
zipp==3.7.0

ZJaume avatar Aug 09 '22 13:08 ZJaume

Closing as stale. Please reopen if you'd like to work on this further.

google-ml-butler[bot] avatar Aug 16 '22 13:08 google-ml-butler[bot]
