
OSError: [Errno 9] Bad file descriptor raised on program exit

Open charliermarsh opened this issue 4 years ago • 30 comments

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v2.5.0-rc3-213-ga4dfb8d1a71 2.5.0
  • Python version: Python 3.8.5
  • CUDA/cuDNN version: 11.2 / 8.1.0.77-1
  • GPU model and memory: P100

Describe the current behavior

When using MirroredStrategy as a context manager, Python raises an ignored exception on program exit:

Exception ignored in: <function Pool.__del__ at 0x7f21f942e4c0>
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/root/miniconda3/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/root/miniconda3/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/root/miniconda3/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/root/miniconda3/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

Describe the expected behavior

Python exits without the aforementioned exception. (In my testing, there is no such exception raised on TensorFlow 2.4.0, so this seems new in TensorFlow 2.5.0.)

Contributing

  • Do you want to contribute a PR? (yes/no): No

Standalone code to reproduce the issue

import tensorflow


def f():
    strategy = tensorflow.distribute.MirroredStrategy()
    with strategy.scope():
        tensorflow.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same")(
            tensorflow.keras.layers.Input(shape=(88, 88, 3))
        )


f()

Removing the strategy.scope() causes the program to exit without the ignored exception, as does removing the function definition (i.e., getting rid of def f() and f(), and invoking at the top level).
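
For reference, this is the top-level variant that exits cleanly in my testing (a minimal sketch of the same layers, just without the wrapping function):

import tensorflow

# Same repro as above, but invoked at module level instead of inside f();
# this variant exits without the ignored exception.
strategy = tensorflow.distribute.MirroredStrategy()
with strategy.scope():
    tensorflow.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same")(
        tensorflow.keras.layers.Input(shape=(88, 88, 3))
    )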

charliermarsh avatar Jun 28 '21 15:06 charliermarsh

Can confirm this in TF 2.5.0 from PyPI.

SysuJayce avatar Jun 29 '21 06:06 SysuJayce

@crm416 ,

Can you please try to execute the code in TF v2.5 and let us know if you are facing the same issue? Thanks!

tilakrayal avatar Jun 29 '21 09:06 tilakrayal

@tilakrayal - Yes, this only occurs for me in tf v2.5 (and not in tf v2.3 or tf v2.4).

charliermarsh avatar Jun 29 '21 13:06 charliermarsh

Same issue in tf v2.6. OSError on program exit if strategy.scope() is called within a function.

The following code causes OSError on exit.

import tensorflow as tf

def main():
  strategy = tf.distribute.MirroredStrategy()
  print(f'\nNumber of devices: {strategy.num_replicas_in_sync}\n')
  with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(
      loss=tf.keras.losses.MSE,
      optimizer=tf.keras.optimizers.Adam(),
      metrics=['accuracy']
    )

  print('\nDONE\n')

if __name__ == '__main__':
  main()

with the following output:

2021-08-27 12:00:25.516889: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-27 12:00:32.832857: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-08-27 12:00:32.832944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 9659 MB memory:  -> device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:1c:00.0, compute capability: 7.5
2021-08-27 12:00:32.834864: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-08-27 12:00:32.834898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 9659 MB memory:  -> device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:1d:00.0, compute capability: 7.5

Number of devices: 2

DONE

Exception ignored in: <function Pool.__del__ at 0x7fbecd304040>
Traceback (most recent call last):
  File "/miniconda3/envs/test/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/miniconda3/envs/test/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/miniconda3/envs/test/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/miniconda3/envs/test/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/miniconda3/envs/test/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

Whereas the one below is fine

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print(f'\nNumber of devices: {strategy.num_replicas_in_sync}\n')
with strategy.scope():
  model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
  model.compile(
    loss=tf.keras.losses.MSE,
    optimizer=tf.keras.optimizers.Adam(),
    metrics=['accuracy']
  )

print('\nDONE\n')

Also tested the same code snippet with tf v2.4 and it ran fine in both cases.

bryanlimy avatar Aug 27 '21 11:08 bryanlimy

I see a similar error when running the recommendation model from the models repo on TF 2.5.0 and later, on Python 3.8: https://github.com/tensorflow/models/tree/v2.5.1/official/recommendation

python ncf_keras_main.py --data_dir=./data --dataset=ml-1m
I0830 13:25:38.313174 140736018925024 ncf_keras_main.py:331] Keras evaluation is done.
I0830 13:25:38.313945 140736018925024 ncf_keras_main.py:555] Result is {'loss': 0.3801446557044983, 'eval_loss': 0.0, 'eval_hit_rate': 0.09089403396520465, 'step_timestamp_log': ['BatchTimestamp<batch_index: 0, timestamp: 1630344333.780229>', 'BatchTimestamp<batch_index: 100, timestamp: 1630344337.5320396>'], 'train_finish_time': 1630344337.9485285, 'avg_exp_per_second': 2638725.9874249455}
Exception ignored in: <function Pool.__del__ at 0x7fffa3c7cf70>
Traceback (most recent call last):
  File "/tmp/furmanek/miniconda3/envs/opence-conda-env-py3.8-cuda-openmpi-11.2/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/tmp/furmanek/miniconda3/envs/opence-conda-env-py3.8-cuda-openmpi-11.2/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/tmp/furmanek/miniconda3/envs/opence-conda-env-py3.8-cuda-openmpi-11.2/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/tmp/furmanek/miniconda3/envs/opence-conda-env-py3.8-cuda-openmpi-11.2/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/tmp/furmanek/miniconda3/envs/opence-conda-env-py3.8-cuda-openmpi-11.2/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

jayfurmanek avatar Aug 30 '21 17:08 jayfurmanek

The other interesting thing is that this only happens (for me at least) on Python 3.8 and 3.9. It runs just fine on Python 3.7, so maybe this is a Python bug. Perhaps this one? https://bugs.python.org/issue39995

jayfurmanek avatar Aug 30 '21 20:08 jayfurmanek

I tried changing MirroredStrategy to OneDeviceStrategy and the exception went away. So I'm not sure whether it's an issue caused by a combination of Python and TF problems.
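
A minimal sketch of that swap (the device string "/gpu:0" is just an example, not something from the original report):

import tensorflow as tf

# Pin everything to a single device instead of mirroring across GPUs.
strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])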

npanpaliya avatar Aug 31 '21 12:08 npanpaliya

This happens in TF 2.7 too, with Python 3.9.

I think it's because MirroredStrategy creates a multiprocessing ThreadPool, but doesn't close it before the program ends, so its resources aren't properly cleaned up and it errors on shutdown.

You can explicitly close the pool on exit using:

import atexit

....

strategy = tf.distribute.MirroredStrategy()

atexit.register(strategy._extended._collective_ops._pool.close) # type: ignore

This should prevent the error for now (until there is a fix).
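
For completeness, a self-contained sketch combining a simple repro with this workaround (it reaches into private TF internals, so treat it as a best guess that may break across versions):

import atexit

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Close the strategy's ThreadPool before interpreter shutdown so that
# Pool.__del__ doesn't run after its file descriptors have been torn down.
# The attribute chain below is a TF internal and may differ between versions.
atexit.register(strategy._extended._collective_ops._pool.close)  # type: ignore

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])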

tekumara avatar Dec 19 '21 00:12 tekumara

[Quoting @tekumara's atexit workaround above.]

This works for me, thank you!

Tingbopku avatar Jan 14 '22 05:01 Tingbopku

For me in TF 2.5.0 the problem is hardware-dependent. It is present with a V100, but not with a 2080 Ti.

negvet avatar Feb 04 '22 14:02 negvet

[Quoting @tekumara's atexit workaround above.]

The same issue occurs with the MultiWorkerMirroredStrategy (when using it on one machine as recommended here), on Python 3.9.10 and TF 2.7.

The fix is basically the same as this one, but you have to close two pools:

strategy = tf.distribute.MultiWorkerMirroredStrategy()

atexit.register(strategy._extended._cross_device_ops._pool.close) # type: ignore
atexit.register(strategy._extended._host_cross_device_ops._pool.close) #type: ignore
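
Since the private pool attribute differs between strategies (and some cross-device ops may not expose a pool at all), a more defensive variant, purely a sketch built on the workarounds above and not an official API, would be to register whichever pools actually exist:

import atexit

def close_pools_at_exit(strategy):
    # Best-effort cleanup: register close() for any private ThreadPool the
    # strategy's extended object happens to expose. The attribute names below
    # are guesses taken from this thread; they are TF internals and may not
    # exist in your version.
    extended = strategy.extended
    for attr in ("_collective_ops", "_cross_device_ops", "_host_cross_device_ops"):
        ops = getattr(extended, attr, None)
        pool = getattr(ops, "_pool", None)
        if pool is not None:
            atexit.register(pool.close)

Usage would just be close_pools_at_exit(strategy) right after creating the strategy; the helper name is hypothetical and nothing here is part of the public TF API.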

niklaspechan avatar Mar 17 '22 21:03 niklaspechan

[Quoting @tekumara's and @niklaspechan's atexit workarounds above.]

I use Python 3.8 and TF 2.8, and this problem happens too. So I tried to close the pools, but it doesn't work.

My code:

    from tensorflow.python.distribute.cross_device_ops import AllReduceCrossDeviceOps
    ......
    dist_strategy = tf.distribute.MirroredStrategy(
            devices=["GPU:" + str(x) for x in range(FLAGS.n_gpus)],
            cross_device_ops=AllReduceCrossDeviceOps('nccl', num_packs=FLAGS.n_gpus))

If I use atexit.register(dist_strategy._extended._collective_ops._pool.close), it doesn't work; if I use atexit.register(dist_strategy._extended._cross_device_ops._pool.close), it raises the error: 'AllReduceCrossDeviceOps' object has no attribute '_pool'.

what else can I do...

FengYue95 avatar Mar 31 '22 05:03 FengYue95

I can report this happens with TensorFlow 2.7.0 / Python 3.8 on PowerPC.

@tekumara's solution worked for me as well!

krafczyk avatar Apr 08 '22 21:04 krafczyk

[Quoting @bryanlimy's earlier comment above.]

Thank you, it worked

shivarajkarki avatar Apr 21 '22 14:04 shivarajkarki

Confirmed getting the same with TF 2.9.1 and Keras Tuner, where the strategy is passed into the tuner: self.tuner = kt.tuners.Hyperband( ... distribution_strategy=tf.distribute.MirroredStrategy(), ... )

winginitau avatar Jul 05 '22 21:07 winginitau

I tried changing MirroredStrategy to OneDeviceStrategy and the exception went away. So I'm not sure whether it's an issue caused by a combination of Python and TF problems.

Hi @npanpaliya, can you share where you modified the code to change MirroredStrategy to OneDeviceStrategy? Thanks.

suchunxie avatar Jul 06 '22 02:07 suchunxie

In my case, I had to specify --distribution_strategy=one_device here in my tests: https://github.com/open-ce/tensorflow-feedstock/blob/main/tests/open-ce-tests.yaml#L22

npanpaliya avatar Jul 06 '22 05:07 npanpaliya

In my case, I had to specify --distribution_strategy=one_device here in my tests: https://github.com/open-ce/tensorflow-feedstock/blob/main/tests/open-ce-tests.yaml#L22

@npanpaliya I'm using the TensorFlow Model Garden, and I tried your way of adding the strategy, but that parameter is not allowed in my case.

suchunxie avatar Jul 06 '22 11:07 suchunxie

[Quoting @tekumara's atexit workaround above.]

@tekumara could you please tell me in which script I should add this code? I tried it in the pool.py file but it didn't work.

suchunxie avatar Jul 06 '22 11:07 suchunxie

Which scripts of tensorflow/models are you trying?

npanpaliya avatar Jul 06 '22 14:07 npanpaliya

@npanpaliya I'm training BERT using run_pretraining.py (https://github.com/suchunxie/models/blob/master/official/nlp/bert/run_pretraining.py) and got the Bad file descriptor error. Then, following your post, I changed the python3.8/multiprocessing/pool.py file where the error is raised (see the picture below). [image: error.png] (My environment is Ubuntu + Docker + the NVIDIA TensorFlow container.)

suchunxie avatar Jul 06 '22 14:07 suchunxie

@suchunxie - You can specify the strategy here: https://github.com/suchunxie/models/blob/master/official/nlp/bert/run_pretraining.py#L207. "one_device" is supported: https://github.com/suchunxie/models/blob/65e571fdc903873362e59abe0aeec5c8018da750/official/common/distribute_utils.py#L158.

npanpaliya avatar Jul 06 '22 15:07 npanpaliya

Hi @npanpaliya, it works! I tried this way before and it didn't work, but after you pointed it out I checked again and found that a backslash was missing before I passed --distribution_strategy. Stupid me > <. Thanks greatly for your help!

suchunxie avatar Jul 07 '22 02:07 suchunxie

Hi @suchunxie, This is great! Glad to hear this! :)

npanpaliya avatar Jul 07 '22 03:07 npanpaliya

It seems that a fix has been submitted (https://github.com/tensorflow/tensorflow/issues/56279#issuecomment-1151621844) and users need to wait for the TF 2.10 release.

QuantHao avatar Jul 19 '22 05:07 QuantHao

Hi @charliermarsh, it looks like the issue is resolved with the stable TensorFlow 2.9 release:

>>> import tensorflow
>>> def f():
...    strategy = tensorflow.distribute.MirroredStrategy()
...    with strategy.scope():
...       tensorflow.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same")(
...             tensorflow.keras.layers.Input(shape=(88, 88, 3))
...         )
... 
>>> f()
2022-07-29 10:47:18.249928: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.250980: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.377905: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.378958: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.379854: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.380701: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.384793: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-29 10:47:18.841990: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.842974: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.843762: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.844505: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.845247: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.846005: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.846699: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.847686: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.848526: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.849336: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.850099: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.850823: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13725 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
2022-07-29 10:47:20.854556: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.855362: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 13791 MB memory:  -> device: 1, name: Tesla T4, pci bus id: 0000:00:05.0, compute capability: 7.5
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')

gadagashwini avatar Jul 29 '22 10:07 gadagashwini

Not for me. Using TensorFlow 2.9.1, the exception still shows up when exiting the interpreter:

In [1]: import tensorflow
   ...: def f():
   ...:    strategy = tensorflow.distribute.MirroredStrategy()
   ...:    with strategy.scope():
   ...:       tensorflow.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same")(
   ...:             tensorflow.keras.layers.Input(shape=(88, 88, 3))
   ...:         )
   ...: f()
2022-07-29 12:54:45.169943: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-29 12:54:47.305006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 429 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:17:00.0, compute capability: 7.5
2022-07-29 12:54:47.305948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 9651 MB memory:  -> device: 1, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:18:00.0, compute capability: 7.5
2022-07-29 12:54:47.306459: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 427 MB memory:  -> device: 2, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:65:00.0, compute capability: 7.5
2022-07-29 12:54:47.306939: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 429 MB memory:  -> device: 3, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:b4:00.0, compute capability: 7.5
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')

In [2]:
Do you really want to exit ([y]/n)?
Exception ignored in: <function Pool.__del__ at 0x7ff160d75c10>
Traceback (most recent call last):
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
    self._change_notifier.put(None)
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/queues.py", line 368, in put
    self._writer.send_bytes(obj)
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
    self._send(header + buf)
  File "/home/user/miniconda3/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor

ZJaume avatar Jul 29 '22 12:07 ZJaume

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

google-ml-butler[bot] avatar Aug 05 '22 13:08 google-ml-butler[bot]

Hi @ZJaume, could you share the system configuration? I am not able to replicate the issue. Thank you!

gadagashwini avatar Aug 09 '22 11:08 gadagashwini

Hi, sorry for the inconvenience, but I've now tried with a fresh virtual environment and the error just disappeared, so I think the issue can be closed. The virtual environment that throws the exception has had many different TensorFlow versions installed, from 2.3 to 2.9; maybe some outdated dependency is causing the error.

In case you want to reproduce it, my versions are: TensorFlow 2.9.1, Python 3.8.13, Ubuntu 18.04.

And the output of pip freeze:

absl-py==1.1.0
aiohttp==3.8.1
aiosignal==1.2.0
antlr4-python3-runtime==4.8
astunparse==1.6.3
async-timeout==4.0.1
atomicwrites==1.4.0
attrs==21.2.0
backcall==0.2.0
bitarray==2.3.7
blessed==1.19.0
cachetools==4.2.4
certifi==2021.10.8
cffi==1.15.0
charset-normalizer==2.0.7
clang==5.0
click==8.0.3
colorama==0.4.4
Cython==0.29.24
dataclasses==0.6
datasets==1.16.1
decorator==5.1.0
dill==0.3.4
enlighten==1.10.1
fairseq==0.10.2
fastspell==0.1.5
fasttext==0.9.2
filelock==3.3.2
flatbuffers==1.12
frozenlist==1.2.0
fsspec==2021.11.1
ftfy==6.1.1
fuzzywuzzy==0.18.0
gast==0.4.0
gensim==4.1.2
google-auth==1.35.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
grpcio==1.41.1
h5py==3.1.0
hanzidentifier==1.0.2
huggingface-hub==0.1.0
hunspell==0.5.5
hydra-core==1.1.1
idna==3.3
importlib-resources==5.4.0
ipython==7.29.0
jedi==0.18.0
joblib==0.14.1
keras==2.9.0
Keras-Preprocessing==1.1.2
latexcodec==2.0.1
libclang==13.0.0
Markdown==3.3.4
matplotlib-inline==0.1.3
monocleaner==1.0
more-itertools==8.10.0
mtdata==0.3.1
multidict==5.2.0
multiprocess==0.70.12.2
nltk==3.6.5
numpy==1.23.0
oauthlib==3.1.1
omegaconf==2.1.1
opt-einsum==3.3.0
packaging==21.2
pandas==1.3.5
parso==0.8.2
pexpect==4.8.0
pickleshare==0.7.5
Pillow==8.4.0
pluggy==0.13.1
portalocker==2.3.0
prefixed==0.3.2
prompt-toolkit==3.0.22
protobuf==3.19.1
psutil==5.8.0
ptyprocess==0.7.0
py==1.10.0
pyarrow==6.0.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.8.1
pybtex==0.24.0
pycld2==0.31
pycparser==2.21
Pygments==2.10.0
pyparsing==2.4.7
pypinyin==0.46.0
pytest==5.1.2
python-dateutil==2.8.2
python-Levenshtein==0.12.2
pytz==2021.3
PyYAML==5.4.1
regex==2022.3.2
requests==2.26.0
requests-oauthlib==1.3.0
rsa==4.7.2
ruamel.yaml==0.17.17
ruamel.yaml.clib==0.2.6
sacrebleu==2.1.0
sacremoses==0.0.43
scikit-learn==0.22.1
scipy==1.4.1
sentence-transformers==2.1.0
sentencepiece==0.1.94
six==1.15.0
smart-open==5.2.1
tabulate==0.8.9
tensorboard==2.9.1
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.0
tensorflow==2.9.1
tensorflow-estimator==2.9.0
tensorflow-io-gcs-filesystem==0.24.0
termcolor==1.1.0
tf-estimator-nightly==2.8.0.dev2021122109
threadpoolctl==3.0.0
tokenizers==0.12.1
toolwrapper==0.4.1
torch==1.10.1
torch-train==0.0.3
torchsummary==1.5.1
torchvision==0.11.2
tqdm==4.62.3
traitlets==5.1.1
transformers==4.20.1
typing-extensions==3.7.4.3
Unidecode==1.2.0
urllib3==1.26.7
wcwidth==0.2.5
Werkzeug==2.0.2
wrapt==1.12.1
xxhash==2.0.2
yarl==1.7.2
zhon==1.1.5
zipp==3.7.0

ZJaume avatar Aug 09 '22 13:08 ZJaume

Closing as stale. Please reopen if you'd like to work on this further.

google-ml-butler[bot] avatar Aug 16 '22 13:08 google-ml-butler[bot]
