OSError: [Errno 9] Bad file descriptor raised on program exit
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04
- TensorFlow installed from (source or binary): binary
- TensorFlow version (use command below): v2.5.0-rc3-213-ga4dfb8d1a71 2.5.0
- Python version: 3.8.5
- CUDA/cuDNN version: 11.2 / 8.1.0.77-1
- GPU model and memory: P100
Describe the current behavior
When using MirroredStrategy as a context manager, Python raises an ignored exception on program exit:
Exception ignored in: <function Pool.__del__ at 0x7f21f942e4c0>
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
self._change_notifier.put(None)
File "/root/miniconda3/lib/python3.8/multiprocessing/queues.py", line 368, in put
self._writer.send_bytes(obj)
File "/root/miniconda3/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/root/miniconda3/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
self._send(header + buf)
File "/root/miniconda3/lib/python3.8/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor
Describe the expected behavior
Python exits without the aforementioned exception. (In my testing, there is no such exception raised on TensorFlow 2.4.0, so this seems new in TensorFlow 2.5.0.)
- Do you want to contribute a PR? (yes/no): No
Standalone code to reproduce the issue
import tensorflow

def f():
    strategy = tensorflow.distribute.MirroredStrategy()
    with strategy.scope():
        tensorflow.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same")(
            tensorflow.keras.layers.Input(shape=(88, 88, 3))
        )

f()
Removing the strategy.scope() causes the program to exit without the ignored exception, as does removing the function definition (i.e., getting rid of def f() and f(), and invoking at the top level).
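For reference, a minimal sketch of the top-level variant described above (same layer construction, no wrapping function), which exits cleanly in my testing:

import tensorflow

# Same code, but invoked at module level rather than inside a function;
# this variant exits without the ignored exception.
strategy = tensorflow.distribute.MirroredStrategy()
with strategy.scope():
    tensorflow.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same")(
        tensorflow.keras.layers.Input(shape=(88, 88, 3))
    )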
Can confirm this in TF 2.5.0 from PyPI.
@crm416 ,
Can you please try to execute the code in TF v2.5 and let us know if you are facing the same issue? Thanks!
@tilakrayal - Yes, this only occurs for me in tf v2.5 (and not in tf v2.3 or tf v2.4).
Same issue in tf v2.6. OSError on program exit if strategy.scope() is called within a function.
The following code causes OSError on exit.
import tensorflow as tf

def main():
    strategy = tf.distribute.MirroredStrategy()
    print(f'\nNumber of devices: {strategy.num_replicas_in_sync}\n')
    with strategy.scope():
        model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
        model.compile(
            loss=tf.keras.losses.MSE,
            optimizer=tf.keras.optimizers.Adam(),
            metrics=['accuracy']
        )
    print('\nDONE\n')

if __name__ == '__main__':
    main()
with the following output:
2021-08-27 12:00:25.516889: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-27 12:00:32.832857: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-08-27 12:00:32.832944: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 9659 MB memory: -> device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:1c:00.0, compute capability: 7.5
2021-08-27 12:00:32.834864: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2021-08-27 12:00:32.834898: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 9659 MB memory: -> device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:1d:00.0, compute capability: 7.5
Number of devices: 2
DONE
Exception ignored in: <function Pool.__del__ at 0x7fbecd304040>
Traceback (most recent call last):
File "/miniconda3/envs/test/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
self._change_notifier.put(None)
File "/miniconda3/envs/test/lib/python3.8/multiprocessing/queues.py", line 368, in put
self._writer.send_bytes(obj)
File "/miniconda3/envs/test/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/miniconda3/envs/test/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
self._send(header + buf)
File "/miniconda3/envs/test/lib/python3.8/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor
Whereas the one below is fine
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print(f'\nNumber of devices: {strategy.num_replicas_in_sync}\n')
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(
        loss=tf.keras.losses.MSE,
        optimizer=tf.keras.optimizers.Adam(),
        metrics=['accuracy']
    )
print('\nDONE\n')
Also tested the same code snippet with tf v2.4 and it ran fine in both cases.
I see a similar error when running the recommendation model from the models repo on TF 2.5.0 and later, on Python 3.8: https://github.com/tensorflow/models/tree/v2.5.1/official/recommendation
python ncf_keras_main.py --data_dir=./data --dataset=ml-1m
I0830 13:25:38.313174 140736018925024 ncf_keras_main.py:331] Keras evaluation is done.
I0830 13:25:38.313945 140736018925024 ncf_keras_main.py:555] Result is {'loss': 0.3801446557044983, 'eval_loss': 0.0, 'eval_hit_rate': 0.09089403396520465, 'step_timestamp_log': ['BatchTimestamp<batch_index: 0, timestamp: 1630344333.780229>', 'BatchTimestamp<batch_index: 100, timestamp: 1630344337.5320396>'], 'train_finish_time': 1630344337.9485285, 'avg_exp_per_second': 2638725.9874249455}
Exception ignored in: <function Pool.__del__ at 0x7fffa3c7cf70>
Traceback (most recent call last):
File "/tmp/furmanek/miniconda3/envs/opence-conda-env-py3.8-cuda-openmpi-11.2/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
self._change_notifier.put(None)
File "/tmp/furmanek/miniconda3/envs/opence-conda-env-py3.8-cuda-openmpi-11.2/lib/python3.8/multiprocessing/queues.py", line 368, in put
self._writer.send_bytes(obj)
File "/tmp/furmanek/miniconda3/envs/opence-conda-env-py3.8-cuda-openmpi-11.2/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/tmp/furmanek/miniconda3/envs/opence-conda-env-py3.8-cuda-openmpi-11.2/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
self._send(header + buf)
File "/tmp/furmanek/miniconda3/envs/opence-conda-env-py3.8-cuda-openmpi-11.2/lib/python3.8/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor
The other interesting thing is that this only happens (for me at least) on Python 3.8 and 3.9. It runs just fine on Python 3.7, so maybe this is a Python bug. Perhaps this one? https://bugs.python.org/issue39995
I tried changing MirroredStrategy to OneDeviceStrategy and the exception went away (see the sketch below), so I'm not sure whether this is caused by a combination of Python and TF problems.
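For anyone who wants to try the same thing, a minimal sketch of the OneDeviceStrategy substitution; the device string is an assumption, adjust it to your setup:

import tensorflow as tf

# Workaround sketch: OneDeviceStrategy instead of MirroredStrategy.
# "/gpu:0" is an assumption; use whichever single device you actually want.
strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])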
This happens in TF 2.7 too, with Python 3.9.
I think it's because MirroredStrategy creates a multiprocessing ThreadPool, but doesn't close it before the program ends, so its resources aren't properly cleaned up and it errors on shutdown.
You can explicitly close the pool on exit using:
import atexit
....
strategy = tf.distribute.MirroredStrategy()
atexit.register(strategy._extended._collective_ops._pool.close) # type: ignore
This should prevent the error for now (until there is a fix).
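Put together with the repro above, a self-contained sketch of this workaround; note that _extended._collective_ops._pool is a private attribute and may change between releases:

import atexit

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Close MirroredStrategy's internal multiprocessing pool on interpreter exit,
# so its file descriptors are released before Pool.__del__ runs.
# Note: _extended._collective_ops._pool is private API and may change.
atexit.register(strategy._extended._collective_ops._pool.close)  # type: ignore

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(loss=tf.keras.losses.MSE, optimizer=tf.keras.optimizers.Adam())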
The atexit workaround works for me, thank you!
For me, in TF 2.5.0 the problem is hardware-dependent: it is present with a V100, but not with a 2080 Ti.
The same issue occurs with MultiWorkerMirroredStrategy (when using it on one machine as recommended here), on Python 3.9.10 and TF 2.7.
The fix is basically the same as this one, but you have to close two pools:
strategy = tf.distribute.MultiWorkerMirroredStrategy()
atexit.register(strategy._extended._cross_device_ops._pool.close) # type: ignore
atexit.register(strategy._extended._host_cross_device_ops._pool.close) #type: ignore
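For completeness, a self-contained version of the two-pool variant (same private-attribute caveat as above):

import atexit

import tensorflow as tf

strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Both internal cross-device-ops objects own a pool; close each one at exit.
# These are private attributes and may change between TF releases.
atexit.register(strategy._extended._cross_device_ops._pool.close)  # type: ignore
atexit.register(strategy._extended._host_cross_device_ops._pool.close)  # type: ignore

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])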
I use Python 3.8 and TF 2.8, and this problem happens too. So I tried to close the pools, but it doesn't work.
My code:
from tensorflow.python.distribute.cross_device_ops import AllReduceCrossDeviceOps
......
dist_strategy = tf.distribute.MirroredStrategy(
    devices=["GPU:" + str(x) for x in range(FLAGS.n_gpus)],
    cross_device_ops=AllReduceCrossDeviceOps('nccl', num_packs=FLAGS.n_gpus))
If I use atexit.register(dist_strategy._extended._collective_ops._pool.close), it doesn't work;
if I use atexit.register(dist_strategy._extended._cross_device_ops._pool.close), it raises the error: 'AllReduceCrossDeviceOps' object has no attribute '_pool'.
What else can I do?
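Not an authoritative fix, but one thing to try is probing the strategy's internals for whichever pools actually exist and only registering those. A defensive sketch, assuming these private attribute names (which vary between TF versions and cross-device-ops implementations):

import atexit

def close_strategy_pools_at_exit(strategy):
    """Best-effort workaround sketch: register close() for any multiprocessing
    pools exposed by the strategy's private internals. The attribute names
    below are implementation details and may not all exist in every TF
    version or for every cross_device_ops implementation."""
    extended = strategy.extended
    for attr in ("_collective_ops", "_cross_device_ops", "_host_cross_device_ops"):
        ops = getattr(extended, attr, None)
        pool = getattr(ops, "_pool", None)
        if pool is not None:
            atexit.register(pool.close)

# Usage with the snippet above:
# close_strategy_pools_at_exit(dist_strategy)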
I can report that this happens with TensorFlow 2.7.0 / Python 3.8 on PowerPC.
The solution of @tekumara worked for me as well!
Thank you, it worked
Confirmed getting the same with TF 2.9.1 and Keras Tuner, where the strategy is passed into the tuner:
self.tuner = kt.tuners.Hyperband(
    ...
    distribution_strategy=tf.distribute.MirroredStrategy(),
    ...
)
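In case it helps, a sketch of applying the atexit workaround from above before handing the strategy to Keras Tuner; the Hyperband arguments shown here (hypermodel, objective, max_epochs) are placeholders for whatever your tuner already uses:

import atexit

import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    # Placeholder hypermodel; replace with your own model-building function.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(hp.Int("units", min_value=8, max_value=64), activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["accuracy"])
    return model

# Create the strategy up front so its internal pool can be closed at exit,
# then pass the same object to the tuner instead of constructing it inline.
strategy = tf.distribute.MirroredStrategy()
atexit.register(strategy._extended._collective_ops._pool.close)  # type: ignore

tuner = kt.tuners.Hyperband(
    hypermodel=build_model,
    objective="val_accuracy",  # placeholder objective
    max_epochs=10,             # placeholder
    distribution_strategy=strategy,
)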
Hi @npanpaliya, can you share where you modified the code to change MirroredStrategy to OneDeviceStrategy? Thanks.
In my case, I had to specify --distribution_strategy=one_device here in my tests: https://github.com/open-ce/tensorflow-feedstock/blob/main/tests/open-ce-tests.yaml#L22
@npanpaliya I'm using the TensorFlow Model Garden, and tried your way of adding the strategy, but that parameter is not allowed in my case.
@tekumara could you please tell me in which script I should add this code? I tried it in the pool.py file but it did not work.
Which script of tensorflow/models are you running?
@npanpaliya I'm training BERT using run_pretraining.py (https://github.com/suchunxie/models/blob/master/official/nlp/bert/run_pretraining.py) and got the Bad file descriptor error. Then, following your post, I changed the python3.8/multiprocessing/pool.py file where the error is raised (see the picture below). [image: error.png] (My environment is Ubuntu + Docker + NVIDIA TensorFlow container.)
@suchunxie - You can specify the strategy here: https://github.com/suchunxie/models/blob/master/official/nlp/bert/run_pretraining.py#L207. "one_device" is supported: https://github.com/suchunxie/models/blob/65e571fdc903873362e59abe0aeec5c8018da750/official/common/distribute_utils.py#L158.
Hi @npanpaliya, it works! I tried this way before and it didn't work, but after you pointed it out I checked again and found that a backslash was missing before I passed --distribution_strategy. Stupid me > <. Thanks greatly for your help!
Hi @suchunxie, This is great! Glad to hear this! :)
It seems that a fix has been submitted (https://github.com/tensorflow/tensorflow/issues/56279#issuecomment-1151621844) and users need to wait for the TF 2.10 release.
Hi @charliermarsh, it looks like the issue is resolved with the stable version TensorFlow 2.9:
>>> import tensorflow
>>> def f():
... strategy = tensorflow.distribute.MirroredStrategy()
... with strategy.scope():
... tensorflow.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same")(
... tensorflow.keras.layers.Input(shape=(88, 88, 3))
... )
...
>>> f()
2022-07-29 10:47:18.249928: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.250980: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.377905: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.378958: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.379854: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.380701: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.384793: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-29 10:47:18.841990: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.842974: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.843762: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.844505: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.845247: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:18.846005: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.846699: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.847686: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.848526: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.849336: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.850099: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.850823: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13725 MB memory: -> device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5
2022-07-29 10:47:20.854556: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-07-29 10:47:20.855362: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 13791 MB memory: -> device: 1, name: Tesla T4, pci bus id: 0000:00:05.0, compute capability: 7.5
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
Not for me. Using TensorFlow 2.9.1, when exiting the interpreter it shows the exception:
In [1]: import tensorflow
...: def f():
...: strategy = tensorflow.distribute.MirroredStrategy()
...: with strategy.scope():
...: tensorflow.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same")(
...: tensorflow.keras.layers.Input(shape=(88, 88, 3))
...: )
...: f()
2022-07-29 12:54:45.169943: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-07-29 12:54:47.305006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 429 MB memory: -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:17:00.0, compute capability: 7.5
2022-07-29 12:54:47.305948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 9651 MB memory: -> device: 1, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:18:00.0, compute capability: 7.5
2022-07-29 12:54:47.306459: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 427 MB memory: -> device: 2, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:65:00.0, compute capability: 7.5
2022-07-29 12:54:47.306939: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 429 MB memory: -> device: 3, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:b4:00.0, compute capability: 7.5
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
In [2]:
Do you really want to exit ([y]/n)?
Exception ignored in: <function Pool.__del__ at 0x7ff160d75c10>
Traceback (most recent call last):
File "/home/user/miniconda3/lib/python3.8/multiprocessing/pool.py", line 268, in __del__
self._change_notifier.put(None)
File "/home/user/miniconda3/lib/python3.8/multiprocessing/queues.py", line 368, in put
self._writer.send_bytes(obj)
File "/home/user/miniconda3/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/home/user/miniconda3/lib/python3.8/multiprocessing/connection.py", line 411, in _send_bytes
self._send(header + buf)
File "/home/user/miniconda3/lib/python3.8/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
OSError: [Errno 9] Bad file descriptor
This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.
Hi @ZJaume, Could you share the system configuration, I am not able to replicate the issue. Thank you!
Hi, sorry for the inconvenience, but I've now tried with a fresh virtual environment and the error just disappeared, so I think the issue can be closed. The virtual environment that throws the exception has had many different TensorFlow versions installed, from 2.3 to 2.9. Maybe some outdated dependency is causing the error.
In case you want to reproduce it, my versions are:
TensorFlow version: 2.9.1
Python version: 3.8.13
OS: Ubuntu 18.04
And the output of pip freeze:
absl-py==1.1.0
aiohttp==3.8.1
aiosignal==1.2.0
antlr4-python3-runtime==4.8
astunparse==1.6.3
async-timeout==4.0.1
atomicwrites==1.4.0
attrs==21.2.0
backcall==0.2.0
bitarray==2.3.7
blessed==1.19.0
cachetools==4.2.4
certifi==2021.10.8
cffi==1.15.0
charset-normalizer==2.0.7
clang==5.0
click==8.0.3
colorama==0.4.4
Cython==0.29.24
dataclasses==0.6
datasets==1.16.1
decorator==5.1.0
dill==0.3.4
enlighten==1.10.1
fairseq==0.10.2
fastspell==0.1.5
fasttext==0.9.2
filelock==3.3.2
flatbuffers==1.12
frozenlist==1.2.0
fsspec==2021.11.1
ftfy==6.1.1
fuzzywuzzy==0.18.0
gast==0.4.0
gensim==4.1.2
google-auth==1.35.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
grpcio==1.41.1
h5py==3.1.0
hanzidentifier==1.0.2
huggingface-hub==0.1.0
hunspell==0.5.5
hydra-core==1.1.1
idna==3.3
importlib-resources==5.4.0
ipython==7.29.0
jedi==0.18.0
joblib==0.14.1
keras==2.9.0
Keras-Preprocessing==1.1.2
latexcodec==2.0.1
libclang==13.0.0
Markdown==3.3.4
matplotlib-inline==0.1.3
monocleaner==1.0
more-itertools==8.10.0
mtdata==0.3.1
multidict==5.2.0
multiprocess==0.70.12.2
nltk==3.6.5
numpy==1.23.0
oauthlib==3.1.1
omegaconf==2.1.1
opt-einsum==3.3.0
packaging==21.2
pandas==1.3.5
parso==0.8.2
pexpect==4.8.0
pickleshare==0.7.5
Pillow==8.4.0
pluggy==0.13.1
portalocker==2.3.0
prefixed==0.3.2
prompt-toolkit==3.0.22
protobuf==3.19.1
psutil==5.8.0
ptyprocess==0.7.0
py==1.10.0
pyarrow==6.0.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pybind11==2.8.1
pybtex==0.24.0
pycld2==0.31
pycparser==2.21
Pygments==2.10.0
pyparsing==2.4.7
pypinyin==0.46.0
pytest==5.1.2
python-dateutil==2.8.2
python-Levenshtein==0.12.2
pytz==2021.3
PyYAML==5.4.1
regex==2022.3.2
requests==2.26.0
requests-oauthlib==1.3.0
rsa==4.7.2
ruamel.yaml==0.17.17
ruamel.yaml.clib==0.2.6
sacrebleu==2.1.0
sacremoses==0.0.43
scikit-learn==0.22.1
scipy==1.4.1
sentence-transformers==2.1.0
sentencepiece==0.1.94
six==1.15.0
smart-open==5.2.1
tabulate==0.8.9
tensorboard==2.9.1
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.0
tensorflow==2.9.1
tensorflow-estimator==2.9.0
tensorflow-io-gcs-filesystem==0.24.0
termcolor==1.1.0
tf-estimator-nightly==2.8.0.dev2021122109
threadpoolctl==3.0.0
tokenizers==0.12.1
toolwrapper==0.4.1
torch==1.10.1
torch-train==0.0.3
torchsummary==1.5.1
torchvision==0.11.2
tqdm==4.62.3
traitlets==5.1.1
transformers==4.20.1
typing-extensions==3.7.4.3
Unidecode==1.2.0
urllib3==1.26.7
wcwidth==0.2.5
Werkzeug==2.0.2
wrapt==1.12.1
xxhash==2.0.2
yarl==1.7.2
zhon==1.1.5
zipp==3.7.0
Closing as stale. Please reopen if you'd like to work on this further.