returnn
PyTorch distributed training: could not unlink the shared memory file
[2023-12-31 11:33:54,580] INFO: Start Job: Job<alias/exp2023_04_25_rf/aed/v6-11gb-f32-bs15k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_100k-speedpertV2/train work/i6_core/returnn/training/ReturnnTrainingJob.EImqFihsdh2B> Task: run
...
RETURNN starting up, version 1.20231230.164342+git.f353135e, date/time 2023-12-31-11-34-07 (UTC+0000), pid 1868636, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.EImqFihsdh2B/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.EImqFihsdh2B/output/returnn.config']
Hostname: cn-237
...
ep 301 train, step 1177, ce 0.310, ctc_4 0.773, ctc_8 0.367, fer 0.043, mem_usage:cuda:3 7.6GB
ep 301 train, step 1178, ce 0.305, ctc_4 0.668, ctc_8 0.318, fer 0.052, mem_usage:cuda:2 7.6GB
ep 301 train, step 1178, ce 0.394, ctc_4 0.727, ctc_8 0.431, fer 0.067, mem_usage:cuda:0 7.6GB
ep 301 train, step 1178, ce 0.347, ctc_4 0.528, ctc_8 0.338, fer 0.044, mem_usage:cuda:1 7.5GB
ep 301 train, step 1178, ce 0.437, ctc_4 0.769, ctc_8 0.555, fer 0.078, mem_usage:cuda:3 7.6GB
Traceback (most recent call last):
File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/queues.py", line 244, in _feed
obj = _ForkingPickler.dumps(obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 428, in reduce_storage
fd, size = storage._share_fd_cpu_()
^^^^^^^^^^^^^^^^^^^^^^^^
File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/storage.py", line 297, in wrapper
return fn(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/storage.py", line 330, in _share_fd_cpu_
return super()._share_fd_cpu_(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: could not unlink the shared memory file /torch_1871018_294927500_44734 : No such file or directory (2)
...
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
[2023-12-31 13:51:47,355] INFO: Run time: 2:17:52 CPU: 0.80% RSS: 34.63GB VMS: 475.21GB
[2023-12-31 13:51:52,378] INFO: Run time: 2:17:57 CPU: 0.60% RSS: 21.74GB VMS: 347.22GB
[2023-12-31 13:51:53,986] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1868635 closing signal SIGTERM
[2023-12-31 13:51:54,211] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 1868636) of binary: /work/tools
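For reference, a minimal sketch of the PyTorch setting behind the failing _share_fd_cpu_() call in the traceback above: on Linux, torch.multiprocessing defaults to the "file_descriptor" sharing strategy, which creates and then unlinks shared-memory files like the /torch_... file named in the error when DataLoader workers send CPU tensors to the main process. Switching to "file_system" is only something one might experiment with here, not a confirmed fix for this issue.

# Sketch (not part of the failing run): inspect / change the CPU tensor
# sharing strategy used by torch.multiprocessing for inter-process tensors.
import torch.multiprocessing as mp

print(mp.get_all_sharing_strategies())  # on Linux: {'file_descriptor', 'file_system'}
print(mp.get_sharing_strategy())        # default on Linux: 'file_descriptor'

# 'file_system' keeps named shared-memory files on disk instead of
# unlinking them right after sharing, so it skips the unlink step that
# fails above, at the cost of possible /dev/shm leaks. It must be set
# early, before any worker processes are started.
mp.set_sharing_strategy('file_system')

Where such a call would best live in this RETURNN setup (config vs. training entry point) is left open here; the sketch only illustrates the PyTorch-level knob.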