returnn
PyTorch distributed training: could not unlink the shared memory file
[2023-12-31 11:33:54,580] INFO: Start Job: Job<alias/exp2023_04_25_rf/aed/v6-11gb-f32-bs15k-accgrad1-mgpu4-pavg100-wd1e_4-lrlin1e_5_100k-speedpertV2/train work/i6_core/returnn/training/ReturnnTrainingJob.EImqFihsdh2B> Task: run
...
RETURNN starting up, version 1.20231230.164342+git.f353135e, date/time 2023-12-31-11-34-07 (UTC+0000), pid 1868636, cwd /work/asr4/zeyer/setups-data/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.EImqFihsdh2B/work, Python /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11
RETURNN command line options: ['/u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.EImqFihsdh2B/output/returnn.config']
Hostname: cn-237
...
ep 301 train, step 1177, ce 0.310, ctc_4 0.773, ctc_8 0.367, fer 0.043, mem_usage:cuda:3 7.6GB
ep 301 train, step 1178, ce 0.305, ctc_4 0.668, ctc_8 0.318, fer 0.052, mem_usage:cuda:2 7.6GB
ep 301 train, step 1178, ce 0.394, ctc_4 0.727, ctc_8 0.431, fer 0.067, mem_usage:cuda:0 7.6GB
ep 301 train, step 1178, ce 0.347, ctc_4 0.528, ctc_8 0.338, fer 0.044, mem_usage:cuda:1 7.5GB
ep 301 train, step 1178, ce 0.437, ctc_4 0.769, ctc_8 0.555, fer 0.078, mem_usage:cuda:3 7.6GB
Traceback (most recent call last):
File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/queues.py", line 244, in _feed
obj = _ForkingPickler.dumps(obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/work/tools/users/zeyer/linuxbrew/opt/[email protected]/lib/python3.11/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 428, in reduce_storage
fd, size = storage._share_fd_cpu_()
^^^^^^^^^^^^^^^^^^^^^^^^
File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/storage.py", line 297, in wrapper
return fn(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/storage.py", line 330, in _share_fd_cpu_
return super()._share_fd_cpu_(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: could not unlink the shared memory file /torch_1871018_294927500_44734 : No such file or directory (2)
...
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:81] Timed out waiting 1800000ms for recv operation to complete
[2023-12-31 13:51:47,355] INFO: Run time: 2:17:52 CPU: 0.80% RSS: 34.63GB VMS: 475.21GB
[2023-12-31 13:51:52,378] INFO: Run time: 2:17:57 CPU: 0.60% RSS: 21.74GB VMS: 347.22GB
[2023-12-31 13:51:53,986] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1868635 closing signal SIGTERM
[2023-12-31 13:51:54,211] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 1868636) of binary: /work/tools
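For reference, a minimal sketch of the PyTorch setting behind the failing _share_fd_cpu_() call in the traceback above: on Linux, torch.multiprocessing defaults to the "file_descriptor" sharing strategy, which creates and then unlinks shared-memory files like the /torch_... file named in the error when DataLoader workers send CPU tensors to the main process. Switching to "file_system" is only something one might experiment with here, not a confirmed fix for this issue.

# Sketch (not part of the failing run): inspect / change the CPU tensor
# sharing strategy used by torch.multiprocessing for inter-process tensors.
import torch.multiprocessing as mp

print(mp.get_all_sharing_strategies())  # on Linux: {'file_descriptor', 'file_system'}
print(mp.get_sharing_strategy())        # default on Linux: 'file_descriptor'

# 'file_system' keeps named shared-memory files on disk instead of
# unlinking them right after sharing, so it skips the unlink step that
# fails above, at the cost of possible /dev/shm leaks. It must be set
# early, before any worker processes are started.
mp.set_sharing_strategy('file_system')

Where such a call would best live in this RETURNN setup (config vs. training entry point) is left open here; the sketch only illustrates the PyTorch-level knob.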