returnn icon indicating copy to clipboard operation
returnn copied to clipboard

multiprocessing: OSError: AF_UNIX path too long

Open michelwi opened this issue 7 months ago • 11 comments

Yesterday I started a training with DistributeFilesDataset and file caching which today crashed and consistently crashes after restarting with what I think is OSError: AF_UNIX path too long in the _TouchFilesThread.run.

The original files reside on a ceph cluster and are cached to an LVM volume with ext4 filesystem. The basename should be within reasonable length but the absolute path can be quite long.

I try to attach one version of each flavor of traces in order of appearence:

OSError
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.10/multiprocessing/managers.py", line 591, in _run_server
    server = cls._Server(registry, address, authkey, serializer)
  File "/usr/lib/python3.10/multiprocessing/managers.py", line 156, in __init__
    self.listener = Listener(address=address, backlog=16)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 448, in __init__
    self._listener = SocketListener(address, family, backlog)
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 591, in __init__
    self._socket.bind(address)
OSError: AF_UNIX path too long
_TouchFilesThread
Unhandled exception <class 'EOFError'> in thread <_MainThread(MainThread, started 139642623541696)>, proc 96008.

Thread <_TouchFilesThread(Thread-1, started daemon 139638043833920)>:
Process SyncManager-1:
  File "/usr/lib/python3.10/threading.py", line 324, in Condition.wait
    line: gotit = waiter.acquire(True, timeout)
    locals:
      gotit = <local> False
      waiter = <local> <locked _thread.lock object at 0x7f6b43d94640>
      waiter.acquire = <local> <built-in method acquire of _thread.lock object at 0x7f6b43d94640>
      timeout = <local> 1.0
  File "/usr/lib/python3.10/threading.py", line 607, in Event.wait
    line: signaled = self._cond.wait(timeout)
    locals:
      signaled = <local> False
      self = <local> <threading.Event object at 0x7f6b43d661d0>
      self._cond = <local> <Condition(<unlocked _thread.lock object at 0x7f6b43d85e80>, 1)>
      self._cond.wait = <local> <bound method Condition.wait of <Condition(<unlocked _thread.lock object at 0x7f6b43d85e80>, 1)>>
      timeout = <local> 1.0
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/util/file_cache.py", line 418, in _TouchFilesThread.run
    line: if self.stop.wait(self.interval):
    locals:
      self = <local> <_TouchFilesThread(Thread-1, started daemon 140098675930688)>
      self.stop = <local> <threading.Event object at 0x7f6b43d661d0>
      self.stop.wait = <local> <bound method Event.wait of <threading.Event object at 0x7f6b43d661d0>>
      self.interval = <local> 1.0
  File "/usr/lib/python3.10/threading.py", line 1016, in Thread._bootstrap_inner
    line: self.run()
    locals:
      self = <local> <_TouchFilesThread(Thread-1, started daemon 140098675930688)>
      self.run = <local> <bound method _TouchFilesThread.run of <_TouchFilesThread(Thread-1, started daemon 140098675930688)>>
  File "/usr/lib/python3.10/threading.py", line 973, in Thread._bootstrap
    line: self._bootstrap_inner()
    locals:
      self = <local> <_TouchFilesThread(Thread-1, started daemon 140098675930688)>
      self._bootstrap_inner = <local> <bound method Thread._bootstrap_inner of <_TouchFilesThread(Thread-1, started daemon 140098675930688)>>

Thread current, main, <_MainThread(MainThread, started 140103104491968)>:
(Excluded thread.)
Main
EXCEPTION
Traceback (most recent call last):
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/rnn.py", line 11, in <module>
    line: main()
    locals:
      main = <local> <function main at 0x7fd88b3a27a0>
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/__main__.py", line 735, in main
    line: init(command_line_options=argv[1:])
    locals:
      init = <global> <function init at 0x7fd88b3a24d0>
      command_line_options = <not found>
      argv = <local> ['/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/rnn.py', '/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/returnn/training/ReturnnTrainingJob.V7zBp10cMkXR/output/returnn.config'], _[0]: {len = 142}
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/__main__.py", line 487, in init
    line: init_engine()
    locals:
      init_engine = <global> <function init_engine at 0x7fd88b3a2320>
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/__main__.py", line 294, in init_engine
    line: engine = Engine(config=config)
    locals:
      engine = <global> None
      Engine = <local> <class 'returnn.torch.engine.Engine'>
      config = <global> <returnn.config.Config object at 0x7fd88b333fd0>
  File "/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/returnn/torch/engine.py", line 60, in Engine.__init__
    line: self._mp_manager = torch.multiprocessing.Manager()
    locals:
      self = <local> <returnn.torch.engine.Engine object at 0x7fd79384aa40>
      self._mp_manager = <local> !AttributeError: 'Engine' object has no attribute '_mp_manager'
      torch = <global> <module 'torch' from '/usr/local/lib/python3.10/dist-packages/torch/__init__.py'>
      torch.multiprocessing = <global> <module 'torch.multiprocessing' from '/usr/local/lib/python3.10/dist-packages/torch/multiprocessing/__init__.py'>
      torch.multiprocessing.Manager = <global> <bound method BaseContext.Manager of <multiprocessing.context.DefaultContext object at 0x7fd7a1d772e0>>
  File "/usr/lib/python3.10/multiprocessing/context.py", line 57, in BaseContext.Manager
    line: m.start()
    locals:
      m = <local> <multiprocessing.managers.SyncManager object at 0x7fd88b56d930>
      m.start = <local> <bound method BaseManager.start of <multiprocessing.managers.SyncManager object at 0x7fd88b56d930>>
  File "/usr/lib/python3.10/multiprocessing/managers.py", line 566, in BaseManager.start
    line: self._address = reader.recv()
    locals:
      self = <local> <multiprocessing.managers.SyncManager object at 0x7fd88b56d930>
      self._address = <local> None
      reader = <local> <multiprocessing.connection.Connection object at 0x7fd88b56fb80>
      reader.recv = <local> <bound method _ConnectionBase.recv of <multiprocessing.connection.Connection object at 0x7fd88b56fb80>>
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 250, in _ConnectionBase.recv
    line: buf = self._recv_bytes()
    locals:
      buf = <not found>
      self = <local> <multiprocessing.connection.Connection object at 0x7fd88b56fb80>
      self._recv_bytes = <local> <bound method Connection._recv_bytes of <multiprocessing.connection.Connection object at 0x7fd88b56fb80>>
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 414, in Connection._recv_bytes
    line: buf = self._recv(4)
    locals:
      buf = <not found>
      self = <local> <multiprocessing.connection.Connection object at 0x7fd88b56fb80>
      self._recv = <local> <bound method Connection._recv of <multiprocessing.connection.Connection object at 0x7fd88b56fb80>>
  File "/usr/lib/python3.10/multiprocessing/connection.py", line 383, in Connection._recv
    line: raise EOFError
    locals:
      EOFError = <builtin> <class 'EOFError'>
EOFError
Unhandled exception <class 'EOFError'> in thread <_MainThread(MainThread, started 139836318601664)>, proc 96015.
torch.distributed
E0703 09:18:03.509000 139702419329472 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 96008) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 883, in <module>
    main()
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/nas/models/asr/am/EN/16kHz/20240625_wmichel_en_model_training/work/i6_core/tools/git/CloneGitRepositoryJob.h2Y3KqeU0C4d/output/returnn/rnn.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 96009)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 96010)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 96011)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[4]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 4 (local_rank: 4)
  exitcode  : 1 (pid: 96012)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[5]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 96013)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[6]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 6 (local_rank: 6)
  exitcode  : 1 (pid: 96014)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[7]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 7 (local_rank: 7)
  exitcode  : 1 (pid: 96015)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-03_09:18:03
  host      : g-28.apptek.local
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 96008)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

EDIT: fixed torch.distributed trace

michelwi avatar Jul 03 '24 09:07 michelwi