[BUG] Potential race condition with multiple instances with hosts / devices file

Open AlexCheema opened this issue 2 months ago • 3 comments

Describe the bug

In mlx_distributed_init we create a devices file (for MlxJaccl) or a hosts file (for MlxRing). The file is named hosts_{rank}.json, so two instances that give a node the same rank use the same file name and can clash. This is a problem when multiple instances exist, especially if they initialize at around the same time.

To Reproduce

(Not verified, but as a guess) Steps to reproduce the behavior:

  1. Create 2 instances in quick succession with 2 nodes.
  2. Wait for both to initialize together.
  3. If both instances assign the same rank to the current node, they end up in a failure loop, each overwriting the other's devices / hosts file.
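
The clash in step 3 can be illustrated with a small sketch. This is not the exo code: `shared_hosts_path`, `simulate_clash`, and the addresses are hypothetical, and the only detail taken from this report is the hosts_{rank}.json naming.

```python
import json
import tempfile
from pathlib import Path


def shared_hosts_path(directory: Path, rank: int) -> Path:
    # Hypothetical helper: the file name depends only on the rank.
    return directory / f"hosts_{rank}.json"


def simulate_clash(directory: Path) -> tuple[list[str], list[str]]:
    """Two instances with rank 0 on one node write, then instance A reads back."""
    hosts_a = ["10.0.0.1", "10.0.0.2"]  # made-up addresses for instance A
    hosts_b = ["10.0.1.1", "10.0.1.2"]  # made-up addresses for instance B

    path_a = shared_hosts_path(directory, rank=0)
    path_b = shared_hosts_path(directory, rank=0)  # same path as path_a

    path_a.write_text(json.dumps(hosts_a))
    path_b.write_text(json.dumps(hosts_b))  # replaces instance A's file

    # Instance A now reads back whatever is at its path.
    seen_by_a = json.loads(path_a.read_text())
    return hosts_a, seen_by_a


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        wrote, seen = simulate_clash(Path(tmp))
        print("instance A wrote:", wrote)
        print("instance A reads:", seen)
```

The writes here are sequential, so the overwrite is deterministic. With truly concurrent writers, a reader could presumably also see a partially written file, which might be what surfaces as a malformed-file error, although that has not been verified.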

Expected behavior

Both instances should be allowed to initialize concurrently without overwriting each other's devices / hosts file.

We should make the file name unique for each instance (or even create a new file for each initialization).
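
One possible shape for this is sketched below, assuming the backend only needs a path to a JSON file. The function `write_unique_hosts_file` and its parameters (`directory`, `instance_id`, `payload`) are hypothetical and not part of exo. The sketch uses `tempfile.mkstemp`, which creates a new file with a unique name on every call.

```python
import json
import os
import tempfile
from pathlib import Path


def write_unique_hosts_file(
    directory: Path, rank: int, instance_id: str, payload: list
) -> Path:
    """Write payload to a file that is unique per instance and per call."""
    directory.mkdir(parents=True, exist_ok=True)
    # mkstemp opens a brand-new file exclusively, so two concurrent
    # initializations never end up with the same path.
    fd, name = tempfile.mkstemp(
        prefix=f"hosts_{instance_id}_{rank}_",
        suffix=".json",
        dir=str(directory),
    )
    with os.fdopen(fd, "w") as f:
        json.dump(payload, f)
    return Path(name)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        a = write_unique_hosts_file(Path(tmp), 0, "instance-a", ["10.0.0.1"])
        b = write_unique_hosts_file(Path(tmp), 0, "instance-b", ["10.0.1.1"])
        print(a.name != b.name)
```

The caller would pass the returned path to the distributed backend and delete the file once initialization has finished. Including the instance id in the prefix is only there to make the files easier to identify while debugging, since uniqueness already comes from `mkstemp`.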

Actual behavior

The logs show this error (see for example bug-report: reports/2025-12-24T17:21:53Z, which is from #1004):

[ 2025-12-24 13:04:46.001 | WARNING  | exo.worker.runner.runner:main:249 ] Runner 079d339e-5e39-4b45-99d1-514f529976fb crashed with critical exception [jaccl] Malformed device file
Traceback (most recent call last):

  File "__main__.py", line 38, in <module>

  File "pyi_rth_multiprocessing.py", line 48, in _freeze_support

  File "multiprocessing/spawn.py", line 122, in spawn_main

  File "multiprocessing/spawn.py", line 135, in _main

  File "multiprocessing/process.py", line 313, in _bootstrap

  File "multiprocessing/process.py", line 108, in run

  File "exo/worker/runner/bootstrap.py", line 35, in entrypoint

> File "exo/worker/runner/runner.py", line 96, in main

  File "exo/worker/engines/mlx/utils_mlx.py", line 204, in initialize_mlx

  File "exo/worker/engines/mlx/utils_mlx.py", line 188, in mlx_distributed_init

RuntimeError: [jaccl] Malformed device file

Environment

  • macOS Version: 26.2
  • EXO Version: 1.0.59
  • Hardware:
    • Device 1: 48GB M4 Pro Mac Mini
    • Device 2: 64GB M4 Max Mac Studio
  • Interconnection:
    • Thunderbolt 5 between Device 1 and Device 2
    • Ethernet between Device 1 and Device 2

AlexCheema, Dec 24 '25 18:12