exo
[BUG] Potential race condition with multiple instances with hosts / devices file
Describe the bug
In mlx_distributed_init we create a devices file (for MlxJaccl) or a hosts file (for MlxRing). The file is named hosts_{rank}.json, so the name depends only on the rank. Two instances that give the current node the same rank therefore use the same file name and can overwrite each other's file. This is most likely to happen when multiple instances initialize at around the same time.
To Reproduce
Steps to reproduce the behavior (a guess, not verified):
- Create 2 instances in quick succession with 2 nodes.
- Wait for both to initialize together.
- Assuming both instances give the current node the same rank, they both end up in a failure loop, each overwriting the other's devices / hosts file.
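The suspected clash can be illustrated with a small sketch. It is not exo's code: `rank_only_path`, `write_config`, and `simulate_clash` are hypothetical helpers, and the payloads are made up. It only shows that a path built from the rank alone is identical for two instances, so the second write replaces the first. It does not reproduce the partially written file that might explain the "Malformed device file" error.

```python
import json
import tempfile
from pathlib import Path


def rank_only_path(base_dir: Path, rank: int) -> Path:
    # Hypothetical helper mirroring the hosts_{rank}.json naming from the report.
    return base_dir / f"hosts_{rank}.json"


def write_config(path: Path, payload: dict) -> None:
    # Plain, non-atomic write: the file is truncated first, then filled.
    path.write_text(json.dumps(payload))


def simulate_clash() -> tuple[dict, dict]:
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        # Two instances, both holding rank 0 on this node.
        path_a = rank_only_path(base, rank=0)
        path_b = rank_only_path(base, rank=0)
        write_config(path_a, {"instance": "A"})
        write_config(path_b, {"instance": "B"})
        # Both instances now read back whatever was written last.
        seen_by_a = json.loads(path_a.read_text())
        seen_by_b = json.loads(path_b.read_text())
        return seen_by_a, seen_by_b
```

In this sketch instance A reads instance B's configuration, because both paths resolve to the same file.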
Expected behavior
Both instances should be allowed to initialize concurrently without overwriting each other's devices / hosts file.
We should make the file name unique for each instance (or even write a new file on each initialization).
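One possible shape for such a fix is sketched below. It is an assumption, not the project's implementation: `unique_config_path` and `write_config_atomically` are hypothetical names, and the `instance_id` argument stands in for whatever per-instance identifier is available at initialization. The atomic write (temporary file plus `os.replace`) is an extra precaution beyond the renaming proposed here, so that a reader never sees a half-written file.

```python
import json
import os
import tempfile
from pathlib import Path


def unique_config_path(base_dir: Path, instance_id: str, rank: int) -> Path:
    # Hypothetical naming scheme: include the instance id so that two
    # instances with the same rank no longer share a file name.
    return base_dir / f"hosts_{instance_id}_{rank}.json"


def write_config_atomically(path: Path, payload: object) -> None:
    # Write to a temporary file in the same directory, then rename it into
    # place. os.replace is atomic when source and target share a filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(payload, f)
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong before the rename.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

With this naming, two instances that both hold rank 0 on a node write to different files, so neither can overwrite the other's configuration.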
Actual behavior
The logs show this error (see, for example, bug report reports/2025-12-24T17:21:53Z, which is from #1004):
[ 2025-12-24 13:04:46.001 | WARNING | exo.worker.runner.runner:main:249 ] Runner 079d339e-5e39-4b45-99d1-514f529976fb crashed with critical exception [jaccl] Malformed device file
Traceback (most recent call last):
File "__main__.py", line 38, in <module>
File "pyi_rth_multiprocessing.py", line 48, in _freeze_support
File "multiprocessing/spawn.py", line 122, in spawn_main
File "multiprocessing/spawn.py", line 135, in _main
File "multiprocessing/process.py", line 313, in _bootstrap
File "multiprocessing/process.py", line 108, in run
File "exo/worker/runner/bootstrap.py", line 35, in entrypoint
> File "exo/worker/runner/runner.py", line 96, in main
File "exo/worker/engines/mlx/utils_mlx.py", line 204, in initialize_mlx
File "exo/worker/engines/mlx/utils_mlx.py", line 188, in mlx_distributed_init
RuntimeError: [jaccl] Malformed device file
Environment
- macOS Version: 26.2
- EXO Version: 1.0.59
- Hardware:
  - Device 1: 48GB M4 Pro Mac Mini
  - Device 2: 64GB M4 Max Mac Studio
- Interconnection:
  - Thunderbolt 5 between Device 1 and Device 2
  - Ethernet between Device 1 and Device 2