
[BUG] After one node went offline, all the computers became unresponsive with their CPUs running at 100% utilization.

Open • VoidMore opened this issue 2 months ago • 2 comments

Describe the bug

DeepSeek loaded and ran successfully, but one node went offline during inference. The remaining Mac Studios then got stuck at 100% CPU usage, and killing the process did not help. Only restarting the machines restored normal operation.

To Reproduce

Steps to reproduce the behavior:

  1. Run EXO on four Mac Studio computers.
  2. Load DeepSeek successfully, then start inference.
  3. One of the nodes goes offline. Inference is interrupted and the system then returns to the RUNNING state, while the CPUs of the other three Mac Studios stay at 100% utilization.

Expected behavior

When a node goes offline, the model inference should be interrupted, the system should return to the READY state, and the other Mac Studios should resume their normal CPU state.
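
The requested transition can be illustrated with a minimal sketch. Everything here is hypothetical: `Cluster`, `run_inference`, and the `peer_lost` event are illustrative stand-ins, not EXO's API, and the sketch says nothing about how EXO or MLX handle a lost peer internally. It only shows the expected outcome: the running job is cancelled when a peer disappears and the state goes back to READY.

```python
import asyncio
from enum import Enum


class State(Enum):
    READY = "READY"
    RUNNING = "RUNNING"


class Cluster:
    """Hypothetical stand-in for a cluster controller; not EXO's real API."""

    def __init__(self) -> None:
        self.state = State.READY

    async def run_inference(self, work, peer_lost: asyncio.Event) -> str:
        """Run `work`, cancelling it as soon as `peer_lost` is set."""
        self.state = State.RUNNING
        job = asyncio.ensure_future(work)
        lost = asyncio.ensure_future(peer_lost.wait())
        try:
            done, _ = await asyncio.wait(
                {job, lost}, return_when=asyncio.FIRST_COMPLETED
            )
            if job in done:
                return "completed"
            # A peer went away first: stop the job and wait for it to unwind.
            job.cancel()
            await asyncio.gather(job, return_exceptions=True)
            return "interrupted"
        finally:
            lost.cancel()
            # Whatever happened, the cluster is idle again.
            self.state = State.READY


async def demo() -> tuple:
    cluster = Cluster()
    peer_lost = asyncio.Event()
    # Simulate a peer dropping out shortly after generation starts.
    asyncio.get_running_loop().call_later(0.05, peer_lost.set)
    outcome = await cluster.run_inference(asyncio.sleep(5), peer_lost)
    return outcome, cluster.state


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch the `finally` block is what guarantees the return to READY, regardless of whether the job finished or was interrupted.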

Actual behavior

After one node went offline, all other nodes experienced 100% CPU utilization, and killing the processes didn't help. The only way to restore normal operation was to restart the Mac Studio.
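
To show which processes are pinning the CPU, a snapshot like the following could be captured on each affected machine before restarting it. This is a minimal sketch, assuming the standard `ps` utility is available; the function names `parse_ps`, `busiest`, and `snapshot` are made up for illustration and are not part of EXO.

```python
import subprocess


def parse_ps(output: str) -> list:
    """Parse `ps -Ao pid,pcpu,comm` output into (pid, cpu_percent, command) tuples."""
    rows = []
    for line in output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)  # the command may contain spaces
        if len(parts) < 3:
            continue
        try:
            rows.append((int(parts[0]), float(parts[1]), parts[2].strip()))
        except ValueError:
            continue  # ignore rows that do not parse cleanly
    return rows


def busiest(rows: list, limit: int = 5) -> list:
    """Return the `limit` rows with the highest CPU percentage."""
    return sorted(rows, key=lambda r: r[1], reverse=True)[:limit]


def snapshot(limit: int = 5) -> list:
    """Run `ps` once and return the busiest processes."""
    out = subprocess.run(
        ["ps", "-Ao", "pid,pcpu,comm"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return busiest(parse_ps(out), limit)


if __name__ == "__main__":
    for pid, cpu, comm in snapshot():
        print(f"{pid:>7} {cpu:>6.1f}% {comm}")
```

Running the script on each of the three stuck machines and attaching the output would make it clear whether the load comes from the EXO process itself or from something else.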

Environment

  • macOS Version: 26.2
  • EXO Version: latest
  • Hardware:
    • Device 1: Mac Studio M3 Ultra, 512 GB
    • Device 2: Mac Studio M3 Ultra, 512 GB
    • Device 3: Mac Studio M3 Ultra, 512 GB
    • Device 4: Mac Studio M3 Ultra, 512 GB
    • Additional devices:
  • Interconnection:
    • Thunderbolt 5 cable between Device 1 and 2 and 3 and 4


VoidMore, Jan 07 '26 02:01

This is the infamous "GPU Lock": an MLX/OS issue caused by RDMA that can only be fixed by restarting the devices in question. We do our best to prevent it, but killing processes indiscriminately or a device shutting down mid-generation is a surefire way to cause it.

Can you tell us what caused the node to go offline?

Evanev7, Jan 07 '26 11:01

This is the infamous "GPU Lock": an MLX/OS issue caused by RDMA that can only be fixed by restarting the devices in question. We do our best to prevent it, but killing processes indiscriminately or a device shutting down mid-generation is a surefire way to cause it.

Can you tell us what caused the node to go offline?

It may have been caused by the node running unstably or by network problems. It only happens occasionally.

VoidMore, Jan 09 '26 09:01