
[BUG] All models get stuck on WARMING UP with pipeline/RDMA

Open AlexCheema opened this issue 2 months ago • 4 comments

Describe the bug

When launching an instance of any model with pipeline and RDMA, it gets stuck on WARMING UP.

To Reproduce

Steps to reproduce the behavior:

  1. Launch an instance of any model with pipeline and RDMA
  2. It will get stuck on WARMING UP

Expected behavior

Instance should pass warm up and reach READY state.

Actual behavior

Gets stuck in WARMING UP. The logs show no tokens generated at all, which suggests it is stuck in communication, perhaps due to an ordering issue.

Environment

  • macOS Version: 26.3
  • EXO Version: Latest main 007eb8002919182e3c2149c7a089ef8f44ffcab4
  • Hardware:
    • 2 x 512GB M3 Ultra
  • Interconnection:
    • TB5 + Ethernet switch (all-to-all)

An annoying side effect is that while it is stuck communicating the GPU shows 100% utilization, which leaves the GPU locked when you kill exo.

AlexCheema avatar Jan 12 '26 23:01 AlexCheema

I can reproduce a very similar issue in my setup.

Environment is almost the same (2× M3 Ultra 512GB, TB5), but my OS is macOS 26.2 and I’m using the exo 1.0.6.0 app.

In my case, RDMA cannot load the model at all (it never reaches READY and produces no tokens). Both Pipeline and Tensor parallelism fail with RDMA, while MLX Ring works with the same models.

This looks like an RDMA initialization / communication issue rather than a model-specific problem.

aaronysl avatar Jan 13 '26 02:01 aaronysl

I'm experiencing a similar issue with EXO when loading the Qwen3 30B model. The model gets stuck indefinitely during the WARMING UP phase. When attempting to delete the model through the interface, the memory is not released. Even after exiting EXO and restarting, the memory remains occupied. A complete system shutdown is required to free the GPU memory.

aaronysl avatar Jan 15 '26 01:01 aaronysl

Qwen models seem broken with MLX_FAST_SYNCH; we may just turn it off for them. You're saying this issue occurs in 1.0.6.0?

Evanev7 avatar Jan 15 '26 10:01 Evanev7

Same issue, and the logs:

    [ 07:17:40.5323PM | INFO ] finding cycles:
    [ 07:17:40.5324PM | WARNING ] You have likely selected ibv for a single node instance; falling back to MlxRing
    [ 07:17:40.5325PM | INFO ] finding cycles:
    [ 07:17:40.5327PM | INFO ] Searching 12D3KooWBxRgAfy5HcKQ5Mj1GgKJmaFEegV3SzHmxYii6nC9x425 for ip 169.254.68.240:
    [ 07:17:40.5328PM | INFO ] | en2: 169.254.68.240
    [ 07:17:40.5328PM | INFO ] Found
    [ 07:17:40.5328PM | INFO ] Interface name for 169.254.68.240 on 12D3KooWBxRgAfy5HcKQ5Mj1GgKJmaFEegV3SzHmxYii6nC9x425: rdma_en2
    [ 07:17:40.5329PM | INFO ] Searching 12D3KooWMz2vyta6mEzd8mFeFVM4mqH39VgW24v3LnG3WijDQDP6 for ip 169.254.118.209:
    [ 07:17:40.5329PM | INFO ] | en3: 169.254.118.209
    [ 07:17:40.5330PM | INFO ] Found
    [ 07:17:40.5330PM | INFO ] Interface name for 169.254.118.209 on 12D3KooWMz2vyta6mEzd8mFeFVM4mqH39VgW24v3LnG3WijDQDP6: rdma_en3

qoyooo avatar Jan 15 '26 11:01 qoyooo