
[BUG] All models get stuck on WARMING UP with pipeline/RDMA

Open AlexCheema opened this issue 2 months ago • 4 comments

Describe the bug

When launching an instance of any model with pipeline and RDMA, it gets stuck on WARMING UP.

To Reproduce

Steps to reproduce the behavior:

  1. Launch an instance of any model with pipeline and RDMA
  2. It will get stuck on WARMING UP

Expected behavior

Instance should pass warm up and reach READY state.

Actual behavior

Gets stuck in WARMING UP. The logs show no tokens generated at all, which suggests it is stuck in communication, perhaps due to an ordering issue.

Environment

  • macOS Version: 26.3
  • EXO Version: Latest main 007eb8002919182e3c2149c7a089ef8f44ffcab4
  • Hardware:
    • 2 x 512GB M3 Ultra
  • Interconnection:
    • TB5 + Ethernet switch (all-to-all)

An annoying side effect is that while it is stuck communicating the GPU shows 100% utilization, which leaves the GPU locked when you kill exo.

AlexCheema avatar Jan 12 '26 23:01 AlexCheema

I can reproduce a very similar issue in my setup.

Environment is almost the same (2× M3 Ultra 512GB, TB5), but my OS is macOS 26.2 and I’m using the exo 1.0.6.0 app.

In my case, RDMA cannot load the model at all (it never reaches READY and produces no tokens). Both Pipeline and Tensor parallelism fail with RDMA, while MLX Ring works with the same models.

This looks like an RDMA initialization / communication issue rather than a model-specific problem.

aaronysl avatar Jan 13 '26 02:01 aaronysl

I'm experiencing a similar issue with EXO when loading the Qwen3 30B model. The model gets stuck indefinitely during the WARMING UP phase. When attempting to delete the model through the interface, the memory is not released. Even after exiting EXO and restarting, the memory remains occupied. A complete system shutdown is required to free the GPU memory.

aaronysl avatar Jan 15 '26 01:01 aaronysl

Qwen models seem broken with MLX_FAST_SYNCH; we may just turn it off for them. You're saying this issue occurs in 1.0.6.0?

Evanev7 avatar Jan 15 '26 10:01 Evanev7

Same issue, and the logs:

    [ 07:17:40.5323PM | INFO ] finding cycles:
    [ 07:17:40.5324PM | WARNING ] You have likely selected ibv for a single node instance; falling back to MlxRing
    [ 07:17:40.5325PM | INFO ] finding cycles:
    [ 07:17:40.5327PM | INFO ] Searching 12D3KooWBxRgAfy5HcKQ5Mj1GgKJmaFEegV3SzHmxYii6nC9x425 for ip 169.254.68.240:
    [ 07:17:40.5328PM | INFO ] | en2: 169.254.68.240
    [ 07:17:40.5328PM | INFO ] Found
    [ 07:17:40.5328PM | INFO ] Interface name for 169.254.68.240 on 12D3KooWBxRgAfy5HcKQ5Mj1GgKJmaFEegV3SzHmxYii6nC9x425: rdma_en2
    [ 07:17:40.5329PM | INFO ] Searching 12D3KooWMz2vyta6mEzd8mFeFVM4mqH39VgW24v3LnG3WijDQDP6 for ip 169.254.118.209:
    [ 07:17:40.5329PM | INFO ] | en3: 169.254.118.209
    [ 07:17:40.5330PM | INFO ] Found
    [ 07:17:40.5330PM | INFO ] Interface name for 169.254.118.209 on 12D3KooWMz2vyta6mEzd8mFeFVM4mqH39VgW24v3LnG3WijDQDP6: rdma_en3

qoyooo avatar Jan 15 '26 11:01 qoyooo