
[BUG] Failed to shard gpt-oss-120b-MXFP4-Q8 across 2 nodes in v1.0.62 (Regression from v1.0.60)

Open andrewwutw opened this issue 2 months ago • 5 comments

Describe the bug

In version 1.0.62, I encountered an issue where the mlx-community/gpt-oss-120b-MXFP4-Q8 model fails to load when distributed across 2 nodes. The instance status cycles repeatedly between loading -> failed -> unknown and never successfully initializes.

Notably, this setup worked correctly in version 1.0.60.

To Reproduce

  1. Open EXO Dashboard.
  2. Select model: mlx-community/gpt-oss-120b-MXFP4-Q8.
  3. Set Sharding to "Tensor".
  4. Set Instance Type to "MLX RDMA".
  5. Select 2 nodes for loading.
  6. Click "Launch Instance".

Expected behavior

The model should be partitioned and loaded successfully across the 2 nodes, as it did in version 1.0.60.

Actual behavior

The instance status fluctuates between loading, failed, and unknown, and the model fails to load.


Environment

  • macOS Version: 26.2
  • EXO Version: 1.0.62 (b9a78f6f3aa119624e9bfeac1038071de6d68e59)
  • Hardware:
    • 4 × M3 Ultra Mac Studio, 512GB RAM.
  • Interconnection:
    • 6 Thunderbolt 5 Pro cables, fully connected network.
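
As a quick sanity check on the topology listed above (an illustration only, not part of exo): a fully connected network of n machines needs n(n-1)/2 direct links, which comes to 6 for 4 machines and matches the cable count. The function names below are hypothetical.

```python
def full_mesh_links(nodes: int) -> int:
    """Number of direct cables needed so every pair of nodes is linked."""
    if nodes < 0:
        raise ValueError("nodes must be non-negative")
    return nodes * (nodes - 1) // 2


def ports_used_per_node(nodes: int) -> int:
    """Ports each machine dedicates to the mesh (one per peer)."""
    return max(nodes - 1, 0)


if __name__ == "__main__":
    # 4 machines, as in the Environment section of this report.
    print(full_mesh_links(4), ports_used_per_node(4))
```

With 4 machines, each one uses 3 ports for the mesh; a 2-node instance such as the one in this report only needs the single link between the two selected machines.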

Additional context

RDMA Status: Enabled (verified with rdma_ctl enable).

It appears that this issue might be related to, or the same as, #1124.

andrewwutw, Jan 12 '26 07:01

I’m running exo on two Mac Studios (macOS Tahoe 26.2, Thunderbolt 5, RDMA enabled).

I noticed that:

Pipeline and MLX Ring modes allow selecting 2 nodes and work as expected.

But when I select Tensor or Tensor + MLX RDMA, the UI only allows 1 node (minimum nodes is locked to 1).

Both machines can run exo individually, models are synced, and RDMA is enabled via rdma_ctl enable.

Is this a current limitation of exo’s Tensor/RDMA implementation, or is there something missing in my setup? Has anyone been able to use Tensor + RDMA with multiple nodes?

aaronysl, Jan 12 '26 08:01

I remember this being possible in version 1.0.60, and it was working fine then. Unfortunately, I can't revert to that version to verify. 😢

andrewwutw, Jan 12 '26 10:01

@aaronysl I suggest rolling back to version 1.0.60 for now. It works for me after downgrading. You can download it from #1128.

andrewwutw, Jan 12 '26 14:01

Apologies @andrewwutw @aaronysl, it looks like there were quite a few regressions in 1.0.61/1.0.62. Stick to 1.0.60 for now and we will fix these in 1.0.63.

AlexCheema, Jan 12 '26 23:01

The latest app (https://exolabs.net/) now supports 2-node operation with both Tensor and RDMA. Note: on Mac Studio, not every Thunderbolt 5 port enables RDMA, so try each one. I've validated the newest build; feel free to test it yourself.

aaronysl, Jan 14 '26 01:01

Should be fixed in the next build!

Evanev7, Jan 14 '26 16:01