[BUG] Failed to shard gpt-oss-120b-MXFP4-Q8 across 2 nodes in v1.0.62 (Regression from v1.0.60)
Describe the bug
In version 1.0.62, I encountered an issue where the mlx-community/gpt-oss-120b-MXFP4-Q8 model fails to load when distributed across 2 nodes. The instance status cycles repeatedly between loading -> failed -> unknown and never successfully initializes.
Notably, this setup worked correctly in version 1.0.60.
To Reproduce
- Open EXO Dashboard.
- Select model: mlx-community/gpt-oss-120b-MXFP4-Q8.
- Set Sharding to "Tensor".
- Set Instance Type to "MLX RDMA".
- Select 2 nodes for loading.
- Click "Launch Instance".
Expected behavior
The model should be partitioned and loaded successfully across the 2 nodes, as it did in version 1.0.60.
Actual behavior
The instance status fluctuates between loading, failed, and unknown, and the model fails to load.
Environment
- macOS Version: 26.2
- EXO Version: 1.0.62 (b9a78f6f3aa119624e9bfeac1038071de6d68e59)
- Hardware:
- 4x M3 Ultra Mac Studio, 512GB RAM each.
- Interconnection:
- 6 Thunderbolt 5 Pro cables in a fully connected mesh.
Additional context
RDMA Status: Enabled (Verified with rdma_ctl enable)
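For reference, this is the command used on each node (a sketch based on the rdma_ctl tool mentioned above; exact invocation and any reboot requirement may vary by macOS build):

```
# Enable RDMA over Thunderbolt on this node (run on every node in the cluster)
rdma_ctl enable
```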
It appears that this issue might be related to, or the same as, #1124.
I’m running exo on two Mac Studios (macOS Tahoe 26.2, Thunderbolt 5, RDMA enabled).
I noticed that:
Pipeline and MLX Ring modes allow selecting 2 nodes and work as expected.
But when I select Tensor or Tensor + MLX RDMA, the UI only allows 1 node (the node selector is locked to 1).
Both machines can run exo individually, models are synced, and RDMA is enabled via rdma_ctl enable.
Is this a current limitation of exo’s Tensor/RDMA implementation, or is there something missing in my setup? Has anyone been able to use Tensor + RDMA with multiple nodes?
I remember this being possible in version 1.0.60, and it was working fine then. Unfortunately, I can't revert to that version to verify. 😢
@aaronysl I suggest rolling back to version 1.0.60 for now; it worked for me after downgrading. You can download it from #1128.
Apologies @andrewwutw @aaronysl it looks like there were quite a few regressions in 1.0.61/1.0.62. Stick to 1.0.60 for now and we will fix these in 1.0.63.
The latest app (https://exolabs.net/) now supports 2-node operation with both Tensor and RDMA. Note: on Mac Studio, not every Thunderbolt 5 port enables RDMA, so try each one. I've validated the newest build; feel free to test it yourself.
Should be fixed in the next build!