
mlx: update to 0.30.1 and align coordinator naming with MLX conventions

Open JakeHillion opened this issue 2 months ago • 0 comments

The Jaccl distributed backend requires MLX 0.30.1+, which adds RDMA over Thunderbolt support. With the previous minimum version (0.29.3), it failed at runtime with "The only valid values for backend are 'any', 'mpi' and 'ring' but 'jaccl' was provided."
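To illustrate the version requirement, here is a minimal sketch of a startup check that rejects an MLX version too old for the `jaccl` backend. It is not part of this change: the names `MIN_JACCL_VERSION`, `parse_version` and `supports_jaccl` are hypothetical, and the only fact taken from above is the 0.30.1 minimum.

```python
import re

# Minimum MLX version with the jaccl backend, as stated in this description.
MIN_JACCL_VERSION = (0, 30, 1)


def parse_version(version: str) -> tuple[int, ...]:
    """Return the leading major.minor.patch of a version string as ints.

    Suffixes such as ".dev0" or "+local" are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def supports_jaccl(version: str) -> bool:
    """True if the given MLX version is new enough for the jaccl backend."""
    return parse_version(version) >= MIN_JACCL_VERSION
```

In practice the installed version could be read with `importlib.metadata.version("mlx")` and passed to `supports_jaccl`, so that an old install fails early with a clear message instead of the backend error quoted above.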

Bump MLX dependency to >=0.30.1 and rename ibv_coordinators to jaccl_coordinators to match MLX's naming conventions. This includes the environment variable change from MLX_IBV_COORDINATOR to MLX_JACCL_COORDINATOR.
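For anyone with launch scripts that still export the old variable, a transitional lookup could look roughly like the sketch below. This is an assumption, not what this change does: the change renames the variable outright, and `resolve_coordinator` is a hypothetical helper. Only the two variable names come from the description above.

```python
import os
import warnings
from typing import Mapping, Optional

NEW_VAR = "MLX_JACCL_COORDINATOR"  # name after this change
OLD_VAR = "MLX_IBV_COORDINATOR"    # name before this change


def resolve_coordinator(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return the coordinator setting, preferring the new variable name.

    Falls back to the old name with a deprecation warning. Returns None
    when neither variable is set.
    """
    if env is None:
        env = os.environ
    if NEW_VAR in env:
        return env[NEW_VAR]
    if OLD_VAR in env:
        warnings.warn(
            f"{OLD_VAR} is deprecated; set {NEW_VAR} instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return env[OLD_VAR]
    return None
```

Passing a plain dict as `env` keeps the helper easy to test without touching the real process environment.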

Test plan:

Hardware setup: 3x Mac Studio M3 Ultra connected all-to-all with TB5

  • Built a DMG [0].
  • Installed on all Macs and started cluster.
  • Requested a 2-node Tensor + MLX RDMA instance of Llama 3.3 70B (FP16).
  • It started successfully.
  • Queried the chat a few times. All was good. This didn't work previously.
  • Killed the instance and spawned a Pipeline + MLX Ring instance of Llama 3.3 70B (FP16). It also started successfully on two nodes and could be queried.

Still not working:

  • Pipeline + MLX Ring on 3 nodes is failing. Haven't debugged that yet.

[0] https://github.com/exo-explore/exo/actions/runs/20467656904/job/58815275013

JakeHillion, Dec 23 '25 19:12