[BUG] Mac Studio M3 Ultra uses TB5 interconnect, but cannot use RDMA.
Describe the bug
Four Mac Studio M3 Ultra machines are fully interconnected over Thunderbolt 5 with RDMA enabled, but exo cannot use MLX RDMA between them.
To Reproduce
Steps to reproduce the behavior:
- Connected the four Mac Studios with TB5 links: A->B, A->C, A->D, B->C, B->D, and C->D.
- On each machine, entered recovery mode, typed "rdma_ctl enable" in Terminal, and restarted.
- Ran uv run exo on each Mac, but MLX RDMA was not working. What could be the reason?
Expected behavior
The four Mac Studio instances can connect to each other via RDMA.
Actual behavior
When running uv run exo on each Mac, MLX RDMA is not working.
Environment
- macOS Version: 26.2
- EXO Version: 2.0
- Hardware:
- Device 1: Mac Studio M3 Ultra, 512 GB RAM
- Device 2: Mac Studio M3 Ultra, 512 GB RAM
- Device 3: Mac Studio M3 Ultra, 512 GB RAM
- Device 4: Mac Studio M3 Ultra, 512 GB RAM
- Interconnection:
- device 1 -> device 2, device 1 -> device 3, device 1 -> device 4
- device 2 -> device 3, device 2 -> device 4
- device 3 -> device 4
Hi runfeng, I appreciate you're having trouble with this, but please don't ask for help in other issues that aren't relevant to the problem you're encountering.
Try increasing the minimum nodes from 1 to 4 - your topology looks correct, so you should at least be able to place an instance.
I can reproduce this issue, and increasing to 4 doesn't help on my end. I was puzzled by the same issue and frequently checked this issue board over the past 2 days to see if someone had encountered it, and here you go. I have the same setup: M3 Ultra 512GB Mac Studios interconnected with TB5, RDMA enabled. I've noticed that on one Mac's terminal it says:
[ 04:14:47.6034PM | WARNING ] Failed to find interface name between 12D3KooWNXSHjU3E2FWfDvAV8A5YJj3AXCBDkEeXCGdT1qpNL8bk and 12D3KooWGhMR1nTVa5FQHstttsFycSWoSLzBgHfdnby5H6MGJ6Hh
[ 04:14:47.6035PM | INFO ] finding cycles:
[ 04:14:47.6036PM | INFO ] Searching 12D3KooWGhMR1nTVa5FQHstttsFycSWoSLzBgHfdnby5H6MGJ6Hh for ip 169.254.24.9:
[ 04:14:47.6036PM | INFO ] | en4: 169.254.24.9
[ 04:14:47.6037PM | INFO ] Found
[ 04:14:47.6037PM | INFO ] Interface name for 169.254.24.9 on 12D3KooWGhMR1nTVa5FQHstttsFycSWoSLzBgHfdnby5H6MGJ6Hh: rdma_en4
[ 04:14:47.6038PM | INFO ] Searching 12D3KooWNXSHjU3E2FWfDvAV8A5YJj3AXCBDkEeXCGdT1qpNL8bk for ip 192.168.1.38:
[ 04:14:47.6038PM | INFO ] Searching 12D3KooWNXSHjU3E2FWfDvAV8A5YJj3AXCBDkEeXCGdT1qpNL8bk for ip 169.254.189.7:
[ 04:14:47.6039PM | WARNING ] Failed to find interface name between 12D3KooWNXSHjU3E2FWfDvAV8A5YJj3AXCBDkEeXCGdT1qpNL8bk and 12D3KooWGhMR1nTVa5FQHstttsFycSWoSLzBgHfdnby5H6MGJ6Hh
[ 04:14:47.6039PM | INFO ] finding cycles:
[ 04:14:47.6041PM | INFO ] finding cycles:
[ 04:14:47.6042PM | INFO ] finding cycles:
[ 04:14:47.6043PM | INFO ] Searching 12D3KooWGhMR1nTVa5FQHstttsFycSWoSLzBgHfdnby5H6MGJ6Hh for ip 169.254.24.9:
[ 04:14:47.6044PM | INFO ] | en4: 169.254.24.9
[ 04:14:47.6044PM | INFO ] Found
[ 04:14:47.6045PM | INFO ] Interface name for 169.254.24.9 on 12D3KooWGhMR1nTVa5FQHstttsFycSWoSLzBgHfdnby5H6MGJ6Hh: rdma_en4
[ 04:14:47.6045PM | INFO ] Searching 12D3KooWNXSHjU3E2FWfDvAV8A5YJj3AXCBDkEeXCGdT1qpNL8bk for ip 192.168.1.38:
[ 04:14:47.6046PM | INFO ] Searching 12D3KooWNXSHjU3E2FWfDvAV8A5YJj3AXCBDkEeXCGdT1qpNL8bk for ip 169.254.189.7:
[ 04:14:47.6046PM | WARNING ] Failed to find interface name between 12D3KooWNXSHjU3E2FWfDvAV8A5YJj3AXCBDkEeXCGdT1qpNL8bk and 12D3KooWGhMR1nTVa5FQHstttsFycSWoSLzBgHfdnby5H6MGJ6Hh
[ 04:14:47.6047PM | INFO ] finding cycles:
[ 04:14:47.6048PM | INFO ] Searching 12D3KooWGhMR1nTVa5FQHstttsFycSWoSLzBgHfdnby5H6MGJ6Hh for ip 169.254.24.9:
[ 04:14:47.6048PM | INFO ] | en4: 169.254.24.9
[ 04:14:47.6049PM | INFO ] Found
[ 04:14:47.6049PM | INFO ] Interface name for 169.254.24.9 on 12D3KooWGhMR1nTVa5FQHstttsFycSWoSLzBgHfdnby5H6MGJ6Hh: rdma_en4
[ 04:14:47.6050PM | INFO ] Searching 12D3KooWNXSHjU3E2FWfDvAV8A5YJj3AXCBDkEeXCGdT1qpNL8bk for ip 192.168.1.38:
[ 04:14:47.6050PM | INFO ] Searching 12D3KooWNXSHjU3E2FWfDvAV8A5YJj3AXCBDkEeXCGdT1qpNL8bk for ip 169.254.189.7:
[ 04:14:47.6050PM | WARNING ] Failed to find interface name between 12D3KooWNXSHjU3E2FWfDvAV8A5YJj3AXCBDkEeXCGdT1qpNL8bk and 12D3KooWGhMR1nTVa5FQHstttsFycSWoSLzBgHfdnby5H6MGJ6Hh
[ 04:15:02.6044PM | INFO ] finding cycles:
[ 04:15:02.6049PM | INFO ] finding cycles:
[ 04:15:02.6050PM | INFO ] finding cycles:
[ 04:15:02.6052PM | INFO ] Searching 12D3KooWGhMR1nTVa5FQHstttsFycSWoSLzBgHfdnby5H6MGJ6Hh for ip 169.254.24.9:
[ 04:15:02.6052PM | INFO ] | en4: 169.254.24.9
[ 04:15:02.6053PM | INFO ] Found
[ 04:15:02.6053PM | INFO ] Interface name for 169.254.24.9 on 12D3KooWGhMR1nTVa5FQHstttsFycSWoSLzBgHfdnby5H6MGJ6Hh: rdma_en4
[ 04:15:02.6053PM | INFO ] Searching 12D3KooWNXSHjU3E2FWfDvAV8A5YJj3AXCBDkEeXCGdT1qpNL8bk for ip 192.168.1.38:
[ 04:15:02.6054PM | INFO ] Searching 12D3KooWNXSHjU3E2FWfDvAV8A5YJj3AXCBDkEeXCGdT1qpNL8bk for ip 169.254.189.7:
[ 04:15:02.6054PM | WARNING ] Failed to find interface name between 12D3KooWNXSHjU3E2FWfDvAV8A5YJj3AXCBDkEeXCGdT1qpNL8bk and 12D3KooWGhMR1nTVa5FQHstttsFycSWoSLzBgHfdnby5H6MGJ6Hh
[ 04:15:02.6055PM | INFO ] finding cycles:
[ 04:15:02.6056PM | INFO ] Searching 12D3KooWGhMR1nTVa5FQHstttsFycSWoSLzBgHfdnby5H6MGJ6Hh for ip 169.254.24.9:
[ 04:15:02.6056PM | INFO ] | en4: 169.254.24.9
[ 04:15:02.6057PM | INFO ] Found
[ 04:15:02.6057PM | INFO ] Interface name for 169.254.24.9 on 12D3KooWGhMR1nTVa5FQHstttsFycSWoSLzBgHfdnby5H6MGJ6Hh: rdma_en4
[ 04:15:02.6058PM | INFO ] Searching 12D3KooWNXSHjU3E2FWfDvAV8A5YJj3AXCBDkEeXCGdT1qpNL8bk for ip 192.168.1.38:
[ 04:15:02.6058PM | INFO ] Searching 12D3KooWNXSHjU3E2FWfDvAV8A5YJj3AXCBDkEeXCGdT1qpNL8bk for ip 169.254.189.7:
[ 04:15:02.6059PM | WARNING ] Failed to find interface name between 12D3KooWNXSHjU3E2FWfDvAV8A5YJj3AXCBDkEeXCGdT1qpNL8bk and 12D3KooWGhMR1nTVa5FQHstttsFycSWoSLzBgHfdnby5H6MGJ6Hh
[ 04:15:02.6059PM | INFO ] finding cycles:
[ 04:15:02.6061PM | INFO ] finding cycles:
[ 04:15:02.6063PM | INFO ] finding cycles:
[ 04:15:02.6064PM | INFO ] Searching 12D3KooWGhMR1nTVa5FQHstttsFycSWoSLzBgHfdnby5H6MGJ6Hh for ip 169.254.24.9:
[ 04:15:02.6064PM | INFO ] | en4: 169.254.24.9
[ 04:15:02.6065PM | INFO ] Found
[ 04:15:02.6065PM | INFO ] Interface name for 169.254.24.9 on 12D3KooWGhMR1nTVa5FQHstttsFycSWoSLzBgHfdnby5H6MGJ6Hh: rdma_en4
[ 04:15:02.6066PM | INFO ] Searching 12D3KooWNXSHjU3E2FWfDvAV8A5YJj3AXCBDkEeXCGdT1qpNL8bk for ip 192.168.1.38:
[ 04:15:02.6066PM | INFO ] Searching 12D3KooWNXSHjU3E2FWfDvAV8A5YJj3AXCBDkEeXCGdT1qpNL8bk for ip 169.254.189.7:
[ 04:15:02.6067PM | WARNING ] Failed to find interface name between 12D3KooWNXSHjU3E2FWfDvAV8A5YJj3AXCBDkEeXCGdT1qpNL8bk and 12D3KooWGhMR1nTVa5FQHstttsFycSWoSLzBgHfdnby5H6MGJ6Hh
I am facing the exact same issue.
Ok. Thank you for the reminder
I have now solved it. Interconnection can be achieved through RDMA, but why does the project always show "Failed" as soon as it runs?
How did you solve it?
Use EXO_latest.dmg
It helped discover the Thunderbolt bridge
Thanks. I figured something out: 2 nodes work fine for me; anything above 2 fails.
Also, I am not able to use RDMA for 2 or more nodes. RDMA selection works fine only when selecting a single node.
@VoidMore can I confirm this works for you now running the macOS app (https://assets.exolabs.net/EXO-latest.dmg)?
@mkamranr are you running from source or with the macOS app?
@imbible are you running from source or with the macOS app?
From the source. I don't want to run the closed-source app.
There were also instances where multiple nodes were continuously initialized during runtime.
It runs now after re-downloading the latest project, but when downloading larger models, the download suddenly interrupts, and when it restarts, the previously downloaded portion is lost.
I'm also facing this issue in a two-node cluster; I'm unable to use RDMA. I have both Wi-Fi and Ethernet enabled on the machines, and I've run the same commands on each machine, found here: https://github.com/exo-explore/configs/blob/main/scripts/exo-config-ip.sh
In the logs on the "master" node I'm seeing this:
[ 09:09:40.0077PM | INFO ] Done emitting existing download progress.
[ 09:09:41.8809PM | INFO ] finding cycles:
[ 09:09:41.8813PM | INFO ] finding cycles:
[ 09:09:41.8814PM | INFO ] finding cycles:
[ 09:09:41.8815PM | INFO ] finding cycles:
[ 09:09:42.4463PM | INFO ] RUST: other event Dialing { peer_id: Some(PeerId("12D3KooWS7dbRC2vae2QPovcCHQgVepxJEWAvSRLELjrNwKny7sF")), connection_id: ConnectionId(1) }
[ 09:09:42.4465PM | INFO ] RUST: other event NewExternalAddrOfPeer { peer_id: PeerId("12D3KooWS7dbRC2vae2QPovcCHQgVepxJEWAvSRLELjrNwKny7sF"), address: /ip4/169.254.119.208/tcp/57259/p2p/12D3KooWS7dbRC2vae2QPovcCHQgVepxJEWAvSRLELjrNwKny7sF }
[ 09:09:42.4470PM | INFO ] RUST: other event NewExternalAddrOfPeer { peer_id: PeerId("12D3KooWS7dbRC2vae2QPovcCHQgVepxJEWAvSRLELjrNwKny7sF"), address: /ip4/192.168.1.21/tcp/57259/p2p/12D3KooWS7dbRC2vae2QPovcCHQgVepxJEWAvSRLELjrNwKny7sF }
[ 09:09:42.4473PM | INFO ] RUST: other event NewExternalAddrOfPeer { peer_id: PeerId("12D3KooWS7dbRC2vae2QPovcCHQgVepxJEWAvSRLELjrNwKny7sF"), address: /ip4/169.254.42.26/tcp/57259/p2p/12D3KooWS7dbRC2vae2QPovcCHQgVepxJEWAvSRLELjrNwKny7sF }
[ 09:09:42.4496PM | INFO ] RUST: other event NewExternalAddrOfPeer { peer_id: PeerId("12D3KooWS7dbRC2vae2QPovcCHQgVepxJEWAvSRLELjrNwKny7sF"), address: /ip4/192.168.1.231/tcp/57259/p2p/12D3KooWS7dbRC2vae2QPovcCHQgVepxJEWAvSRLELjrNwKny7sF }
[ 09:09:42.4514PM | INFO ] RUST: other event OutgoingConnectionError { connection_id: ConnectionId(1), peer_id: Some(PeerId("12D3KooWS7dbRC2vae2QPovcCHQgVepxJEWAvSRLELjrNwKny7sF")), error: Transport([(/ip4/169.254.119.208/tcp/57259/p2p/12D3KooWS7dbRC2vae2QPovcCHQgVepxJEWAvSRLELjrNwKny7sF, Other(Custom { kind: Other, error: Other(Right(Custom { kind: Other, error: Left(Right(Apply(Io(Custom { kind: InvalidData, error: Input })))) })) }))]) }
[ 09:09:42.5120PM | INFO ] RUST: other event IncomingConnection { connection_id: ConnectionId(5), local_addr: /ip4/192.168.1.141/tcp/58082, send_back_addr: /ip4/192.168.1.231/tcp/57259 }
[ 09:09:42.5269PM | INFO ] RUST: other event ConnectionEstablished { peer_id: PeerId("12D3KooWS7dbRC2vae2QPovcCHQgVepxJEWAvSRLELjrNwKny7sF"), connection_id: ConnectionId(5), endpoint: Listener { local_addr: /ip4/192.168.1.141/tcp/58082, send_back_addr: /ip4/192.168.1.231/tcp/57259 }, num_established: 1, concurrent_dial_errors: None, established_in: 14.929291ms }
[ 09:09:42.5375PM | INFO ] RUST: other event Behaviour(BehaviourEvent: Subscribed { peer_id: PeerId("12D3KooWS7dbRC2vae2QPovcCHQgVepxJEWAvSRLELjrNwKny7sF"), topic: TopicHash { hash: "commands" } })
[ 09:09:42.5377PM | INFO ] RUST: other event Behaviour(BehaviourEvent: Subscribed { peer_id: PeerId("12D3KooWS7dbRC2vae2QPovcCHQgVepxJEWAvSRLELjrNwKny7sF"), topic: TopicHash { hash: "local_events" } })
[ 09:09:42.5378PM | INFO ] RUST: other event Behaviour(BehaviourEvent: Subscribed { peer_id: PeerId("12D3KooWS7dbRC2vae2QPovcCHQgVepxJEWAvSRLELjrNwKny7sF"), topic: TopicHash { hash: "global_events" } })
[ 09:09:42.5379PM | INFO ] RUST: other event Behaviour(BehaviourEvent: Subscribed { peer_id: PeerId("12D3KooWS7dbRC2vae2QPovcCHQgVepxJEWAvSRLELjrNwKny7sF"), topic: TopicHash { hash: "connection_messages" } })
[ 09:09:42.5380PM | INFO ] RUST: other event Behaviour(BehaviourEvent: Subscribed { peer_id: PeerId("12D3KooWS7dbRC2vae2QPovcCHQgVepxJEWAvSRLELjrNwKny7sF"), topic: TopicHash { hash: "election_messages" } })
[ 09:09:42.7293PM | INFO ] Waiting for other campaign to finish
[ 09:09:45.0025PM | INFO ] finding cycles:
[ 09:09:45.0029PM | WARNING ] You have likely selected ibv for a single node instance; falling back to MlxRing
...
Then it just starts to spam this:
[ 09:10:00.2727PM | INFO ] finding cycles:
[ 09:10:00.2729PM | INFO ] Searching 12D3KooWDsSpgnHTstAfsiWduL7rjFXtgzvKq7pWCRcPur9szoBY for ip 192.168.1.141:
[ 09:10:00.2729PM | INFO ] Searching 12D3KooWDsSpgnHTstAfsiWduL7rjFXtgzvKq7pWCRcPur9szoBY for ip 169.254.39.207:
[ 09:10:00.2730PM | INFO ] Searching 12D3KooWDsSpgnHTstAfsiWduL7rjFXtgzvKq7pWCRcPur9szoBY for ip 192.168.1.164:
[ 09:10:00.2731PM | INFO ] Searching 12D3KooWDsSpgnHTstAfsiWduL7rjFXtgzvKq7pWCRcPur9szoBY for ip 169.254.7.28:
[ 09:10:00.2731PM | INFO ] Searching 12D3KooWDsSpgnHTstAfsiWduL7rjFXtgzvKq7pWCRcPur9szoBY for ip 192.168.1.141:
[ 09:10:00.2732PM | WARNING ] Failed to find interface name between 12D3KooWDsSpgnHTstAfsiWduL7rjFXtgzvKq7pWCRcPur9szoBY and 12D3KooWS7dbRC2vae2QPovcCHQgVepxJEWAvSRLELjrNwKny7sF
[ 09:10:00.2732PM | INFO ] finding cycles:
I don't think it matters for RDMA, but jumbo frames are enabled on the network.
I am able to increase the minimum nodes to 2 when MLX Ring is selected.
Here is the output of ifconfig on the "master" host:
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
options=1203<RXCSUM,TXCSUM,TXSTATUS,SW_TIMESTAMP>
inet 127.0.0.1 netmask 0xff000000
inet6 ::1 prefixlen 128
inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
nd6 options=201<PERFORMNUD,DAD>
gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280
stf0: flags=0<> mtu 1280
anpi4: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=400<CHANNEL_IO>
ether 96:eb:7e:73:2f:05
media: none
status: inactive
anpi3: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=400<CHANNEL_IO>
ether 96:eb:7e:73:2f:04
nd6 options=201<PERFORMNUD,DAD>
media: 100baseTX <full-duplex>
status: inactive
anpi1: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=400<CHANNEL_IO>
ether 96:eb:7e:73:2f:02
media: none
status: inactive
anpi5: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=400<CHANNEL_IO>
ether 96:eb:7e:73:2f:06
media: none
status: inactive
anpi2: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=400<CHANNEL_IO>
ether 96:eb:7e:73:2f:03
media: none
status: inactive
anpi0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=400<CHANNEL_IO>
ether 96:eb:7e:73:2f:01
media: none
status: inactive
en8: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=400<CHANNEL_IO>
ether 96:eb:7e:73:2f:e1
nd6 options=201<PERFORMNUD,DAD>
media: none
status: inactive
en9: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=400<CHANNEL_IO>
ether 96:eb:7e:73:2f:e2
nd6 options=201<PERFORMNUD,DAD>
media: none
status: inactive
en10: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=400<CHANNEL_IO>
ether 96:eb:7e:73:2f:e3
nd6 options=201<PERFORMNUD,DAD>
media: none
status: inactive
en11: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=400<CHANNEL_IO>
ether 96:eb:7e:73:2f:e4
nd6 options=201<PERFORMNUD,DAD>
media: 100baseTX <full-duplex>
status: inactive
en12: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=400<CHANNEL_IO>
ether 96:eb:7e:73:2f:e5
nd6 options=201<PERFORMNUD,DAD>
media: none
status: inactive
en14: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=400<CHANNEL_IO>
ether 96:eb:7e:73:2f:e6
nd6 options=201<PERFORMNUD,DAD>
media: none
status: inactive
en2: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
options=460<TSO4,TSO6,CHANNEL_IO>
ether 36:2e:c6:ce:18:80
media: autoselect <full-duplex>
status: inactive
en3: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
options=460<TSO4,TSO6,CHANNEL_IO>
ether 36:2e:c6:ce:18:84
media: autoselect <full-duplex>
status: inactive
en4: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
options=460<TSO4,TSO6,CHANNEL_IO>
ether 36:2e:c6:ce:18:88
media: autoselect <full-duplex>
status: inactive
en5: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
options=460<TSO4,TSO6,CHANNEL_IO>
ether 36:2e:c6:ce:18:8c
media: autoselect <full-duplex>
status: active
en6: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
options=460<TSO4,TSO6,CHANNEL_IO>
ether 36:2e:c6:ce:18:90
media: autoselect <full-duplex>
status: inactive
en7: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
options=460<TSO4,TSO6,CHANNEL_IO>
ether 36:2e:c6:ce:18:94
media: autoselect <full-duplex>
status: inactive
en0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=567<RXCSUM,TXCSUM,VLAN_MTU,TSO4,TSO6,AV,CHANNEL_IO>
ether 1c:1d:d3:de:9f:c1
inet6 fe80::18e9:b750:f03c:192c%en0 prefixlen 64 secured scopeid 0x18
inet 192.168.1.164 netmask 0xffffff00 broadcast 192.168.1.255
nd6 options=201<PERFORMNUD,DAD>
media: autoselect (1000baseT <full-duplex,flow-control>)
status: active
bridge0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=63<RXCSUM,TXCSUM,TSO4,TSO6>
ether 36:2e:c6:ce:18:80
inet6 fe80::1c74:2a67:3609:bd71%bridge0 prefixlen 64 secured scopeid 0x19
inet 169.254.39.207 netmask 0xffff0000 broadcast 169.254.255.255
Configuration:
id 0:0:0:0:0:0 priority 0 hellotime 0 fwddelay 0
maxage 0 holdcnt 0 proto stp maxaddr 100 timeout 1200
root id 0:0:0:0:0:0 priority 0 ifcost 0 port 0
ipfilter disabled flags 0x0
member: en2 flags=3<LEARNING,DISCOVER>
ifmaxaddr 0 port 16 priority 0 path cost 0
member: en3 flags=3<LEARNING,DISCOVER>
ifmaxaddr 0 port 17 priority 0 path cost 0
member: en4 flags=3<LEARNING,DISCOVER>
ifmaxaddr 0 port 18 priority 0 path cost 0
member: en5 flags=3<LEARNING,DISCOVER>
ifmaxaddr 0 port 19 priority 0 path cost 0
member: en6 flags=3<LEARNING,DISCOVER>
ifmaxaddr 0 port 20 priority 0 path cost 0
member: en7 flags=3<LEARNING,DISCOVER>
ifmaxaddr 0 port 21 priority 0 path cost 0
nd6 options=201<PERFORMNUD,DAD>
media: autoselect
status: active
ap1: flags=8822<BROADCAST,SMART,SIMPLEX,MULTICAST> mtu 1500
options=400<CHANNEL_IO>
ether d2:ce:b6:25:7d:d3
nd6 options=201<PERFORMNUD,DAD>
media: autoselect (none)
en1: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=6460<TSO4,TSO6,CHANNEL_IO,PARTIAL_CSUM,ZEROINVERT_CSUM>
ether ea:91:6d:0c:57:60
inet6 fe80::8b4:576e:3c41:363a%en1 prefixlen 64 secured scopeid 0x17
inet 192.168.1.141 netmask 0xffffff00 broadcast 192.168.1.255
nd6 options=201<PERFORMNUD,DAD>
media: autoselect
status: active
awdl0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=6460<TSO4,TSO6,CHANNEL_IO,PARTIAL_CSUM,ZEROINVERT_CSUM>
ether 3a:99:46:05:02:ae
inet6 fe80::3899:46ff:fe05:2ae%awdl0 prefixlen 64 scopeid 0x1a
nd6 options=201<PERFORMNUD,DAD>
media: autoselect
status: active
llw0: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=400<CHANNEL_IO>
ether 3a:99:46:05:02:ae
inet6 fe80::3899:46ff:fe05:2ae%llw0 prefixlen 64 scopeid 0x1b
nd6 options=201<PERFORMNUD,DAD>
media: autoselect (none)
utun0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1380
inet6 fe80::f722:20de:ad91:1dfe%utun0 prefixlen 64 scopeid 0x1c
nd6 options=201<PERFORMNUD,DAD>
utun1: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 2000
inet6 fe80::c54d:330e:41ac:7b0b%utun1 prefixlen 64 scopeid 0x1d
nd6 options=201<PERFORMNUD,DAD>
utun2: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1000
inet6 fe80::ce81:b1c:bd2c:69e%utun2 prefixlen 64 scopeid 0x1e
nd6 options=201<PERFORMNUD,DAD>
utun3: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1500
inet6 fe80::2ee2:cffd:c331:5818%utun3 prefixlen 64 scopeid 0x1f
nd6 options=201<PERFORMNUD,DAD>
en16: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=404<VLAN_MTU,CHANNEL_IO>
ether 26:d6:ea:a0:03:ce
inet6 fe80::24d6:eaff:fea0:3ce%en16 prefixlen 64 scopeid 0x20
nd6 options=201<PERFORMNUD,DAD>
media: autoselect
status: active
en15: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=404<VLAN_MTU,CHANNEL_IO>
ether 26:d6:ea:a0:03:ee
inet6 fe80::18c7:d5d3:3c9a:455b%en15 prefixlen 64 secured scopeid 0x21
inet 169.254.7.28 netmask 0xffff0000 broadcast 169.254.255.255
nd6 options=201<PERFORMNUD,DAD>
media: autoselect
status: active
From the logs, it does seem like the process might be seeing the wrong network for RDMA, or at least not selecting the correct one.
I have done some digging around and I think I found the issue, at least for my system: https://github.com/exo-explore/exo/blob/18c4e49f913bc29665f1393c00a0fe200499e091/src/exo/master/placement_utils.py#L262
This line, if interface.name not in ["en2", "en3", "en4", "en5", "en6", "en7"]:, restricts the interfaces that can be used for RDMA; in my case the RDMA interface was above en7.
Is there any reason we could not use a more robust solution here for finding the RDMA interface, instead of hard-coding a list of interface names that may not be correct for all hardware?
I'm waiting on a model to download to validate that I can load a model with a patch that includes all the interfaces on my machines.
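For illustration, here is a minimal sketch of what a less hardware-specific discovery could look like: instead of a fixed name list, select interfaces that currently hold a link-local (169.254.0.0/16) IPv4 address, which is what macOS assigns to Thunderbolt bridge members. This is not exo's implementation; it assumes the third-party psutil package, and find_candidate_rdma_interfaces is a hypothetical helper name.

```python
# Hypothetical sketch, NOT exo's implementation: pick RDMA candidates by
# their link-local IPv4 address rather than a hard-coded en2-en7 name list.
import ipaddress
import socket

import psutil  # third-party: pip install psutil

LINK_LOCAL = ipaddress.ip_network("169.254.0.0/16")

def find_candidate_rdma_interfaces() -> list[str]:
    """Return interface names that hold a link-local IPv4 address.

    On macOS, Thunderbolt bridge members get 169.254.x.x addresses, so this
    also catches ports above en7 that a fixed list misses. Caveat: any
    interface that failed DHCP self-assigns a 169.254.x.x address too, so
    results still need filtering (e.g. by peer reachability).
    """
    candidates = []
    for name, addrs in psutil.net_if_addrs().items():
        for addr in addrs:
            if addr.family == socket.AF_INET and ipaddress.ip_address(addr.address) in LINK_LOCAL:
                candidates.append(name)
                break
    return candidates

if __name__ == "__main__":
    print(find_candidate_rdma_interfaces())
```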
So this turned out not to be the issue, but tracking things down to there really helped me figure out my problem. After installing the app, I was able to see that while RDMA showed "enabled" on one node, it was not able to find one of the en2-en7 ports, which are the ports macOS actually runs the RDMA networking on. It came down to looking at the actual machines: they were not plugged into the same Thunderbolt ports. Node A was in the furthest port on the right; Node B was in the one neighboring it, going towards the Ethernet port. Swap to the same ports and everything works. This leads me to think it was an issue with nodeInfo reporting is_thunderbolt(self) -> bool = True because a different interface made the condition str(self.send_back_multiaddr.ipv4_address).startswith("169.254") true.
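To make that failure mode concrete, here is a small illustrative reconstruction (not the actual exo source) of why a startswith("169.254") check can misfire: every interface that self-assigned a link-local address satisfies it, not just the Thunderbolt bridge.

```python
# Illustrative reconstruction of the heuristic quoted above, not exo's code:
# "peer address is 169.254.x.x" is taken to mean "this is a Thunderbolt link".
import ipaddress

LINK_LOCAL = ipaddress.ip_network("169.254.0.0/16")

def looks_like_thunderbolt(peer_ipv4: str) -> bool:
    # macOS Thunderbolt bridge members use the IPv4 link-local range, but so
    # does any interface that failed DHCP (e.g. an idle Ethernet port), so
    # this check alone can flag the wrong interface as the Thunderbolt link.
    return ipaddress.ip_address(peer_ipv4) in LINK_LOCAL

# Both return True, but only the first is the Thunderbolt bridge in the
# ifconfig dump earlier in this thread (bridge0 vs. en15):
print(looks_like_thunderbolt("169.254.39.207"))  # bridge0
print(looks_like_thunderbolt("169.254.7.28"))    # en15
```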
I’m running exo on two Mac Studios (macOS Tahoe 26.2, Thunderbolt 5, RDMA enabled).
I noticed that:
Pipeline and MLX Ring modes allow selecting 2 nodes and work as expected.
But when I select Tensor or Tensor + MLX RDMA, the UI only allows 1 node (minimum nodes is locked to 1).
Both machines can run exo individually, models are synced, and RDMA is enabled via rdma_ctl enable.
Is this a current limitation of exo’s Tensor/RDMA implementation, or is there something missing in my setup? Has anyone been able to use Tensor + RDMA with multiple nodes?
I was seeing something similar. A couple of troubleshooting steps that led me to getting it working (see the sketch after this list):
- Look at the output of ifconfig on both machines and verify that there is an entry like the one below for at least one of the interfaces en2-en7; this is important because of this line: https://github.com/exo-explore/exo/blob/18c4e49f913bc29665f1393c00a0fe200499e091/src/exo/master/placement_utils.py#L262
en4: flags=8863<UP,BROADCAST,SMART,RUNNING,SIMPLEX,MULTICAST> mtu 1500
options=460<TSO4,TSO6,CHANNEL_IO>
ether 36:2e:c6:ce:18:88
inet6 fe80::6b:b4f5:2d7c:d2d%en4 prefixlen 64 secured scopeid 0x12
inet 169.254.42.15 netmask 0xffff0000 broadcast 169.254.255.255
nd6 options=201<PERFORMNUD,DAD>
media: autoselect <full-duplex>
status: active
- If it's the same interface listed on both machines like that, swap the Thunderbolt cable to a different port on one of them (not the one next to the Ethernet port, though).
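As a convenience, here is a small stdlib-only sketch of that first check; run it on each machine and compare which enN carries the link-local address. It parses ifconfig output directly and is a hypothetical diagnostic, not part of exo.

```python
# Hypothetical diagnostic, stdlib only: list en* interfaces that currently
# hold a link-local (169.254.x.x) IPv4 address, i.e. Thunderbolt bridge
# candidates. Run on both machines and check they report the same port.
import re
import subprocess

out = subprocess.run(["ifconfig"], capture_output=True, text=True, check=True).stdout

current = None
for line in out.splitlines():
    if not line.startswith((" ", "\t")):  # unindented lines start a new interface
        m = re.match(r"^(en\d+):", line)
        current = m.group(1) if m else None  # track only en* interfaces
        continue
    inet = re.search(r"inet (169\.254\.\d+\.\d+)", line)
    if inet and current:
        print(f"{current}: {inet.group(1)}")
```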
As I mentioned earlier in this thread, I'm seeing the exact same issue as @aaronysl, and unfortunately the workaround you suggested doesn’t seem to resolve it on my side. From what I understand, RDMA typically doesn’t rely on any of the en* interfaces directly, so I’m not sure the interface matching or port swapping is the ultimate solution.
My suspicion is that the macOS app may be doing some additional coordination between nodes behind the scenes, which could be essential for getting Tensor + RDMA modes working properly. If that’s the case, the port change you observed might just be a secondary effect rather than the core requirement.
Since I didn't install the macOS app (I prefer not to rely on closed‑source secret sauce unless they open source it), I may be missing whatever setup or bridging logic it performs.
The latest app (https://exolabs.net/) now supports 2-node operation with both Tensor and RDMA. Note: on Mac Studio, not every Thunderbolt 5 port enables RDMA; try each one. I've validated the newest build; feel free to test it yourself. @imbible
Thanks for verifying it and informing me. That said, it’s concerning that an open source project appears to rely on a closed‑source app to unlock full functionality, especially when the app isn’t documented anywhere as a prerequisite for multi‑node Tensor/RDMA support. This makes the whole project difficult to trust.
The app is open source over in the app/ directory! It's not a prerequisite, and the Python server should continue to work just fine. The app eliminates a lot of environment issues, but it is just a shim around the Python server.