[BUG] Model does not load, nodes losing connectivity using Tensor/RDMA

Open mkamranr opened this issue 2 months ago • 7 comments

Describe the bug

When I launch any model, it fails to load. In the topology diagram, I can see the nodes losing their connections and then trying to re-establish them. After some time, one or more nodes start logging that no RDMA connection is available in the exo app.

To Reproduce

Steps to reproduce the behavior:

  1. Launch an instance of a model using Tensor/RDMA.
  2. The model fails to load.
  3. The nodes start losing topology connections.

Expected behavior

The model should load and display READY status.

Actual behavior

Model fails to load.

Environment

  • macOS Version: 26.2
  • EXO Version: 1.0.6.0 (EXO App)
  • Hardware:
    • Device 1: M3 Ultra 512GB
    • Device 2: M3 Ultra 512GB
    • Device 3: M3 Ultra 512GB
    • Device 4: M3 Ultra 512GB
    • Additional devices:
  • Interconnection:
    • TB5
    • Ethernet between all devices

Additional context

The previous version of the app was able to load the model but was inconsistent: nodes were losing connection. After updating to the latest version of the app, the models do not load at all.

mkamranr avatar Jan 13 '26 05:01 mkamranr

When you say nodes were losing connection, what do you mean exactly? Do you mean that the nodes are dropping out of the topology?

AlexCheema avatar Jan 13 '26 11:01 AlexCheema

The latest app (https://exolabs.net/) now supports 2-node operation with both Tensor and RDMA. Note: on Mac Studio, not every Thunderbolt 5 port enables RDMA; try each one. I’ve validated the newest build; feel free to test it yourself.

aaronysl avatar Jan 14 '26 01:01 aaronysl

When you say nodes were losing connection, what do you mean exactly? Do you mean that the nodes are dropping out of the topology?

Image

After a while, these links start dropping randomly.

mkamranr avatar Jan 14 '26 09:01 mkamranr

So I started an instance, and it worked fine. But after 10-15 minutes, the connections start dropping, as you can see in the attached images. Now if I try to start a new instance of any model, it does not start at all. Moreover, I can't select more than 2 nodes, as you can see in the attached picture.

Image

mkamranr avatar Jan 14 '26 09:01 mkamranr

Just a quick update: the exo app has been updated to version 1.0.62. Using this version, I am able to load the model, and it has been running for the last 20 hours, which is great. The only issue or glitch I am facing is the nodes losing connections. I enabled debug mode to see what's going on, and I can see the IPs go missing along with the connectivity lines, as you can see in the attached image. BUT it doesn't affect the running models: I am able to query the model and receive responses in that state.

So the question is: is it just an app front-end glitch?

Image

And here is the image when everything shows up and is working

Image

mkamranr avatar Jan 15 '26 03:01 mkamranr

That's quite strange, this is definitely a bug in EXO - we're looking into it

Evanev7 avatar Jan 15 '26 10:01 Evanev7

@mkamranr thanks for reporting the issue. Can you test on latest main and see if this is still an issue? This is a potential fix that got merged: https://github.com/exo-explore/exo/commit/3e623ccf0d18fb0e2d4262e3e57a6743ed0087be

AlexCheema avatar Jan 16 '26 00:01 AlexCheema
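
One way to help tell a display glitch from a real link drop is to log node reachability independently of the exo app while the topology view is misbehaving. The following is a minimal sketch, not part of exo: the node names, the addresses (taken from the 192.0.2.0/24 documentation range) and the port are hypothetical placeholders and must be replaced with real values. It only checks plain TCP reachability, so it says nothing about RDMA itself.

```python
import socket
import time
from datetime import datetime

# Hypothetical placeholders: replace with the real address of each node and a
# TCP port known to be listening on it (for example SSH, if Remote Login is on).
PEERS = {
    "node-1": ("192.0.2.11", 22),
    "node-2": ("192.0.2.12", 22),
    "node-3": ("192.0.2.13", 22),
}


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def poll_once(peers: dict) -> dict:
    """Probe every peer once and return a name -> reachable mapping."""
    return {name: is_reachable(host, port) for name, (host, port) in peers.items()}


def changed_peers(previous: dict, current: dict) -> list:
    """Return (name, reachable) pairs whose state differs from the last poll."""
    return [(name, up) for name, up in current.items() if previous.get(name) != up]


def monitor(peers: dict, interval: float = 5.0, rounds: int = 10) -> None:
    """Poll the peers `rounds` times and print a timestamped line on each change."""
    previous = {}
    for _ in range(rounds):
        current = poll_once(peers)
        for name, up in changed_peers(previous, current):
            stamp = datetime.now().isoformat(timespec="seconds")
            print(f"{stamp} {name} {'UP' if up else 'DOWN'}")
        previous = current
        time.sleep(interval)

# Example: monitor(PEERS, interval=5.0, rounds=720) logs changes for about an hour.
```

If this log shows no DOWN transitions at the moments the lines disappear from the diagram, the IP links are probably staying up and the problem is more likely in how the app tracks or renders the topology. If DOWN lines do coincide with the dropouts, the links themselves are worth investigating.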