Can someone help this noob (2 node cluster unresponsive)

Open NotReallyADeveloper opened this issue 9 months ago • 6 comments

I am trying to cluster a Mac Mini pro and a Mac Mini Studio Pro Max, connected via an Apple Thunderbolt 5 cable.

Connectivity is fine, and I can use Llama 3.2 3B without issues on a single computer. But as soon as I add a second node, the LLM stops responding completely.

MLX is installed and configured on both.

What am I missing here?

NotReallyADeveloper avatar Mar 17 '25 15:03 NotReallyADeveloper

I have the same question. How can this be solved?

Sunchy389 avatar Mar 19 '25 23:03 Sunchy389

I have the same problem. The logs show a gRPC error, so communication between the nodes may be failing.

Traceback (most recent call last):
  File "/Users/hefish/works/exo/exo/orchestration/node.py", line 606, in send_status_to_peer
    await asyncio.wait_for(peer.send_opaque_status(request_id, status), timeout=15.0)
  File "/Users/hefish/miniconda3/envs/exo/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/Users/hefish/works/exo/exo/networking/grpc/grpc_peer_handle.py", line 204, in send_opaque_status
    await asyncio.wait_for(self.stub.SendOpaqueStatus(request), timeout=10.0)
  File "/Users/hefish/miniconda3/envs/exo/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/Users/hefish/miniconda3/envs/exo/lib/python3.12/site-packages/grpc/aio/_call.py", line 327, in __await__
    raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "Received RST_STREAM with error code 7"
    debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"Received RST_STREAM with error code 7", grpc_status:14, created_time:"2025-03-20T08:29:02.634141+08:00"}"

hefish avatar Mar 20 '25 00:03 hefish

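I found that downgrading the grpc and grpc-tools modules to 1.70 solves it. Looking at the latest grpc release, it was updated to 1.71 on March 11, 2025, which fits my recollection that everything still worked in February. It looks like the grpc component upgrade caused this.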
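
To see whether a node is affected, here is a minimal sketch that reports the installed versions of the two packages and flags anything at or above 1.71. The package names `grpcio` and `grpcio-tools` are the PyPI names quoted elsewhere in this thread; the helper names and the 1.71 cutoff are my own assumptions based on the observation above, not anything exo itself provides.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Assumption from this thread: 1.71 is the first release where the problem showed up.
FIRST_SUSPECT = (1, 71)


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def is_suspect(version_string, first_suspect=FIRST_SUSPECT):
    """Return True if major.minor is at or above the first suspect release."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"cannot parse version: {version_string!r}")
    major_minor = (int(match.group(1)), int(match.group(2)))
    return major_minor >= first_suspect


if __name__ == "__main__":
    for package in ("grpcio", "grpcio-tools"):
        found = installed_version(package)
        if found is None:
            print(f"{package}: not installed")
        elif is_suspect(found):
            print(f"{package}: {found} (1.71 or newer, possibly affected)")
        else:
            print(f"{package}: {found} (older than 1.71)")
```

Run it inside the same Python environment that exo uses, on every node, since the versions may differ between machines.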

hefish avatar Mar 20 '25 01:03 hefish

Is this the line to use for the downgrade? "grpcio==1.67.0", "grpcio-tools==1.67.0",

Sunchy389 avatar Mar 22 '25 01:03 Sunchy389

I tried downgrading to 1.67.0, but the issue persists.

cropse avatar Apr 13 '25 03:04 cropse

Same issue. For me, the machine I am sitting at shows two nodes in the exo UI, whereas the one I am SSHed into shows only one node. So perhaps this is some kind of partial (one-way) connection issue.
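
One way to test that guess is a plain TCP reachability check, run once from each machine toward the other. This is only a sketch: `can_connect` is a hypothetical helper, and the host and port in the example are placeholders, not values taken from exo. Substitute the peer's address and whatever port your node actually listens on.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder values: replace with the other node's address and port.
    peer_host = "192.168.1.20"
    peer_port = 50051
    reachable = can_connect(peer_host, peer_port)
    print(f"{peer_host}:{peer_port} reachable: {reachable}")
```

If the check succeeds in one direction and fails in the other, that would be consistent with one machine seeing two nodes while the other sees only one.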

nynj avatar May 26 '25 00:05 nynj

Similar issue https://github.com/exo-explore/exo/issues/603

fabiooshiro avatar Aug 02 '25 22:08 fabiooshiro