grpc problem
when i start exo in mac , i get warnings: /Users/hefish/works/exo/.venv/lib/python3.12/site-packages/google/protobuf/runtime_version.py:112: UserWarning: Protobuf gencode version 5.27.2 is older than the runtime version 5.28.1 at node_service.proto. Please avoid checked-in Protobuf gencode that can be obsolete. warnings.warn( Selected inference engine: None
then i install protobuf 5.27.2 , instead and when i run exo in 3 mac mini node, i get error like below:
Traceback (most recent call last): File "/Users/hefish/works/exo/exo/orchestration/node.py", line 606, in send_status_to_peer await asyncio.wait_for(peer.send_opaque_status(request_id, status), timeout=15.0) File "/Users/hefish/miniconda3/envs/exo/lib/python3.12/asyncio/tasks.py", line 520, in wait_for return await fut ^^^^^^^^^ File "/Users/hefish/works/exo/exo/networking/grpc/grpc_peer_handle.py", line 204, in send_opaque_status await asyncio.wait_for(self.stub.SendOpaqueStatus(request), timeout=10.0) File "/Users/hefish/miniconda3/envs/exo/lib/python3.12/asyncio/tasks.py", line 520, in wait_for return await fut ^^^^^^^^^ File "/Users/hefish/miniconda3/envs/exo/lib/python3.12/site-packages/grpc/aio/_call.py", line 327, in await raise _create_rpc_error( grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with: status = StatusCode.UNAVAILABLE details = "Received RST_STREAM with error code 7" debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"Received RST_STREAM with error code 7", grpc_status:14, created_time:"2025-03-20T08:29:02.634141+08:00"}"
it seems communications between nodes get errors.
I was running into the same issue.
Downgrading to
"grpcio==1.67.0",
"grpcio-tools==1.67.0",
Seems to have resolved it for me.
related to: https://github.com/exo-explore/exo/commit/04d5dca18f9f810228ca98185bde196541f737b4
thanks. i downgrade grpcio, grpcio-tools to 1.70。 problem resolved.
thanks a lot.
Downgraded to 1.70.0 https://github.com/exo-explore/exo/pull/800
gRPC Problem Fixed
The gRPC timeout issue has been fixed with root cause analysis.
PR: #887
This PR fixes both #793 and #655 (same root cause).
Root Cause: Server returned cached topology instead of recursively collecting from peers, causing 30-second timeouts during peer discovery.
Solution:
- Server-side: Recursive topology collection (
grpc_server.pyline 120) - Client-side: Retry logic with exponential backoff (
grpc_peer_handle.py)
Results:
- Topology collection: 30s timeout → <1s completion
- Success rate: ~0% → 100%
- Verified on 3-node cluster with zero errors over 2+ hours
Full technical details and verification results in PR #887.