
grpc problem

Open hefish opened this issue 9 months ago • 4 comments

When I start exo on a Mac, I get this warning:

/Users/hefish/works/exo/.venv/lib/python3.12/site-packages/google/protobuf/runtime_version.py:112: UserWarning: Protobuf gencode version 5.27.2 is older than the runtime version 5.28.1 at node_service.proto. Please avoid checked-in Protobuf gencode that can be obsolete.
  warnings.warn(
Selected inference engine: None

I then installed protobuf 5.27.2 instead. When I run exo on three Mac mini nodes, I get the error below:

Traceback (most recent call last):
  File "/Users/hefish/works/exo/exo/orchestration/node.py", line 606, in send_status_to_peer
    await asyncio.wait_for(peer.send_opaque_status(request_id, status), timeout=15.0)
  File "/Users/hefish/miniconda3/envs/exo/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/Users/hefish/works/exo/exo/networking/grpc/grpc_peer_handle.py", line 204, in send_opaque_status
    await asyncio.wait_for(self.stub.SendOpaqueStatus(request), timeout=10.0)
  File "/Users/hefish/miniconda3/envs/exo/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/Users/hefish/miniconda3/envs/exo/lib/python3.12/site-packages/grpc/aio/_call.py", line 327, in __await__
    raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "Received RST_STREAM with error code 7"
    debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"Received RST_STREAM with error code 7", grpc_status:14, created_time:"2025-03-20T08:29:02.634141+08:00"}"

It seems that communication between the nodes is failing.

hefish avatar Mar 20 '25 00:03 hefish

I was running into the same issue.

Downgrading to

  "grpcio==1.67.0",
  "grpcio-tools==1.67.0",

seems to have resolved it for me.
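For anyone wanting to confirm which versions actually ended up in the active environment after the downgrade, here is a minimal sketch using only the standard library. The `EXPECTED` pins simply repeat the versions quoted above, and `check_pins` is a hypothetical helper name, not part of exo.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins quoted in this comment; adjust them to whatever the project requires.
EXPECTED = {"grpcio": "1.67.0", "grpcio-tools": "1.67.0"}


def check_pins(expected):
    """Return {package: (installed_or_None, wanted)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            # Package is not installed in this environment at all.
            installed = None
        if installed != wanted:
            mismatches[name] = (installed, wanted)
    return mismatches


if __name__ == "__main__":
    for name, (installed, wanted) in check_pins(EXPECTED).items():
        print(f"{name}: installed={installed}, expected={wanted}")
```

Running it inside the same virtualenv that starts exo matters here, since the traceback above shows both a `.venv` and a conda environment on the same machine.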

Related to: https://github.com/exo-explore/exo/commit/04d5dca18f9f810228ca98185bde196541f737b4

zsimone10 avatar Mar 20 '25 00:03 zsimone10

Thanks. I downgraded grpcio and grpcio-tools to 1.70, and the problem is resolved.

Thanks a lot.

hefish avatar Mar 20 '25 01:03 hefish

Downgraded to 1.70.0: https://github.com/exo-explore/exo/pull/800

AlexCheema avatar Mar 21 '25 22:03 AlexCheema

gRPC Problem Fixed

The gRPC timeout issue has been fixed, and the root cause has been identified.

PR: #887

This PR fixes both #793 and #655 (same root cause).

Root Cause: The server returned a cached topology instead of recursively collecting it from peers, which caused 30-second timeouts during peer discovery.
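To illustrate what "recursively collecting from peers" means, here is a small in-memory sketch. It is not the code from the PR: the `Node` class, its `collect_topology` method, and the `max_depth` default are assumptions made for illustration, and in a real cluster each call to a peer would be an RPC rather than a local method call.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node that knows its direct peers."""
    node_id: str
    peers: list["Node"] = field(default_factory=list)

    def collect_topology(self, visited=None, max_depth=4):
        """Return {node_id: sorted peer ids} for nodes reachable within max_depth hops."""
        visited = set() if visited is None else visited
        visited.add(self.node_id)
        topology = {self.node_id: sorted(p.node_id for p in self.peers)}
        if max_depth <= 0:
            return topology
        for peer in self.peers:
            if peer.node_id in visited:
                continue  # already asked this node; avoids infinite loops
            topology.update(peer.collect_topology(visited, max_depth - 1))
        return topology


if __name__ == "__main__":
    a, b, c = Node("a"), Node("b"), Node("c")
    a.peers, b.peers, c.peers = [b], [a, c], [b]
    print(a.collect_topology())
```

The shared `visited` set is what keeps the recursion from bouncing between two nodes that list each other as peers, and `max_depth` bounds how far a single request can fan out.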

Solution:

  1. Server-side: Recursive topology collection (grpc_server.py line 120)
  2. Client-side: Retry logic with exponential backoff (grpc_peer_handle.py)
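The client-side half of the solution can be sketched roughly as follows. This is an illustration of retry with exponential backoff in general, not the code in grpc_peer_handle.py: `call_with_backoff` and all of its parameters and defaults are hypothetical. In gRPC code one would presumably pass `grpc.aio.AioRpcError` in `retry_on`; it is left out here so the sketch runs without gRPC installed.

```python
import asyncio
import random


async def call_with_backoff(
    make_call,
    *,
    attempts=4,
    base_delay=0.5,
    max_delay=8.0,
    timeout=10.0,
    retry_on=(asyncio.TimeoutError, ConnectionError),
):
    """Await make_call() up to `attempts` times, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            # make_call is a zero-argument function returning a fresh awaitable,
            # because a coroutine object cannot be awaited twice.
            return await asyncio.wait_for(make_call(), timeout=timeout)
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Small random jitter so several nodes do not retry in lockstep.
            await asyncio.sleep(delay + random.uniform(0, delay / 10))
```

A call site might look like `await call_with_backoff(lambda: stub.SendOpaqueStatus(request))`, where `stub` and `request` are whatever the caller already has. Capping the delay with `max_delay` keeps the total wait bounded even if `attempts` is raised.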

Results:

  • Topology collection: 30s timeout → <1s completion
  • Success rate: ~0% → 100%
  • Verified on 3-node cluster with zero errors over 2+ hours

Full technical details and verification results in PR #887.

palios-taey avatar Oct 20 '25 20:10 palios-taey