Flooding GRPC errors when downloading models
When I try to download models, the downloads start, but I am flooded with errors like this:
et (en0))'], ai-mac-5: ['ai-mac-4(Ethernet (en0))', 'ai-mac-3(Ethernet (en0))'], ai-mac-4: ['ai-mac-5(Ethernet (en0))', 'ai-mac-3(Ethernet (en0))']})
Error sending opaque status to ai-mac-4: <AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Received RST_STREAM with error code 7"
debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"Received RST_STREAM with error code 7", grpc_status:14, created_time:"2025-01-29T15:55:34.05777-07:00"}"
>
Traceback (most recent call last):
File "/Users/exo/workspace/exo/.venv/lib/python3.12/site-packages/exo/orchestration/node.py", line 606, in send_status_to_peer
await asyncio.wait_for(peer.send_opaque_status(request_id, status), timeout=15.0)
File "/Users/exo/.local/share/uv/python/cpython-3.12.8-macos-aarch64-none/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
return await fut
^^^^^^^^^
File "/Users/exo/workspace/exo/.venv/lib/python3.12/site-packages/exo/networking/grpc/grpc_peer_handle.py", line 197, in send_opaque_status
await self.stub.SendOpaqueStatus(request)
File "/Users/exo/workspace/exo/.venv/lib/python3.12/site-packages/grpc/aio/_call.py", line 327, in __await__
raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Received RST_STREAM with error code 7"
debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"Received RST_STREAM with error code 7", grpc_status:14, created_time:"2025-01-29T15:55:34.05777-07:00"}"
Looks like one node lost connection. Did it go away after a few seconds?
It shouldn't have; they are all on Thunderbolt and going through a 1 Gb switch.
Following up: I don't think that's it. It floods thousands of these errors pretty consistently when downloading models. Tiny chat also freezes and downloads stall. Restarting exo fixes it for a bit, but it happens again on any large download.
I have the exact same problem with larger model downloads; waiting for a resolution.
How are you connecting?
Does the download complete successfully? It looks like something related to high network load.
It does not. You need to restart the Exo process and restart the downloads.
Me too.
I use Thunderbolt 5.
I'm getting this as well.
Same here. Continuously loses connection with one node and outputs thousands of these messages, while downloading large models.
+1 to this issue. Still happening on June 26, 2025
Happening here too, with latest Exo on a cluster of Raspberry Pis. It makes my terminal window jumpy too, if I have the Exo UI running. It seemed to continue downloading, just with tons of those messages as each node seemed to download a chunk of the model one after the other.
gRPC Topology Collection Fix Available
I've identified and fixed the root cause of the gRPC flooding errors during model downloads.
PR: #887
The issue was caused by topology collection timing out (30s) during peer discovery. This happened because:
- Server-side: Returned cached topology instead of recursively collecting from peers
- Client-side: No retry logic on the collect_topology RPC call (unlike other critical calls)
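To make the server-side point concrete, here is a toy sketch of what recursive collection means compared with returning only a node's own view. It is not the exo implementation: the LINKS table, the node names, and the collect_topology signature below are assumptions for illustration only.

```python
import asyncio

# Hypothetical cluster wiring: node-a can only see node-b directly,
# so it learns about node-c only if node-b is asked in turn.
LINKS = {
    "node-a": ["node-b"],
    "node-b": ["node-a", "node-c"],
    "node-c": ["node-b"],
}


async def collect_topology(node_id, visited, max_depth):
    """Return {node: peers} for every node reachable within max_depth hops."""
    topology = {node_id: list(LINKS[node_id])}
    visited.add(node_id)
    if max_depth <= 0:
        # Depth exhausted: report only this node's own view.
        return topology
    for peer in LINKS[node_id]:
        if peer in visited:
            continue  # avoid asking the same node twice (cycles)
        # In a real cluster this would be an RPC to the peer.
        topology.update(await collect_topology(peer, visited, max_depth - 1))
    return topology


def collect(node_id, max_depth):
    """Synchronous helper that starts a fresh collection from node_id."""
    return asyncio.run(collect_topology(node_id, set(), max_depth))
```

With max_depth set to 0 the caller only sees its direct neighbours, which is roughly what a non-recursive answer looks like; with a larger depth, node-c shows up in node-a's view as well.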
Solution
- Changed server to recursively collect topology (1 line fix in grpc_server.py)
- Added retry logic with exponential backoff (6 lines in grpc_peer_handle.py)
- Total: 7 lines across 2 files
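For readers unfamiliar with the pattern, retry with exponential backoff around an async call looks roughly like the sketch below. This is a generic illustration, not the code from the PR: call_with_retry, its parameters, and the default values are hypothetical, and a real gRPC client would likely catch grpc.aio.AioRpcError rather than ConnectionError.

```python
import asyncio


async def call_with_retry(make_call, attempts=3, base_delay=0.5, timeout=5.0):
    """Await make_call() with a per-attempt timeout, backing off between tries.

    make_call is a zero-argument function returning a fresh awaitable,
    because a coroutine object cannot be awaited a second time.
    """
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(make_call(), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError):
            # Assumption: these stand in for the transient errors of interest.
            if attempt == attempts - 1:
                raise  # out of attempts, surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            await asyncio.sleep(base_delay * 2 ** attempt)
```

A caller would wrap the RPC in a lambda, for example call_with_retry(lambda: stub_call(request)), where stub_call is whatever awaitable RPC is being retried. Capping the number of attempts keeps a dead peer from being retried forever.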
Verification
- Before: 30s timeout → failure every 2 seconds
- After: <1s completion, 100% success rate
- Testing: 3-node cluster, 2+ hours stable operation with zero timeout errors
This also fixes issue #793 (same root cause - gRPC topology collection timeouts).
Full root cause analysis and testing results available in the PR.