
Flooding GRPC errors when downloading models

Open austinbv opened this issue 1 year ago • 14 comments

When trying to download models the downloads will start but I am flooded with

et (en0))'], ai-mac-5: ['ai-mac-4(Ethernet (en0))', 'ai-mac-3(Ethernet (en0))'], ai-mac-4: ['ai-mac-5(Ethernet (en0))', 'ai-mac-3(Ethernet (en0))']})
Error sending opaque status to ai-mac-4: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Received RST_STREAM with error code 7"
        debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Received RST_STREAM with error code 7", grpc_status:14, created_time:"2025-01-29T15:55:34.05777-07:00"}"
>
Traceback (most recent call last):
  File "/Users/exo/workspace/exo/.venv/lib/python3.12/site-packages/exo/orchestration/node.py", line 606, in send_status_to_peer
    await asyncio.wait_for(peer.send_opaque_status(request_id, status), timeout=15.0)
  File "/Users/exo/.local/share/uv/python/cpython-3.12.8-macos-aarch64-none/lib/python3.12/asyncio/tasks.py", line 520, in wait_for
    return await fut
           ^^^^^^^^^
  File "/Users/exo/workspace/exo/.venv/lib/python3.12/site-packages/exo/networking/grpc/grpc_peer_handle.py", line 197, in send_opaque_status
    await self.stub.SendOpaqueStatus(request)
  File "/Users/exo/workspace/exo/.venv/lib/python3.12/site-packages/grpc/aio/_call.py", line 327, in __await__
    raise _create_rpc_error(
grpc.aio._call.AioRpcError: <AioRpcError of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Received RST_STREAM with error code 7"
        debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Received RST_STREAM with error code 7", grpc_status:14, created_time:"2025-01-29T15:55:34.05777-07:00"}"

austinbv avatar Jan 29 '25 22:01 austinbv

Looks like one node lost connection. Did it go away after a few seconds?

AlexCheema avatar Jan 30 '25 20:01 AlexCheema

It shouldn't have; they are all Thunderbolt and going through a 1 Gb switch.

austinbv avatar Jan 31 '25 04:01 austinbv

Following up. I don't think that's it. It floods thousands of these errors pretty consistently when downloading models. Tiny chat also freezes and downloads stall. Restarting exo fixes it for a bit, but it happens again on any large download.

austinbv avatar Jan 31 '25 04:01 austinbv

I have the exact same problem with larger model downloads, waiting for a resolution.

cnsren avatar Jan 31 '25 17:01 cnsren

> I have the exact same problem with larger model downloads, waiting for a resolution.

How are you connecting?

austinbv avatar Jan 31 '25 17:01 austinbv

Does the download complete successfully? It looks like something related to high network load.

AlexCheema avatar Feb 03 '25 22:02 AlexCheema

It does not. You need to restart the exo process and restart the downloads.

austinbv avatar Feb 03 '25 23:02 austinbv

Me too.

dakecrazy avatar Feb 06 '25 08:02 dakecrazy

I use Thunderbolt 5.

dakecrazy avatar Feb 06 '25 08:02 dakecrazy

I'm getting this as well.

pcfreak30 avatar Feb 17 '25 02:02 pcfreak30

Same here. Continuously loses connection with one node and outputs thousands of these messages, while downloading large models.

gogothegreen avatar Mar 18 '25 23:03 gogothegreen

+1 to this issue. Still happening as of June 26, 2025.

esper2142 avatar Jun 26 '25 14:06 esper2142

Happening here too, with the latest exo on a cluster of Raspberry Pis. It makes my terminal window jumpy too if I have the exo UI running. Downloading seemed to continue, just with tons of those messages, as each node downloaded a chunk of the model one after the other.

geerlingguy avatar Aug 29 '25 21:08 geerlingguy

gRPC Topology Collection Fix Available

I've identified and fixed the root cause of the gRPC flooding errors during model downloads.

PR: #887

The issue was caused by topology collection timing out (30s) during peer discovery. This happened because:

  1. Server-side: Returned cached topology instead of recursively collecting from peers
  2. Client-side: No retry logic on collect_topology RPC call (unlike other critical calls)
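To make the server-side change concrete, here is an illustrative sketch of recursive topology collection versus returning a cached snapshot. This is not exo's actual code; `peers_of` stands in for the per-peer `collect_topology` RPC, and the node names are just the ones from the log above:

```python
def collect_topology(node_id, peers_of, visited=None):
    """Recursively walk the peer graph, asking each peer for its own
    peers, instead of returning a stale cached snapshot.
    `peers_of` is a stand-in for the real collect_topology RPC."""
    if visited is None:
        visited = set()
    visited.add(node_id)
    topology = {node_id: list(peers_of[node_id])}
    for peer in peers_of[node_id]:
        if peer not in visited:
            # Only recurse into peers we have not seen, so cycles
            # in the peer graph terminate.
            topology.update(collect_topology(peer, peers_of, visited))
    return topology

# Three fully-connected nodes, mirroring the cluster in the logs.
peers = {
    "ai-mac-3": ["ai-mac-4", "ai-mac-5"],
    "ai-mac-4": ["ai-mac-3", "ai-mac-5"],
    "ai-mac-5": ["ai-mac-3", "ai-mac-4"],
}
topo = collect_topology("ai-mac-3", peers)
```

The `visited` set is what keeps the recursion from looping forever on a fully-connected cluster.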

Solution

  • Changed server to recursively collect topology (1 line fix in grpc_server.py)
  • Added retry logic with exponential backoff (6 lines in grpc_peer_handle.py)
  • Total: 7 lines across 2 files
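The client-side retry can be sketched like this (illustrative Python, not exo's actual helper; the function names, retry counts, and the use of `ConnectionError` in place of `AioRpcError` are all assumptions for the demo):

```python
import asyncio
import random

async def call_with_backoff(rpc, retries=3, base_delay=0.25, max_delay=2.0):
    """Retry an async call with exponential backoff on transient failures,
    in the spirit of the retry added around collect_topology."""
    for attempt in range(retries + 1):
        try:
            return await rpc()
        except ConnectionError:
            if attempt == retries:
                raise
            delay = min(base_delay * (2 ** attempt), max_delay)
            # Full jitter avoids synchronized retry storms across peers.
            await asyncio.sleep(delay * random.random())

# Demo: a call that fails twice, as a transient UNAVAILABLE would, then succeeds.
attempts = 0

async def flaky_collect_topology():
    global attempts
    attempts += 1
    if attempts < 3:
        raise ConnectionError("UNAVAILABLE: Received RST_STREAM with error code 7")
    return {"nodes": 3}

result = asyncio.run(call_with_backoff(flaky_collect_topology))
```

With backoff in place, a brief `UNAVAILABLE` during a heavy download is absorbed instead of surfacing as a flood of tracebacks.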

Verification

  • Before: 30s timeout → failure every 2 seconds
  • After: <1s completion, 100% success rate
  • Testing: 3-node cluster, 2+ hours stable operation with zero timeout errors

This also fixes issue #793 (same root cause - gRPC topology collection timeouts).

Full root cause analysis and testing results available in the PR.

palios-taey avatar Oct 20 '25 20:10 palios-taey