graph-node
Firehose connections hang when more than 100 subgraphs are deployed
See this commit for a reproduction example. It exposes a few causes that seem to be conspiring:
- The log below can be obtained by running the example with the h2 tracing level at `DEBUG`. It shows that the firehose endpoint sets the http2 setting `max_concurrent_streams` to 100, thereby responding with `REFUSED_STREAM` to any additional streams. This limits the number of block streams to 100 on a single connection.

```
DEBUG Connection{peer=Client}: h2::codec::framed_read: received frame=Settings { flags: (0x0), max_concurrent_streams: 100, initial_window_size: 1048576, max_header_list_size: 65536 }
```
- This setting is per connection, and we were supposed to be using connection pooling, but the example seems to use a single connection no matter what `conn_pool_size` is set to. So the connection pooling implementation must either not be working or not be balancing based on the number of streams (least loaded, round-robin, or random would all work).
- Tonic seems not to retry establishing the stream after receiving `REFUSED_STREAM`. On retry it logs the lines below and then goes silent. This might be related to https://github.com/hyperium/tonic/issues/515.

```
2022-08-26T12:18:13.083821Z TRACE tonic::transport::service::reconnect: poll_ready; connected
2022-08-26T12:18:13.083864Z TRACE tonic::transport::service::reconnect: poll_ready; not ready
2022-08-26T12:18:13.083905Z TRACE hyper::client::dispatch: send_when canceled
```
On the potential fix:
If Firehose could reliably set `max_concurrent_streams` to a high value, that would be great. But as far as I know Firehose itself does not set a limit, so proxies with draconian defaults must be to blame (nginx, for example, defaults to 128). Since we don't want to add configuration pitfalls for operators or require specific proxies, we cannot rely on this being set higher than 100, which is the minimum recommended by the HTTP/2 RFC.
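To illustrate the proxy side of this: nginx exposes the limit through the `http2_max_concurrent_streams` directive, which defaults to 128. An operator fronting Firehose with nginx could raise it, but as noted above we cannot assume every deployment does:

```nginx
# nginx defaults to 128 streams per HTTP/2 connection; raising it is a
# per-deployment tweak that is easy to miss, so graph-node cannot rely on it.
http2_max_concurrent_streams 1024;
```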
The tonic bug seems related, but even if tonic retried, the stream would probably just be refused again.
So the most reliable fix would be to get connection pooling working with an algorithm that seeks to balance the number of streams per connection.
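A minimal sketch of such a pool, using only the standard library. Here `ConnPool` and the generic parameter `C` are hypothetical names; in graph-node `C` would be something like `tonic::transport::Channel`, and round-robin stands in for any stream-balancing strategy:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical round-robin connection pool. `C` stands in for a
/// connection handle such as `tonic::transport::Channel`.
struct ConnPool<C> {
    conns: Vec<C>,
    next: AtomicUsize,
}

impl<C> ConnPool<C> {
    fn new(conns: Vec<C>) -> Self {
        Self { conns, next: AtomicUsize::new(0) }
    }

    /// Hand out connections round-robin so streams spread evenly and no
    /// single connection exhausts its max_concurrent_streams budget.
    fn get(&self) -> &C {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.conns.len();
        &self.conns[i]
    }
}

fn main() {
    // Strings stand in for real channels in this sketch.
    let pool = ConnPool::new(vec!["conn-0", "conn-1", "conn-2"]);
    for _ in 0..6 {
        // Cycles conn-0, conn-1, conn-2, conn-0, ...
        println!("{}", pool.get());
    }
}
```

With 100 streams per connection, a pool of `ceil(n_subgraphs / 100)` connections filled this way keeps every connection under the limit.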
Mystery solved on why connection pooling wasn't working: tonic uses the URI as the key, so repeating the same endpoint doesn't really work.
That is super helpful, great investigation, thank you very much for all this great knowledge.
And it's great you also found why the connection pooling was not working; weird that it's not "reported" somehow, it would have been easily catchable. Have you raised an issue on the tonic repo about that? I'll do it if not.
Is connection pooling also active for Ethereum? It would be good to retry a shootout to see how it behaves with real connection pooling :)
I assume we should close this issue now?
We can close this issue after getting the `conn_pool_size` config added in https://github.com/graphprotocol/graph-node/pull/3833 to work as intended, probably by implementing a pool of `tonic::Channel`s rather than relying on tonic for balancing. Meanwhile, this can be worked around by configuring multiple providers with the same URL; each provider will correspond to a connection.