Mitigate Split-brain of Long-lived Connection
Description:
Recently, we introduced a sharded cluster mode to support production environments.
However, there is an issue with long-lived connections such as the WatchDocument RPC: the connection becomes split-brained when the backend host set changes (a server is added or removed). For more context about this issue, follow the links below:
- Sharded Cluster Mode Design Document: Risks and Mitigation
- Long-lived Connection Closing Issue in envoy
Currently, we mitigate this issue by forcefully closing connections with envoy's stream_idle_timeout: when a connection goes idle for a while (1 min) due to a split-brain, it is forcefully closed and rerouted to the proper backend server.
But this is not a complete solution, because there is a time window (about 1 min or less) between the moment a connection becomes split-brained and the moment it is reestablished by the forceful closure. During this window, users cannot receive any change notifications via WatchDocument, which decreases sync sensitivity between peers.
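For reference, the current mitigation corresponds to an envoy HTTP connection manager setting roughly like the fragment below (a sketch; surrounding listener/filter configuration is abbreviated, and the 60s value is illustrative):

```yaml
http_connection_manager:
  # Reset any stream that sees no activity for 60s. A split-brained
  # WatchDocument stream eventually goes idle, gets reset here, and the
  # client's reconnect is routed to a live backend.
  stream_idle_timeout: 60s
```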
To solve this issue, we need a graceful and instant way to reestablish the connection when a split-brain occurs.
Since gRPC is based on HTTP/2, we can use the HTTP/2 GOAWAY frame to gracefully close connections. As RFC 7540 defines, the GOAWAY frame is used to initiate a graceful connection close:
The GOAWAY frame (type=0x7) is used to initiate shutdown of a connection or to signal serious error conditions. GOAWAY allows an endpoint to gracefully stop accepting new streams while still finishing processing of previously established streams. This enables administrative actions, like server maintenance.
We can use gRPC's MAX_CONNECTION_AGE to send a GOAWAY frame when a connection reaches its maximum age (this is what gRPC suggests for load balancing long-lived connections).
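On the server side this is a keepalive configuration fragment, roughly like the sketch below (the durations and listen address are illustrative, not our production values):

```go
package main

import (
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// MaxConnectionAge makes the server send GOAWAY once a connection has
	// been alive this long; MaxConnectionAgeGrace then forcibly closes the
	// connection if streams are still open after the grace period.
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge:      60 * time.Second, // illustrative value
		MaxConnectionAgeGrace: 20 * time.Second, // illustrative value
	}))

	lis, err := net.Listen("tcp", ":8080") // illustrative address
	if err != nil {
		panic(err)
	}
	_ = srv.Serve(lis)
}
```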
Moreover, we can use envoy's close_connections_on_host_set_change to instantly and gracefully close connections: this option drains connections whenever the backend host set changes, and the drain sequence sends an HTTP/2 GOAWAY to terminate the connection.
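In envoy this is a per-cluster load-balancing setting, roughly as sketched below (the cluster name and discovery type are hypothetical):

```yaml
clusters:
  - name: yorkie-backend   # hypothetical cluster name
    type: STRICT_DNS
    common_lb_config:
      # Drain (GOAWAY + close) all connections whenever the set of healthy
      # upstream hosts changes, instead of waiting for an idle timeout.
      close_connections_on_host_set_change: true
```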
But GOAWAY is not a signal to close the connection instantly; it only tells the client not to send additional requests to the server (https://github.com/grpc/grpc-java/issues/8770), so we need to handle the connection closure on the client side.
When the client receives a GOAWAY frame from the server, it needs to reset the connection and reestablish it.
So the overall sequence will look something like this:
1. close_connections_on_host_set_change is set in the envoy proxy.
2. The backend host set changes (a server is added or removed).
3. envoy notices the host set change and starts its drain sequence.
4. During the drain sequence, envoy sends an HTTP/2 GOAWAY frame to clients.
5. The client receives the GOAWAY frame from the server (actually the proxy).
6. The client resets the connection and establishes a new one.
This process ensures an instant and graceful connection close, and completely resolves WatchDocument's split-brain issue.
We need to implement a GOAWAY handler on the client side, in the Go SDK, JS SDK, etc. I'm currently searching for a way to implement this in Go, and I will post progress in the comments below.
Why:
To completely resolve the decreased sync sensitivity between peers caused by WatchDocument's split-brain issue.
I've confirmed that the server sends a GOAWAY frame when a stream exceeds MAX_CONNECTION_AGE by setting the GODEBUG=http2debug=2 environment variable for HTTP/2 tracing.
```
--- After stream exceeds MAX_CONNECTION_AGE ---
2023/04/28 18:16:31 http2: Framer 0x14000188000: wrote GOAWAY len=8 LastStreamID=2147483647 ErrCode=NO_ERROR Debug=""
2023/04/28 18:16:31 http2: Framer 0x14000188000: wrote PING len=8 ping="\x01\x06\x01\b\x00\x03\x03\t"
2023/04/28 18:16:31 http2: Framer 0x140006b2000: read GOAWAY len=8 LastStreamID=2147483647 ErrCode=NO_ERROR Debug=""
2023/04/28 18:16:31 http2: Framer 0x140006b2000: read PING len=8 ping="\x01\x06\x01\b\x00\x03\x03\t"
2023/04/28 18:16:31 http2: Framer 0x140006b2000: wrote PING flags=ACK len=8 ping="\x01\x06\x01\b\x00\x03\x03\t"
2023/04/28 18:16:31 http2: Framer 0x14000188000: read PING flags=ACK len=8 ping="\x01\x06\x01\b\x00\x03\x03\t"
2023/04/28 18:16:31 http2: Framer 0x14000188000: wrote GOAWAY len=8 LastStreamID=5 ErrCode=NO_ERROR Debug=""
2023/04/28 18:16:31 http2: Framer 0x140006b2000: read GOAWAY len=8 LastStreamID=5 ErrCode=NO_ERROR Debug=""
```
But I'm still searching for how to capture the HTTP/2 GOAWAY frame in gRPC.
As far as I understand, gRPC's HTTP/2 transport layer (http2_server and http2_client) handles GOAWAY, but it does not close the stream on receiving GOAWAY. Also, I couldn't find a way to access HTTP/2 frames with the gRPC Go SDK.
So I left a question in grpc/grpc-go, hoping the gRPC maintainers can provide a good explanation.
I have discussed this issue with the gRPC community, and found out that our WatchDocument RPC handler is not properly coded:
Graceful close of connections wait for existing streams to be closed before the connection is closed. If your server RPC handler never returns, then existing streams will not be closed, and therefore graceful connection close will not happen.
Since our WatchDocument server-side streaming RPC handler never returns, there is no "graceful close" of the connection; even when GOAWAY is sent, the connection will not be gracefully closed.
Therefore, we might need to add a timer to our WatchDocument RPC handler so that the handler returns when the timer expires, allowing a graceful connection close. We can combine this with MaxConnectionAge and MaxConnectionAgeGrace to ensure the connection is eventually closed (e.g. MaxConnectionAge set to 60s, the RPC timer to 70s, and MaxConnectionAgeGrace to 80s or so).
After more research on gRPC server-side streaming usage, I found the KubeCon 2018 video "Using gRPC for Long-lived and Streaming RPCs" by Eric Anderson (Google), which explains gRPC's long-lived RPC issues and their mitigations.
This is what I have concluded based on the above reference:
- gRPC server-side streams can stay connected for days or more (typical use cases are watch/notification).
- But server-side streaming can cause problems for load balancing: load balancing is performed per RPC, so an already-created RPC stays connected to its old backend for its entire lifetime and will not move to a new backend even when one comes up.
- MAX_CONNECTION_AGE does not kill the connection itself, so using this option alone will not resolve the issue (just sending GOAWAY does not close the connection).
- To improve load balancing, the server should close the RPC occasionally, and add the MAX_CONNECTION_AGE_GRACE option alongside MAX_CONNECTION_AGE to forcefully close the connection. gRPC suggests using these options as a backup so that the connection is eventually closed.
Therefore, I suggest two options for RPC connection close:
- RPC timer + MAX_CONNECTION_AGE + MAX_CONNECTION_AGE_GRACE: introduce a timer in the WatchDocument RPC to periodically close the connection, and set the MAX_CONNECTION_AGE and MAX_CONNECTION_AGE_GRACE options as a backup to close the RPC.
- stream_idle_timeout + MAX_CONNECTION_AGE + MAX_CONNECTION_AGE_GRACE: use stream_idle_timeout to detect idle connections and close them, minimizing the connection's split-brain time when an upstream host changes, and set the MAX_CONNECTION_AGE and MAX_CONNECTION_AGE_GRACE options as a backup to close the RPC.
Option 1 is the "graceful" and "suggested" way to improve (resolve) this issue, but I think option 2 is more suitable for our use case. Because Yorkie is used for "real-time" collaboration, sync sensitivity between peers is very important, so noticing the split-brain and closing the connection as soon as possible matters more than a graceful, longer connection-close interval.
The stream_idle_timeout option will periodically emit errors when a single user just keeps a document open without doing anything. So I think we should catch and hide the P2_PROTOCOL_ERROR caused by stream_idle_timeout from clients.
To conclude:
- Keep using the stream_idle_timeout + MAX_CONNECTION_AGE + MAX_CONNECTION_AGE_GRACE options.
- But it would be better to catch and hide the P2_PROTOCOL_ERROR caused by stream_idle_timeout's forceful connection close in our clients.
Related to https://github.com/yorkie-team/devops/issues/21
We need to reconsider this issue because we changed our RPC stack from gRPC to Connect: https://github.com/yorkie-team/yorkie/issues/668