Mitigate Split-brain of Long-lived Connection
Description:
Recently, we introduced a sharded cluster mode to support production environments.
However, there is an issue with long-lived connections such as the WatchDocument RPC: the connection becomes split-brained when the backend host set changes (a server is added or removed). For more context about this issue, follow the links below:
- Sharded Cluster Mode Design Document: Risks and Mitigation
- Long-lived Connection Closing Issue in envoy
Currently, we mitigate this issue by forcefully closing connections with envoy's stream_idle_timeout: when a connection goes idle for a while (1 min) due to a split-brain, it is forcefully closed and rerouted to the proper backend server.
But this is not a complete solution, because there is a time window (about 1 min or less) between the moment a connection becomes split-brained and the moment it is reestablished by the forceful closure. During this window, users cannot receive any change notifications via WatchDocument, which decreases sync sensitivity between peers.
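For reference, the current mitigation corresponds to an envoy HTTP connection manager setting roughly like the fragment below (a sketch; surrounding listener/filter configuration is abbreviated, and the 60s value is illustrative):

```yaml
http_connection_manager:
  # Reset any stream that sees no activity for 60s. A split-brained
  # WatchDocument stream eventually goes idle, gets reset here, and the
  # client's reconnect is routed to a live backend.
  stream_idle_timeout: 60s
```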
To solve this issue, we need a graceful and instant way to reestablish the connection when a split-brain occurs.
Since gRPC is based on HTTP/2, we can use the HTTP/2 GOAWAY frame to gracefully close connections. As RFC 7540 defines, the GOAWAY frame is used to initiate a graceful connection close:
The GOAWAY frame (type=0x7) is used to initiate shutdown of a connection or to signal serious error conditions. GOAWAY allows an endpoint to gracefully stop accepting new streams while still finishing processing of previously established streams. This enables administrative actions, like server maintenance.
We can use gRPC's MAX_CONNECTION_AGE to send a GOAWAY frame when a connection reaches its maximum age (this is what gRPC suggests for load balancing long-lived connections).
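On the server side this is a keepalive configuration fragment, roughly like the sketch below (the durations and listen address are illustrative, not our production values):

```go
package main

import (
	"net"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/keepalive"
)

func main() {
	// MaxConnectionAge makes the server send GOAWAY once a connection has
	// been alive this long; MaxConnectionAgeGrace then forcibly closes the
	// connection if streams are still open after the grace period.
	srv := grpc.NewServer(grpc.KeepaliveParams(keepalive.ServerParameters{
		MaxConnectionAge:      60 * time.Second, // illustrative value
		MaxConnectionAgeGrace: 20 * time.Second, // illustrative value
	}))

	lis, err := net.Listen("tcp", ":8080") // illustrative address
	if err != nil {
		panic(err)
	}
	_ = srv.Serve(lis)
}
```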
Moreover, we can use envoy's close_connections_on_host_set_change to instantly and gracefully close connections: this option drains connections whenever the backend host set changes, and the drain sequence sends an HTTP/2 GOAWAY to terminate the connection.
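In envoy this is a per-cluster load-balancing setting, roughly as sketched below (the cluster name and discovery type are hypothetical):

```yaml
clusters:
  - name: yorkie-backend   # hypothetical cluster name
    type: STRICT_DNS
    common_lb_config:
      # Drain (GOAWAY + close) all connections whenever the set of healthy
      # upstream hosts changes, instead of waiting for an idle timeout.
      close_connections_on_host_set_change: true
```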
But GOAWAY is not a signal to close the connection instantly; it only tells the client not to send additional requests to the server (https://github.com/grpc/grpc-java/issues/8770), so we need to handle the connection closure on the client side.
When the client receives a GOAWAY frame from the server, it needs to reset the connection and reestablish it.
So the overall sequence will look something like this:
1. close_connections_on_host_set_change is set in the envoy proxy.
2. The backend host set changes (a server is added or removed).
3. envoy notices the host set change and starts its drain sequence.
4. During the drain sequence, envoy sends an HTTP/2 GOAWAY frame to clients.
5. The client receives the GOAWAY frame from the server (actually the proxy).
6. The client resets the connection and establishes a new one.
This process ensures an instant and graceful connection close, and completely resolves WatchDocument's split-brain issue.
We need to implement a GOAWAY handler on the client side, in the Go SDK, JS SDK, etc. I'm currently searching for a way to implement this in Go, and I will post progress in the comments below.
Why:
To completely resolve the decreased sync sensitivity between peers caused by WatchDocument's split-brain issue.
I've confirmed that the server sends a GOAWAY frame when a stream exceeds MAX_CONNECTION_AGE by setting the GODEBUG=http2debug=2 environment variable for HTTP/2 tracing.
```
--- After stream exceeds MAX_CONNECTION_AGE ---
2023/04/28 18:16:31 http2: Framer 0x14000188000: wrote GOAWAY len=8 LastStreamID=2147483647 ErrCode=NO_ERROR Debug=""
2023/04/28 18:16:31 http2: Framer 0x14000188000: wrote PING len=8 ping="\x01\x06\x01\b\x00\x03\x03\t"
2023/04/28 18:16:31 http2: Framer 0x140006b2000: read GOAWAY len=8 LastStreamID=2147483647 ErrCode=NO_ERROR Debug=""
2023/04/28 18:16:31 http2: Framer 0x140006b2000: read PING len=8 ping="\x01\x06\x01\b\x00\x03\x03\t"
2023/04/28 18:16:31 http2: Framer 0x140006b2000: wrote PING flags=ACK len=8 ping="\x01\x06\x01\b\x00\x03\x03\t"
2023/04/28 18:16:31 http2: Framer 0x14000188000: read PING flags=ACK len=8 ping="\x01\x06\x01\b\x00\x03\x03\t"
2023/04/28 18:16:31 http2: Framer 0x14000188000: wrote GOAWAY len=8 LastStreamID=5 ErrCode=NO_ERROR Debug=""
2023/04/28 18:16:31 http2: Framer 0x140006b2000: read GOAWAY len=8 LastStreamID=5 ErrCode=NO_ERROR Debug=""
```
But I'm still searching for how to capture the HTTP/2 GOAWAY frame in gRPC.
As far as I understand, gRPC's HTTP/2 transport layer (http2_server and http2_client) handles GOAWAY, but it does not close the stream on receiving GOAWAY. Also, I couldn't find a way to access HTTP/2 frames with the gRPC Go SDK.
So I left a question in grpc/grpc-go, hoping the gRPC maintainers can provide a good explanation.
I have discussed this issue with the gRPC community, and found out that our WatchDocument RPC handler is not properly coded:
Graceful close of connections wait for existing streams to be closed before the connection is closed. If your server RPC handler never returns, then existing streams will not be closed, and therefore graceful connection close will not happen.
Since our WatchDocument server-side streaming RPC handler never returns, there is no "graceful close" of the connection; even when GOAWAY is sent, the connection will not be gracefully closed.
Therefore, we might need to add a timer to our WatchDocument RPC handler so that the handler returns when the timer expires, allowing a graceful connection close. We can combine this with MaxConnectionAge and MaxConnectionAgeGrace to ensure the connection is eventually closed (e.g. MaxConnectionAge set to 60s, the RPC timer to 70s, and MaxConnectionAgeGrace to 80s or so).
After more research on gRPC server-side streaming usage, I found the KubeCon 2018 video "Using gRPC for Long-lived and Streaming RPCs" by Eric Anderson (Google), which explains gRPC's long-lived RPC issues and their mitigations.
This is what I have concluded based on the above reference:
- gRPC server-side streams can stay connected for days or more (typical use cases are watch/notification).
- But server-side streaming can cause problems for load balancing: load balancing is performed per RPC, so an already-created RPC stays connected to its old backend for its entire lifetime and will not move to a new backend even when one comes up.
- MAX_CONNECTION_AGE does not kill the connection itself, so using this option alone will not resolve the issue (just sending GOAWAY does not close the connection).
- To improve load balancing, the server should close the RPC occasionally, and add the MAX_CONNECTION_AGE_GRACE option alongside MAX_CONNECTION_AGE to forcefully close the connection. gRPC suggests using these options as a backup so that the connection is eventually closed.
Therefore, I suggest two options for RPC connection close:
- RPC timer + MAX_CONNECTION_AGE + MAX_CONNECTION_AGE_GRACE: introduce a timer in the WatchDocument RPC to periodically close the connection, and set the MAX_CONNECTION_AGE and MAX_CONNECTION_AGE_GRACE options as a backup to close the RPC.
- stream_idle_timeout + MAX_CONNECTION_AGE + MAX_CONNECTION_AGE_GRACE: use stream_idle_timeout to detect idle connections and close them, minimizing the connection's split-brain time when an upstream host changes, and set the MAX_CONNECTION_AGE and MAX_CONNECTION_AGE_GRACE options as a backup to close the RPC.
Option 1 is the "graceful" and "suggested" way to improve (resolve) this issue, but I think option 2 is more suitable for our use case. Because Yorkie is used for "real-time" collaboration, sync sensitivity between peers is very important, so noticing the split-brain and closing the connection as soon as possible matters more than a graceful, longer connection-close interval.
The stream_idle_timeout option will periodically emit errors when a single user just keeps a document open without doing anything. So I think we should catch and hide the P2_PROTOCOL_ERROR caused by stream_idle_timeout from clients.
To conclude:
- Keep using the stream_idle_timeout + MAX_CONNECTION_AGE + MAX_CONNECTION_AGE_GRACE options.
- But it would be better to catch and hide the P2_PROTOCOL_ERROR caused by stream_idle_timeout's forceful connection close in our clients.
Related to https://github.com/yorkie-team/devops/issues/21
We need to reconsider this issue because we changed our RPC stack from gRPC to Connect: https://github.com/yorkie-team/yorkie/issues/668