envoy
Envoy ALS message rotation on multiple streams
This issue proposes to introduce connection re-establishment in the Envoy ALS gRPC client. In the current implementation, the connection between Envoy and the receiver lives for so long that scaling out the receiver has no effect. To load balance the Envoy ALS receiver, the async gRPC client in Envoy needs to re-establish the connection at regular intervals or according to some other criteria. While no connection is established, logs are watermark buffered and flushed as needed.
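For context, here is a minimal sketch of the kind of gRPC ALS configuration being discussed; the cluster name `als_cluster`, the log name, and the buffer values are illustrative and not part of this proposal:

```yaml
access_log:
- name: envoy.access_loggers.http_grpc
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.grpc.v3.HttpGrpcAccessLogConfig
    common_config:
      log_name: als                  # illustrative log name
      transport_api_version: V3
      grpc_service:
        envoy_grpc:
          cluster_name: als_cluster  # hypothetical cluster pointing at the ALS receivers
      # Existing buffering knobs: logs accumulate up to the buffer size and are
      # flushed on the interval, i.e. the "watermark buffered" behavior above.
      buffer_size_bytes: 16384
      buffer_flush_interval: 1s
```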
cc. @lizan
I had a discussion with @wu-sheng offline, and my idea is "having multiple streams and rotate/drain one stream periodically at a time" rather than "use/reconnect one stream and buffer logs if stream being unavailable". This way no logs get lost during load balancing/reconnects. The downside is that the implementation may be more complex than the reconnecting/buffering solution, but I'm confident it is implementable. @wu-sheng @lizan thoughts?
"having multiple streams and rotate/drain one stream periodically at a time" rather than "use/reconnect one stream and buffer logs if stream being unavailable".
Makes sense to me. Envoy could cut unused connections in the pool and try to reconnect to a new upstream.
@mathetake @wu-sheng I also think this approach is better in this situation. But I have some questions.
- Suppose that you initially establish connections with multiple APM clusters and that messages are balanced evenly across the streams on those connections. Later, when all APM clusters are restarted, the logs will be lost; what are your thoughts on such a case? I don't think there is any need to consider buffering for it.
- When the number of APM clusters increases due to scale-out, it will be necessary to create new connections. What do you think about the possible performance concerns when the number of connections grows that large? In such a case it may be necessary to aggregate the connections in some way, but I don't have an optimal idea for that. Another option would be to configure the system not to establish more than a certain number of connections.
> Suppose that you initially establish connections with multiple APM clusters and that messages are balanced evenly across the streams on those connections. Later, when all APM clusters are restarted, the logs will be lost; what are your thoughts on such a case? I don't think there is any need to consider buffering for it.
If all SkyWalking instances reboot, then it is not Envoy's responsibility.
> When the number of APM clusters increases due to scale-out, it will be necessary to create new connections. What do you think about the possible performance concerns when the number of connections grows that large? In such a case it may be necessary to aggregate the connections in some way, but I don't have an optimal idea for that. Another option would be to configure the system not to establish more than a certain number of connections.
Yes, we should provide a configuration option to set a threshold for the maximum number of connections to the upstream.
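For reference, Envoy already exposes a per-cluster cap on upstream connections through circuit breakers; whether the ALS connection pool would reuse that mechanism or grow a dedicated knob is still an open question in this thread. A minimal sketch, assuming a hypothetical `als_cluster` and an illustrative limit of 4 connections:

```yaml
clusters:
- name: als_cluster          # hypothetical cluster for the ALS/SkyWalking receivers
  type: STRICT_DNS
  circuit_breakers:
    thresholds:
    - priority: DEFAULT
      max_connections: 4     # upper bound on connections Envoy opens to this cluster
```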
OK. There may be some points that need further discussion, but I can take this task.
I think we're mixing two issues here:
1. Logs are not buffered for retry/connection loss.
2. Long-lived ALS streams are not load balanced.
For 1, it needs a modification to the existing ALS service (https://github.com/envoyproxy/envoy/blob/main/api/envoy/service/accesslog/v3/als.proto#L22-L27) so the ALS service can ACK the receipt. For 2, implementing a reconnect after a certain time/number of requests would be great.
The two solutions should be configurable at the same time; then the behavior would be much like "having multiple streams and rotate/drain one stream periodically at a time".
I agree with ^. I thought I had a TODO on connection rotation but I can't find it.
Reading https://github.com/envoyproxy/envoy/issues/17331, it sounds like the reason we need this is that we want connection rotation? Would it make sense to implement connection rotation in core instead of exposing additional implementation details to ALS and having it implement its own rotation logic?
@mattklein123 @alyssawilk
+1, we've had other requests for load balancing across connections (e.g. https://github.com/envoyproxy/envoy/issues/16426), so I think we want to add this functionality to the connection pool.
AFAIK, connection rotation is hard if it is based on gRPC streaming requests. The ALS client would know how to rotate, but for general gRPC rotation it seems hard.
Yeah, perhaps I don't understand what the end state would look like. I think it'd help to have a one-pager design doc fleshing out the full plan, and then we can review individual steps.
Naively, I'd think that if you wanted to RR across connections and streams, the connection pool could hold multiple connections and ALS could hold multiple stream handles, alternating which stream it writes protos to.
That is what we plan to do. One more requirement: the upstream (the ALS handler) has multiple pods, and each of them should process the data flow from hundreds of Envoys. We hope the connections in the pool could use only part of the upstream list.
For example, with N SkyWalking servers processing ALS, the Envoy ALS client could open a pool with X connections, where X <= Y based on configuration. The X connections would switch among those N servers periodically.
The problem here is that connection pool connection rotation (either via max requests/streams, max connection length, etc.) doesn't really play well with gRPC streams. If we want to do this in a generic way and max connection length doesn't work, you could potentially do something generic like max bytes or max data frames, etc.
I think my larger question is why isn't max connection length sufficient? This seems like it should work for rotation with gRPC streams.
To sum up the opinions:
The current implementation policy (multiplexing connections in a pool and round-robining messages across them) is redundant, so it should be changed to work as follows.
Set the max connection duration (I think this also applies to gRPC streaming), and connect to a different pod when the connection is re-established.
The following strategies for connection re-establishment are currently under consideration:
- simple round robin
- a strategy that equalizes the number of message bytes sent
If you have any suggestions, please let me know.
@snowp @alyssawilk @mattklein123
For connection re-establishment, isn't whatever LB policy is in effect OK? RR, LR, etc.?
> For connection re-establishment, isn't whatever LB policy is in effect OK? RR, LR, etc.?
It will be affected by max_stream_duration, but part of my requirement is met by this feature, rotating the gRPC stream, so we can apply an H2 LB policy such as RR. The other issue is that if the selected host is not available or is reported unhealthy by the health check service, Envoy should rebalance the messages. Do you have additional concerns about this?
> The other issue is that if the selected host is not available or is reported unhealthy by the health check service, Envoy should rebalance the messages. Do you have additional concerns about this?
I think this is a different feature request. Basically you are asking for connections to be terminated immediately if HC says the host is down. IIRC we already put connections into draining when this happens, but we don't terminate them outright. I think that could be an optional feature on the connection pool.
> I think that could be an optional feature on the connection pool.
I think this feature is similar to retry, so we may reuse retry logic after introducing a cluster-specific retry policy.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
We still need this.
Have there been any updates on this issue?
Setting maxStreamDuration and maxConnectionDuration enables connection rotation; refer to https://github.com/envoyproxy/envoy/issues/16554#issuecomment-2117776263.
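Based on that comment, a minimal sketch of the rotation setup using the upstream `HttpProtocolOptions` on a hypothetical `als_cluster`; the 300s durations are illustrative, and how aggressively to rotate is a tuning choice:

```yaml
clusters:
- name: als_cluster                    # hypothetical cluster for the ALS receivers
  typed_extension_protocol_options:
    envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
      "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
      common_http_protocol_options:
        max_connection_duration: 300s  # rotate the underlying H2 connection
        max_stream_duration: 300s      # also bound the long-lived ALS stream
      explicit_http_config:
        http2_protocol_options: {}
```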