
Envoy ALS message rotation on multiple streams

Shikugawa opened this issue 3 years ago • 23 comments

This issue proposes to introduce connection re-establishment in the Envoy ALS gRPC client. In the current implementation, the connection between Envoy and the receiver lives for so long that scaling out the receiver has no effect. To load balance the Envoy ALS receiver, the Async gRPC client in Envoy needs to re-establish the connection at regular intervals or according to some other criteria. When no connection is established, logs are watermark buffered and flushed as needed.
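For context, here is a minimal sketch of how an ALS logger is typically wired to a gRPC cluster today; the `als_cluster` name, `log_name`, and buffer values are hypothetical placeholders chosen for illustration, not taken from this issue:

```yaml
# Hypothetical HTTP connection manager access_log entry pointing ALS at a gRPC cluster.
access_log:
- name: envoy.access_loggers.http_grpc
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.grpc.v3.HttpGrpcAccessLogConfig
    common_config:
      log_name: als                  # arbitrary placeholder name
      transport_api_version: V3
      grpc_service:
        envoy_grpc:
          cluster_name: als_cluster  # hypothetical cluster for the ALS receiver
      # Existing buffering knobs; the long-lived stream itself is what this issue
      # proposes to rotate or re-establish.
      buffer_flush_interval: 1s
      buffer_size_bytes: 16384
```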

cc. @lizan

Shikugawa avatar Jul 14 '21 06:07 Shikugawa

I had a discussion with @wu-sheng offline, and my idea is just "having multiple streams and rotate/drain one stream periodically at a time" rather than "use/reconnect one stream and buffer logs if stream being unavailable". This way, no logs get lost during load balancing/reconnects. The downside here is that the implementation may be more complex than just the reconnecting/buffering solution, but I'm confident it is implementable. @wu-sheng @lizan thoughts?

mathetake avatar Jul 14 '21 06:07 mathetake

"having multiple streams and rotate/drain one stream periodically at a time" rather than "use/reconnect one stream and buffer logs if stream being unavailable".

Makes sense to me. Envoy could cut unused connections in the pool and try to reconnect to a new upstream.

wu-sheng avatar Jul 14 '21 08:07 wu-sheng

I had a discussion with @wu-sheng offline, and my idea is just "having multiple streams and rotate/drain one stream periodically at a time" rather than "use/reconnect one stream and buffer logs if stream being unavailable". This way, no logs get lost during load balancing/reconnects. The downside here is that the implementation may be more complex than just the reconnecting/buffering solution, but I'm confident it is implementable. @wu-sheng @lizan thoughts?

@mathetake @wu-sheng I also think this approach is better in this situation. But I have some questions.

  1. Suppose that we initially establish connections with multiple APM clusters and that messages are balanced evenly across the streams on those connections. If all APM clusters are later restarted, the in-flight logs will be lost; what are your thoughts on such a case? I don't think buffering needs to be considered for it.

  2. When the number of APM clusters increases due to scale-out, new connections will need to be created. What do you think about the possible performance concerns when the number of connections grows large? In such a case it may be necessary to aggregate the connections in some way, but I don't have an optimal idea for that. Another option would be to configure the system not to establish more than a certain number of connections.

Shikugawa avatar Jul 14 '21 10:07 Shikugawa

I had a discussion with @wu-sheng offline, and my idea is just "having multiple streams and rotate/drain one stream periodically at a time" rather than "use/reconnect one stream and buffer logs if stream being unavailable". This way, no logs get lost during load balancing/reconnects. The downside here is that the implementation may be more complex than just the reconnecting/buffering solution, but I'm confident it is implementable. @wu-sheng @lizan thoughts?

@mathetake @wu-sheng I also think this approach is better in this situation. But I have some questions.

  1. Suppose that we initially establish connections with multiple APM clusters and that messages are balanced evenly across the streams on those connections. If all APM clusters are later restarted, the in-flight logs will be lost; what are your thoughts on such a case? I don't think buffering needs to be considered for it.

If all SkyWalking instances reboot, then it is not Envoy's responsibility.

  2. When the number of APM clusters increases due to scale-out, new connections will need to be created. What do you think about the possible performance concerns when the number of connections grows large? In such a case it may be necessary to aggregate the connections in some way, but I don't have an optimal idea for that. Another option would be to configure the system not to establish more than a certain number of connections.

Yes, we should provide a configuration option to set a threshold for the maximum number of connections to the upstream.

wu-sheng avatar Jul 14 '21 11:07 wu-sheng

OK. There may be some points that need further discussion, but I can take this task.

Shikugawa avatar Jul 14 '21 14:07 Shikugawa

I think we're mixing two issues here:

  1. logs are not buffered for retry/connection loss.
  2. long-lived ALS streams are not load balanced

For 1, it needs a modification to the existing ALS service (https://github.com/envoyproxy/envoy/blob/main/api/envoy/service/accesslog/v3/als.proto#L22-L27) so that the ALS service can ACK the receipt. For 2, implementing a reconnect after a certain time / number of requests would be great.

The two solutions should be configurable at the same time; then the behavior would be much like "having multiple streams and rotate/drain one stream periodically at a time".

lizan avatar Jul 14 '21 21:07 lizan

I agree with ^. I thought I had a TODO on connection rotation but I can't find it.

mattklein123 avatar Jul 14 '21 22:07 mattklein123

Reading https://github.com/envoyproxy/envoy/issues/17331, the reason we need this sounds like we want connection rotation? Would it make sense to implement this connection rotation in core instead of exposing additional implementation details to ALS and having it implement its own rotation logic?

@mattklein123 @alyssawilk

snowp avatar Jul 20 '21 14:07 snowp

+1, we've had other requests for load balancing across connections (e.g. https://github.com/envoyproxy/envoy/issues/16426), so I think we want to add this functionality to the connection pool.

alyssawilk avatar Jul 20 '21 14:07 alyssawilk

AFAIK, connection rotation is hard if it is based on gRPC streaming requests. The ALS client would know how to rotate, but for general gRPC streams it seems hard.

wu-sheng avatar Jul 20 '21 14:07 wu-sheng

Yeah, perhaps I don't understand what the end state would look like. I think it'd help to have a one-pager design doc fleshing out the full plan, and then we can review individual steps?

Naively, I'd think that if you wanted to RR across connections and streams, you could have the connection pool hold multiple connections and ALS hold multiple stream handles; it could then alternate writing protos to one stream and another.

alyssawilk avatar Jul 20 '21 14:07 alyssawilk

That is what we plan to do. One more requirement: the upstream (the ALS handler) has multiple pods, and each of them should process the data flow from hundreds of Envoys. We hope the connections in the pool could use only part of the upstream list.

For example, with N SkyWalking servers processing ALS, the Envoy ALS client could open a pool with X connections, where X <= Y based on configuration. The X connections would switch among those N servers periodically.

wu-sheng avatar Jul 20 '21 14:07 wu-sheng

The problem here is that connection pool connection rotation (either via max requests/streams, max connection length, etc.) doesn't really play well with gRPC streams. If we want to do this in a generic way and max connection length doesn't work, you could potentially do something generic like max bytes or max data frames, etc.

I think my larger question is why isn't max connection length sufficient? This seems like it should work for rotation with gRPC streams.

mattklein123 avatar Jul 20 '21 16:07 mattklein123

To sum up the opinions

The currently planned implementation (multiplexing connections in a pool and round-robining messages across them) is redundant. Therefore, it should be changed to work as follows.

Set the max connection duration (I think this is also valid for gRPC streaming), and connect to another pod when the connection is re-established.

The following strategies for connection re-establishment are currently under consideration.

  1. simple round robin
  2. a strategy to equalize the number of bytes sent in a message

If you have any suggestions, please let me know.

@snowp @alyssawilk @mattklein123

Shikugawa avatar Jul 21 '21 01:07 Shikugawa

For connection re-establishment isn't whatever LB policy is in effect OK? RR, LR, etc?

mattklein123 avatar Jul 21 '21 14:07 mattklein123

For connection re-establishment isn't whatever LB policy is in effect OK? RR, LR, etc?

It will be affected by max_stream_duration, but part of my requirement is satisfied by this feature: rotating the gRPC stream. So we can apply an H2 LB policy such as RR. The other issue is that if the selected host is unavailable or is reported unhealthy by the health check service, Envoy should rebalance the messages. Do you have additional concerns about this?

Shikugawa avatar Jul 21 '21 15:07 Shikugawa

The other issue is that if the selected host is unavailable or is reported unhealthy by the health check service, Envoy should rebalance the messages. Do you have additional concerns about this?

I think this is a different feature request. Basically you are asking for connections to be terminated immediately if HC says the host is down. IIRC we already put connections into draining when this happens, but we don't terminate them outright. I think that could be an optional feature on the connection pool.

mattklein123 avatar Jul 21 '21 15:07 mattklein123

I think that could be an optional feature on the connection pool.

I think this feature is similar to retry, so we may reuse retry logic after introducing a cluster-specific retry policy.

Shikugawa avatar Jul 21 '21 17:07 Shikugawa

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 20 '21 20:08 github-actions[bot]

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] avatar Sep 20 '21 00:09 github-actions[bot]

We still need this.

wu-sheng avatar Sep 20 '21 01:09 wu-sheng

Has there been any updates on this issue?

joelmathew003 avatar Apr 12 '24 05:04 joelmathew003

Setting maxStreamDuration and maxConnectionDuration enables connection rotation. Refer to https://github.com/envoyproxy/envoy/issues/16554#issuecomment-2117776263 .
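For reference, a minimal sketch of what that can look like on the ALS cluster (in YAML the fields are spelled `max_connection_duration` and `max_stream_duration` under `common_http_protocol_options`); the cluster name, address, and durations are placeholder values:

```yaml
clusters:
- name: als_cluster                  # hypothetical cluster that the ALS grpc_service points at
  type: STRICT_DNS
  connect_timeout: 1s
  load_assignment:
    cluster_name: als_cluster
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: { address: skywalking-oap, port_value: 11800 }  # placeholder receiver
  typed_extension_protocol_options:
    envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
      "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
      explicit_http_config:
        http2_protocol_options: {}     # gRPC ALS requires HTTP/2 on the cluster
      common_http_protocol_options:
        max_connection_duration: 300s  # periodically rotate the upstream connection
        max_stream_duration: 300s      # periodically cut the long-lived ALS stream
```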

lonfme avatar May 17 '24 15:05 lonfme