
Switch to a per-proxy coalescing mechanism, instead of a global coalescing mechanism

steeling opened this pull request 3 years ago • 19 comments

This PR does 2 things:

  1. Pushes message coalescing from a global message broker to a per-proxy goroutine.
  2. Removes the ticker code in favor of a per-proxy ticker.

This has the following benefits:

  • ~700 lines of code deleted, resulting in huge simplifications. The message broker's coalescing code is arguably the single most complicated chunk of code in the code base; the new PR is significantly simpler and easier to reason about.
  • Coalescing per-proxy reduces wait times per proxy, since a proxy doesn't suffer from head of line blocking.
  • It should not be the job of the message broker to decide what messages the user may be interested in. By passing all messages we let the proxies make smarter decisions, and improve our package level abstractions.
  • It's a two-for-one on removing head of line blocking, where a single slow-to-update proxy forces all other proxies to wait.
  • Currently, proxies wait up to 10 seconds to receive an update due to the sliding window algorithm. We can flip that, where proxies get updated at the beginning of the window, instead of the end.
  • There are several layers of indirection between the event queue, worker queue, pubsub, workerpool, and finally the actual update. We can bring that down to 2 components.
  • This simultaneously fixes https://github.com/openservicemesh/osm/issues/4623

steeling avatar Oct 12 '22 16:10 steeling

@jsturtevant @shashankram you may be interested in this PR.

steeling avatar Oct 12 '22 16:10 steeling

This needs to be thoroughly reviewed and assessed before it makes it into the repo. Adding the do-not-merge label till I get the chance to review this and the motivation for the change.

shashankram avatar Oct 12 '22 16:10 shashankram

This PR does 2 things:

  1. Pushes message coalescing from a global message broker to a per-proxy goroutine.

Could you measure the performance impact at scale with this change? I'd be interested in understanding the bottleneck and how much improvement this change results in.

  1. Remove the ticker code in favor of a per-proxy ticker.

Could you describe the motivation for this change? Why is a per-proxy ticker required? Ticker is currently only used to recover from event-propagation bugs in the code, to recover the system through periodic resyncs. I don't understand the need to optimize this code path, and if there is a different reason to do so.

This has the following benefits:

  • ~700 lines of code deleted, resulting in huge simplifications. The message broker's coalescing code is arguably the single most complicated chunk of code in the code base; the new PR is significantly simpler and easier to reason about.

I disagree with the assessment that "coalescing code is arguably the single most complicated chunk of code in the code base". In my opinion, it's rather a well-contained change scoped to do what it is meant to do - coalesce events received in close proximity to reduce the number of times proxy configurations are computed.

  • Coalescing per-proxy reduces wait times per proxy, since a proxy doesn't suffer from head of line blocking.

Have you measured the impact of the wait-time? What order of magnitude in units of time are we talking about here?

  • It should not be the job of the message broker to decide what messages the user may be interested in. By passing all messages we let the proxies make smarter decisions, and improve our package level abstractions.

I disagree. The job of the broker was to exactly do this - streamline event propagation. What smarter decisions do the proxies need to take other than computing the config based on events?

  • It's a two-for-one on removing head of line blocking, where a single slow-to-update proxy forces all other proxies to wait.

Previously, we had a dedicated set of workers that handle the proxy configuration updates for all proxies. The HOL blocking would happen more so due to serialized computation of updates. Would parallelizing this suffice to reduce this problem? Could you share some numbers on how much improvement are we talking about?

  • Currently, proxies wait up to 10 seconds to receive an update due to the sliding window algorithm. We can flip that, where proxies get updated at the beginning of the window, instead of the end.

What's the need to flip this? Is there an additional benefit not described here?

Could you describe how this fixes the memory retention issue?

shashankram avatar Oct 12 '22 17:10 shashankram

Pushes message coalescing from a global message broker to a per-proxy goroutine.

Reading through https://github.com/openservicemesh/osm/issues/4623 it seems like it was using the ratelimiter queue to buffer events from the k8s informers. I am not seeing that logic here but maybe I am missing it?

I agree we should run through some tests (maybe similar to the ones @nojnhuh ran before)? Can we run this against the load tests as well?

jsturtevant avatar Oct 12 '22 18:10 jsturtevant

There are a lot of requests to provide performance numbers. I get the desire to do that; however, we unfortunately lack good tooling to test this. I'll try to add some benchmarks later, but I'd prefer to not block the PR on that. Other responses in-line

This PR does 2 things:

  1. Pushes message coalescing from a global message broker to a per-proxy goroutine.

Could you measure the performance impact at scale with this change? I'd be interested in understanding the bottleneck and how much improvement this change results in.

  1. Remove the ticker code in favor of a per-proxy ticker.

Could you describe the motivation for this change? Why is a per-proxy ticker required? Ticker is currently only used to recover from event-propagation bugs in the code, to recover the system through periodic resyncs. I don't understand the need to optimize this code path, and if there is a different reason to do so.

I don't need to include it here, but having a ticker on a global level might force a proxy to re-update when it has recently re-updated, but the rest haven't. Bringing it down to per-proxy allows us to have fewer updates.

This has the following benefits:

  • ~700 lines of code deleted, resulting in huge simplifications. The message broker's coalescing code is arguably the single most complicated chunk of code in the code base; the new PR is significantly simpler and easier to reason about.

I disagree with the assessment that "coalescing code is arguably the single most complicated chunk of code in the code base". In my opinion, it's rather a well-contained change scoped to do what it is meant to do - coalesce events received in close proximity to reduce the number of times proxy configurations are computed.

Let's get some other input here; I have a hard time grokking the semantics of what the code does myself, and I'd suspect others do as well.

  • Coalescing per-proxy reduces wait times per proxy, since a proxy doesn't suffer from head of line blocking.

Have you measured the impact of the wait-time? What order of magnitude in units of time are we talking about here?

Not sure, but I'm happy with any improvement if the code is ready to go. I think knowing the magnitude of the improvement is useful for planning purposes to pick out low hanging fruit, but if we see an improvement I'm not sure how much we need to measure it, except to pat ourselves on the back for a job well done?

  • It should not be the job of the message broker to decide what messages the user may be interested in. By passing all messages we let the proxies make smarter decisions, and improve our package level abstractions.

I disagree. The job of the broker was to exactly do this - streamline event propagation. What smarter decisions do the proxies need to take other than computing the config based on events?

But that's the whole reason for packages. To create separation of concerns. One package should not have insight into other packages, or how other packages may use this package. This fundamentally creates weak levels of abstraction.

  • It's a two-for-one on removing head of line blocking, where a single slow-to-update proxy forces all other proxies to wait.

Previously, we had a dedicated set of workers that handle the proxy configuration updates for all proxies. The HOL blocking would happen more so due to serialized computation of updates. Would parallelizing this suffice to reduce this problem? Could you share some numbers on how much improvement are we talking about?

It's the parallelization + the fact that there is no longer any blocking code when reading from the message broker. The TryLock accomplishes this. Previously there were several places we might block.

  1. RateLimitedAdd from informer -> message broker (this caused #4623)
  2. Adding to the pubsub from the message broker, if another proxy was performing an update
  3. Reading from the cert rotation pub sub in the proxy update routine, if it's waiting on another update to finish. This can block future calls to IssueCertificate (for all certificates) if it gets backed up. This also causes blocking on the proxy update pubsub.
  4. Reading from the proxy update pubsub. Also causes blocking on the cert rotation pubsub.

With no blocking we will alleviate memory pressure, GC, and wait times.
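
To illustrate the shape of that non-blocking guard (a rough sketch only; the type and method names here are made up, not the exact code in this PR):

```go
package proxyupdate

import "sync"

// proxyUpdater is an illustrative stand-in for the per-proxy goroutine.
type proxyUpdater struct {
	mu sync.Mutex // held while an update for this proxy is in flight
}

// onBroadcast is invoked for every broadcast event and never blocks the
// sender: if an update is already running for this proxy, the event is
// dropped, since the running update computes config from the latest state.
func (p *proxyUpdater) onBroadcast(update func()) {
	if !p.mu.TryLock() { // requires Go 1.18+
		return
	}
	go func() {
		defer p.mu.Unlock()
		update()
	}()
}
```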

  • Currently, proxies wait up to 10 seconds to receive an update due to the sliding window algorithm. We can flip that, where proxies get updated at the beginning of the window, instead of the end.

What's the need to flip this? Is there an additional benefit not described here?

I'd say a proxy not waiting for 10s to get an update, and instead getting an update immediately is a much better user experience.

Could you describe how this fixes the memory retention issue?

With no blocking we should drop messages very quickly, without backing up parts of the stack (informer's unbounded and non-shrinking ring buffer and workerqueue's unbounded queue).
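
As a rough sketch of what dropping quickly looks like (the channel wiring below is illustrative, not the actual pubsub implementation):

```go
// publish delivers a proxy-update signal to each subscriber without ever
// blocking: if a subscriber's buffer is full (it's mid-update), the signal
// is dropped instead of backing up the informer or the broker.
func publish(subscribers []chan struct{}) {
	for _, sub := range subscribers {
		select {
		case sub <- struct{}{}: // subscriber ready; deliver
		default: // subscriber busy; drop the signal
		}
	}
}
```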

steeling avatar Oct 12 '22 19:10 steeling

Pushes message coalescing from a global message broker to a per-proxy goroutine.

Reading through #4623 it seems like it was using the ratelimiter queue to buffer events from the k8s informers. I am not seeing that logic here but maybe I am missing it?

I agree we should run through some tests (maybe similar to the ones @nojnhuh ran before)? Can we run this against the load tests as well?

#4623 claims that the rate limiter queue is what's causing the issue. This removes that. See the comment above that goes into more detail

steeling avatar Oct 12 '22 19:10 steeling

Adding an excerpt from a comment above:

Simply put, we shouldn't block on any routine that reads from a channel. This alleviates 4 instances of such blocking:

  1. RateLimitedAdd from informer -> message broker (this caused https://github.com/openservicemesh/osm/issues/4623)
  2. Adding to the pubsub from the message broker, if another proxy was performing an update. This can prevent other proxies from receiving timely updates, reading from their cert rotation pubsub (thus blocking any future calls to IssueCertificate), and puts memory pressure on the workerqueues internal, unbounded, queue
  3. Reading from the cert rotation pub sub in the proxy update routine, if it's waiting on another update to finish. This can block future calls to IssueCertificate (for all certificates) if it gets backed up. This also causes blocking on the proxy update pubsub.
  4. Reading from the proxy update pubsub. Also causes blocking on the cert rotation pubsub.

steeling avatar Oct 12 '22 19:10 steeling

#4623 claims that the rate limiter queue is what's causing the issue. This removes that. See the comment above that goes into more detail

got it, lots to take in. I saw comments around it buffering events from K8s, but after re-reading (and now with more understanding) it seems we may have been misusing the RateLimiting queue (this seems to be due to the fact that we are actually not doing level based processing for the kube based events). We put a large object onto the queue, and we should really be putting just a key value for the object so we get the advantage that the docs call out: "Stingy: a single item will not be processed multiple times concurrently, and if an item is added multiple times before it can be processed, it will only be processed once."

I guess my concern was that if we get tons of events from k8s can we ensure that we are not overloading proxy updates? I think the trylock would allow for throwing away some of the events but I am not sure it blocks a sustained high volume set of events (which is what runProxyUpdateDispatcher did I think?)

  1. RateLimitedAdd from informer -> message broker (this caused https://github.com/openservicemesh/osm/issues/4623)

I was digging through the code and didn't quite figure out how AddRateLimited caused the internal ring buffer to grow, but the growth does seem to be what is causing the issue (they call that out explicitly).

  1. Adding to the pubsub from the message broker, if another proxy was performing an update. This can prevent other proxies from receiving timely updates, reading from their cert rotation pubsub (thus blocking any future calls to IssueCertificate), and puts memory pressure on the workerqueues internal, unbounded, queue

  2. Reading from the cert rotation pub sub in the proxy update routine, if it's waiting on another update to finish. This can block future calls to IssueCertificate (for all certificates) if it gets backed up. This also causes blocking on the proxy update pubsub.

  3. Reading from the proxy update pubsub. Also causes blocking on the cert rotation pubsub.

This blocking looks to be coming from https://github.com/openservicemesh/osm/blob/88af2f10f928f34221b67a671576d48936a75aab/pkg/osm/callbacks.go#L66-L71 and https://github.com/openservicemesh/osm/blob/88af2f10f928f34221b67a671576d48936a75aab/pkg/osm/callbacks.go#L82-L94 if I am understanding properly?

jsturtevant avatar Oct 12 '22 22:10 jsturtevant

There are a lot of requests to provide performance numbers. I get the desire to do that; however, we unfortunately lack good tooling to test this.

I thought we had load testing? I would think we would see improvement there?

Also if this fixes https://github.com/openservicemesh/osm/issues/4623 we should be able to re-run the experiments there and see improvement?

jsturtevant avatar Oct 12 '22 22:10 jsturtevant

We put a large object onto the queue, and we should really be putting just a key value for the object so we get the advantage that the docs call out: "Stingy: a single item will not be processed multiple times concurrently, and if an item is added multiple times before it can be processed, it will only be processed once."

Can you clarify what you mean with a key value? Channels have the same semantics of a single item processed once, so we're still good on that front. The behavior around adding an item multiple times is interesting, but not critical IMO.

I guess my concern was that if we get tons of events from k8s can we ensure that we are not overloading proxy updates? I think the trylock would allow for throwing away some of the events but I am not sure it blocks a sustained high volume set of events (which is what runProxyUpdateDispatcher did I think?)

The proxyUpdater will drop all updates while its update is running. As soon as its update finishes, if another update comes in, it will immediately schedule another update on the worker pool (this does not necessarily execute immediately), and then begin dropping messages again, until the update is completed (including during the time it is sitting on the worker pool). So it will drop messages at a very fast rate. I'll look to add a benchmark for this as well.
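
Roughly, the pattern looks like this (a sketch with made-up names; the real PR wires this through the workerpool):

```go
package proxy

import "sync"

// updater sketches the per-proxy drop-and-reschedule behavior.
type updater struct {
	mu      sync.Mutex
	running bool // an update is queued or executing
	pending bool // at least one event arrived while running
}

// notify is called on every event. If an update is already in flight it only
// marks pending; otherwise it schedules one run. schedule must execute the
// function asynchronously (e.g. a workerpool submit), never inline.
func (u *updater) notify(schedule func(func()), doUpdate func()) {
	u.mu.Lock()
	defer u.mu.Unlock()
	if u.running {
		u.pending = true // coalesce: N events during a run collapse into one
		return
	}
	u.running = true
	schedule(u.runOnce(schedule, doUpdate))
}

func (u *updater) runOnce(schedule func(func()), doUpdate func()) func() {
	return func() {
		doUpdate() // compute and push the proxy config from current state
		u.mu.Lock()
		defer u.mu.Unlock()
		if u.pending {
			u.pending = false // events arrived mid-update: run exactly one more pass
			schedule(u.runOnce(schedule, doUpdate))
			return
		}
		u.running = false
	}
}
```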

I was digging through the code and didn't quite figure out how AddRateLimited caused the internal ring buffer to grow, but the growth does seem to be what is causing the issue (they call that out explicitly).

The rate limiting add will block reading from the event handler. We've seen this trigger when there are more k8s events/s than our rate limiter allows.

steeling avatar Oct 12 '22 23:10 steeling

Can you clarify what you mean with a key value? Channels have the same semantics of a single item processed once, so we're still good on that front. The behavior around adding an item multiple times is interesting, but not critical IMO.

In my experience, controllers (k8s upstream controllers and ones created by frameworks like kubebuilder) are written to handle events from the informer by creating namespaced keys. An example would be an add event for a deployment, which calls enqueue, which uses namespace/objectname to put it on the queue.

When I say it only gets added once: the queue is backed by a map store, so if the item is a simple string (as in the example above) it would only be stored once even if it is added to the queue multiple times.

Since we store a complex object on the queue we don't get that benefit, as it's always going to be different. We are also not level based, since we use the object that we stored instead of looking it up at time of processing.
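
Something like the standard controller pattern, sketched with upstream client-go helpers (illustrative only, not our current wiring):

```go
package main

import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Rate-limited queue keyed by "namespace/name" strings. Because the queue
	// is backed by a set, a burst of events for the same object collapses into
	// a single entry until a worker processes it.
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	handler := cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			// Enqueue only the key, not the full object.
			if key, err := cache.MetaNamespaceKeyFunc(obj); err == nil {
				queue.AddRateLimited(key)
			}
		},
	}
	_ = handler // in real code this handler is registered on the informer

	// A worker then pops keys with queue.Get(), looks the object up from the
	// informer's lister at processing time (level-based), and calls queue.Done(key).
	queue.ShutDown()
}
```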

jsturtevant avatar Oct 12 '22 23:10 jsturtevant

The rate limiting add will block reading from the event handler. We've seen this trigger when there are more k8s events/s than our rate limiter allows.

ah interesting, I didn't realize it was a blocking call...

update: I see it now, AddAfter puts the item into a buffered channel: https://github.com/kubernetes/client-go/blob/e6d958c7a853e88554b8f859cfb32df07d867f3e/util/workqueue/delaying_queue.go#L66 and that would block while draining, which would grow the ring buffer

The proxyUpdater will drop all updates while its update is running. As soon as its update finishes, if another update comes in, it will immediately schedule another update on the worker pool (this does not necessarily execute immediately),

thanks, this makes sense. Under high load across many proxies, will the workerpool get overwhelmed and just be continuously processing? Trying to understand what this looks like in practice.

jsturtevant avatar Oct 13 '22 00:10 jsturtevant

There are a lot of requests to provide performance numbers. I get the desire to do that; however, we unfortunately lack good tooling to test this. I'll try to add some benchmarks later, but I'd prefer to not block the PR on that. Other responses in-line

This PR does 2 things:

  1. Pushes message coalescing from a global message broker to a per-proxy goroutine.

Could you measure the performance impact at scale with this change? I'd be interested in understanding the bottleneck and how much improvement this change results in.

  1. Remove the ticker code in favor of a per-proxy ticker.

Could you describe the motivation for this change? Why is a per-proxy ticker required? Ticker is currently only used to recover from event-propagation bugs in the code, to recover the system through periodic resyncs. I don't understand the need to optimize this code path, and if there is a different reason to do so.

I don't need to include it here, but having a ticker on a global level might force a proxy to re-update when it has recently re-updated, but the rest haven't. Bringing it down to per-proxy allows us to have fewer updates.

Given the Ticker component is meant to force events through when bugs exist, I am not in favor of optimizing this code into something overly complex. Best to tackle and evaluate this separately, independent of the event propagation mechanism.

How does per-proxy result in fewer updates, if the goal of Ticker is to always update the proxy? Is the behavior of Ticker being updated to something different than what it was originally intended for?

This has the following benefits:

  • ~700 lines of code deleted, resulting in huge simplifications. The message broker's coalescing code is arguably the single most complicated chunk of code in the code base; the new PR is significantly simpler and easier to reason about.

I disagree with the assessment that "coalescing code is arguably the single most complicated chunk of code in the code base". In my opinion, it's rather a well-contained change scoped to do what it is meant to do - coalesce events received in close proximity to reduce the number of times proxy configurations are computed.

Let's get some other input here; I have a hard time grokking the semantics of what the code does myself, and I'd suspect others do as well.

  • Coalescing per-proxy reduces wait times per proxy, since a proxy doesn't suffer from head of line blocking.

Have you measured the impact of the wait-time? What order of magnitude in units of time are we talking about here?

Not sure, but I'm happy with any improvement if the code is ready to go. I think knowing the magnitude of the improvement is useful for planning purposes to pick out low hanging fruit, but if we see an improvement I'm not sure how much we need to measure it, except to pat ourselves on the back for a job well done?

Sorry, I disagree that measuring the improvement is not important. Without this, it's hard to quantify whether this change even results in an improvement.

  • It should not be the job of the message broker to decide what messages the user may be interested in. By passing all messages we let the proxies make smarter decisions, and improve our package level abstractions.

I disagree. The job of the broker was to exactly do this - streamline event propagation. What smarter decisions do the proxies need to take other than computing the config based on events?

But that's the whole reason for packages. To create separation of concerns. One package should not have insight into other packages, or how other packages may use this package. This fundamentally creates weak levels of abstraction.

I don't see anything wrong with the broker doing what it was meant to do - streamline event processing and filtering, such that the proxies don't need to be aware of event types they don't need to know about (e.g. k8s specific ones). This doesn't mean one package has insights into another. The broker is simply a middleware.

  • It's a two-for-one on removing head of line blocking, where a single slow-to-update proxy forces all other proxies to wait.

Previously, we had a dedicated set of workers that handle the proxy configuration updates for all proxies. The HOL blocking would happen more so due to serialized computation of updates. Would parallelizing this suffice to reduce this problem? Could you share some numbers on how much improvement are we talking about?

It's the parallelization + the fact that there is no longer any blocking code when reading from the message broker. The TryLock accomplishes this. Previously there were several places we might block.

  1. RateLimitedAdd from informer -> message broker (this caused Event handling retains memory under moderate load #4623)
  2. Adding to the pubsub from the message broker, if another proxy was performing an update
  3. Reading from the cert rotation pub sub in the proxy update routine, if it's waiting on another update to finish. This can block future calls to IssueCertificate (for all certificates) if it gets backed up. This also causes blocking on the proxy update pubsub.
  4. Reading from the proxy update pubsub. Also causes blocking on the cert rotation pubsub.

With no blocking we will alleviate memory pressure, GC, and wait times.

Reducing/avoiding blocking seems like a good optimization. Will have to review the code to understand the new model.

  • Currently, proxies wait up to 10 seconds to receive an update due to the sliding window algorithm. We can flip that, where proxies get updated at the beginning of the window, instead of the end.

What's the need to flip this? Is there an additional benefit not described here?

I'd say a proxy not waiting for 10s to get an update, and instead getting an update immediately is a much better user experience.

The 10s wait only happens if there are subsequent events received within that 10s window, not every time. I disagree this is a better user experience, because in the case where a stream of events is not received in the 10s interval, they are still dispatched after the proxyUpdateSlidingWindow (2s) window. 10s is the max window wait time before a proxy will be updated; it's not the best/average case scenario.

Could you describe how this fixes the memory retention issue?

With no blocking we should drop messages very quickly, without backing up parts of the stack (informer's unbounded and non-shrinking ring buffer and workerqueue's unbounded queue).

So are you suggesting the events not be rate limited anymore?

shashankram avatar Oct 13 '22 14:10 shashankram

There are a lot of requests to provide performance numbers. I get the desire to do that; however, we unfortunately lack good tooling to test this. I'll try to add some benchmarks later, but I'd prefer to not block the PR on that. Other responses in-line

This PR does 2 things:

  1. Pushes message coalescing from a global message broker to a per-proxy goroutine.

Could you measure the performance impact at scale with this change? I'd be interested in understanding the bottleneck and how much improvement this change results in.

  1. Remove the ticker code in favor of a per-proxy ticker.

Could you describe the motivation for this change? Why is a per-proxy ticker required? Ticker is currently only used to recover from event-propagation bugs in the code, to recover the system through periodic resyncs. I don't understand the need to optimize this code path, and if there is a different reason to do so.

I don't need to include it here, but having a ticker on a global level might force a proxy to re-update when it has recently re-updated, but the rest haven't. Bringing it down to per-proxy allows us to have fewer updates.

Given the Ticker component is meant to force events through when bugs exist, I am not in favor of optimizing this code into something overly complex. Best to tackle and evaluate this separately, independent of the event propagation mechanism.

How does per-proxy result in fewer updates, if the goal of Ticker is to always update the proxy? Is the behavior of Ticker being updated to something different than what it was originally intended for?

Instead of sending a broadcast on some frequency, I make the ticker aware of the last update, so it will only tick after some time has passed since the last update, resetting the ticker on each update.
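
Roughly (a sketch; the period and channel wiring are placeholders rather than the exact implementation):

```go
package resync

import "time"

// runResync forces an update only after a quiet period: every regular update
// resets the timer, so a proxy that was just updated isn't immediately
// re-updated by a global broadcast tick.
func runResync(updated <-chan struct{}, forceUpdate func(), period time.Duration, stop <-chan struct{}) {
	timer := time.NewTimer(period)
	defer timer.Stop()
	for {
		select {
		case <-updated:
			// A normal update happened; push the forced resync further out.
			if !timer.Stop() {
				select {
				case <-timer.C:
				default:
				}
			}
			timer.Reset(period)
		case <-timer.C:
			forceUpdate() // no updates seen for a full period; resync now
			timer.Reset(period)
		case <-stop:
			return
		}
	}
}
```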

Agreed I can revisit this in a later PR.

This has the following benefits:

  • ~700 lines of code deleted, resulting in huge simplifications. The message broker's coalescing code is arguably the single most complicated chunk of code in the code base; the new PR is significantly simpler and easier to reason about.

I disagree with the assessment that "coalescing code is arguably the single most complicated chunk of code in the code base". In my opinion, it's rather a well-contained change scoped to do what it is meant to do - coalesce events received in close proximity to reduce the number of times proxy configurations are computed.

Let's get some other input here; I have a hard time grokking the semantics of what the code does myself, and I'd suspect others do as well.

  • Coalescing per-proxy reduces wait times per proxy, since a proxy doesn't suffer from head of line blocking.

Have you measured the impact of the wait-time? What order of magnitude in units of time are we talking about here?

Not sure, but I'm happy with any improvement if the code is ready to go. I think knowing the magnitude of the improvement is useful for planning purposes to pick out low hanging fruit, but if we see an improvement I'm not sure how much we need to measure it, except to pat ourselves on the back for a job well done?

Sorry, I disagree that measuring the improvement is not important. Without this, it's hard to quantify whether this change even results in an improvement.

Not having a proxy block on other proxy updates is an improvement, but will try to add some numbers here to demonstrate

  • It should not be the job of the message broker to decide what messages the user may be interested in. By passing all messages we let the proxies make smarter decisions, and improve our package level abstractions.

I disagree. The job of the broker was to exactly do this - streamline event propagation. What smarter decisions do the proxies need to take other than computing the config based on events?

But that's the whole reason for packages. To create separation of concerns. One package should not have insight into other packages, or how other packages may use this package. This fundamentally creates weak levels of abstraction.

I don't see anything wrong with the broker doing what it was meant to do - streamline event processing and filtering, such that the proxies don't need to be aware of event types they don't need to know about (e.g. k8s specific ones). This doesn't mean one package has insights into another. The broker is simply a middleware.

  • It's a two-for-one on removing head of line blocking, where a single slow-to-update proxy forces all other proxies to wait.

Previously, we had a dedicated set of workers that handle the proxy configuration updates for all proxies. The HOL blocking would happen more so due to serialized computation of updates. Would parallelizing this suffice to reduce this problem? Could you share some numbers on how much improvement are we talking about?

It's the parallelization + the fact that there is no longer any blocking code when reading from the message broker. The TryLock accomplishes this. Previously there were several places we might block.

  1. RateLimitedAdd from informer -> message broker (this caused Event handling retains memory under moderate load #4623)
  2. Adding to the pubsub from the message broker, if another proxy was performing an update
  3. Reading from the cert rotation pub sub in the proxy update routine, if it's waiting on another update to finish. This can block future calls to IssueCertificate (for all certificates) if it gets backed up. This also causes blocking on the proxy update pubsub.
  4. Reading from the proxy update pubsub. Also causes blocking on the cert rotation pubsub.

With no blocking we will alleviate memory pressure, GC, and wait times.

Reducing/avoiding blocking seems like a good optimization. Will have to review the code to understand the new model.

  • Currently, proxies wait up to 10 seconds to receive an update due to the sliding window algorithm. We can flip that, where proxies get updated at the beginning of the window, instead of the end.

What's the need to flip this? Is there an additional benefit not described here?

I'd say a proxy not waiting for 10s to get an update, and instead getting an update immediately is a much better user experience.

The 10s wait only happens if there are subsequent events received within that 10s window, not every time. I disagree this is a better user experience, because in the case where a stream of events is not received in the 10s interval, they are still dispatched after the proxyUpdateSlidingWindow (2s) window. 10s is the max window wait time before a proxy will be updated; it's not the best/average case scenario.

Yup, so 2s best case, 10s worst case, this improves both best and worst case.

Could you describe how this fixes the memory retention issue?

With no blocking we should drop messages very quickly, without backing up parts of the stack (informer's unbounded and non-shrinking ring buffer and workerqueue's unbounded queue).

So are you suggesting the events not be rate limited anymore?

Yup! This is a simple comp-sci-style big-O problem. We're already doing the deserialization (O(NM) operations for N events and average M bytes per k8s event). We have proof that the rate limiter is detrimental to our memory performance, and that removing it fixed it. This adds a constant factor in that we pass each event around a further step onto the pubsub, but it's then dropped. So overall it becomes c · O(NM), which is still O(NM).

steeling avatar Oct 13 '22 16:10 steeling

We have proof that the rate limiter is detrimental to our memory performance,

The way we are using the rate limiter workqueue is causing issues as we are not following the pattern in the way it was intended to be used.

Our current proxy update is essentially level based, but we have other areas that are not and will now get more events that are not buffered in any way (https://github.com/openservicemesh/osm/blob/88af2f10f928f34221b67a671576d48936a75aab/pkg/ingress/gateway.go#L65-L76). Could this be an issue?

jsturtevant avatar Oct 13 '22 16:10 jsturtevant

We have proof that the rate limiter is detrimental to our memory performance,

The way we are using the rate limiter workqueue is causing issues as we are not following the pattern in the way it was intended to be used.

Our current proxy update is essentially level based, but we have other areas that are not and will now get more events that are not buffered in any way (

https://github.com/openservicemesh/osm/blob/88af2f10f928f34221b67a671576d48936a75aab/pkg/ingress/gateway.go#L65-L76

). Could this be an issue?

Note that it's only the proxy update pubsub that is coalesced, while the kube events pubsub is not, so no changes for those uses

steeling avatar Oct 13 '22 17:10 steeling

Note that it's only the proxy update pubsub that is coalesced, while the kube events pubsub is not, so no changes for those uses

the events were being rate limited by the workqueue; now they are being pushed directly through the pubsub, so I think it is a change in behavior?

jsturtevant avatar Oct 13 '22 18:10 jsturtevant

Note that it's only the proxy update pubsub that is coalesced, while the kube events pubsub is not, so no changes for those uses

the events were being rate limited by the workqueue; now they are being pushed directly through the pubsub, so I think it is a change in behavior?

ah, sorry, I got mixed up with rate limiting and coalescing.

Yes, they will no longer be rate limited, although I expect that is closer to what we desire.

Faster responses to an event happening, and it looks like we don't have a concern about causing overload due to the serial nature of these subscriptions.

steeling avatar Oct 13 '22 19:10 steeling

Note that the rea

There are a lot of requests to provide performance numbers. I get the desire to do that; however, we unfortunately lack good tooling to test this. I'll try to add some benchmarks later, but I'd prefer to not block the PR on that. Other responses in-line

This PR does 2 things:

  1. Pushes message coalescing from a global message broker to a per-proxy goroutine.

Could you measure the performance impact at scale with this change? I'd be interested in understanding the bottleneck and how much improvement this change results in.

  1. Remove the ticker code in favor of a per-proxy ticker.

Could you describe the motivation for this change? Why is a per-proxy ticker required? Ticker is currently only used to recover from event-propagation bugs in the code, to recover the system through periodic resyncs. I don't understand the need to optimize this code path, and if there is a different reason to do so.

I don't need to include it here, but having a ticker on a global level might force a proxy to re-update when it has recently re-updated, but the rest haven't. Bringing it down to per-proxy allows us to have fewer updates.

Given the Ticker component is meant to force events through when bugs exist, I am not in favor of optimizing this code into something overly complex. Best to tackle and evaluate this separately, independent of the event propagation mechanism. How does per-proxy result in fewer updates, if the goal of Ticker is to always update the proxy? Is the behavior of Ticker being updated to something different than what it was originally intended for?

Instead of sending a broadcast on some frequency, I make the ticker aware of the last update, so it will only tick after some time has passed since the last update, resetting the ticker on each update.

Agreed I can revisit this in a later PR.

This has the following benefits:

  • ~700 lines of code deleted, resulting in huge simplifications. The message broker's coalescing code is arguably the single most complicated chunk of code in the code base; the new PR is significantly simpler and easier to reason about.

I disagree with the assessment that "coalescing code is arguably the single most complicated chunk of code in the code base". In my opinion, it's rather a well-contained change scoped to do what it is meant to do - coalesce events received in close proximity to reduce the number of times proxy configurations are computed.

Let's get some other input here; I have a hard time grokking the semantics of what the code does myself, and I'd suspect others do as well.

  • Coalescing per-proxy reduces wait times per proxy, since a proxy doesn't suffer from head of line blocking.

Have you measured the impact of the wait-time? What order of magnitude in units of time are we talking about here?

Not sure, but I'm happy with any improvement if the code is ready to go. I think knowing the magnitude of the improvement is useful for planning purposes to pick out low hanging fruit, but if we see an improvement I'm not sure how much we need to measure it, except to pat ourselves on the back for a job well done?

Sorry, I disagree that measuring the improvement is not important. Without this, it's hard to quantify whether this change even results in an improvement.

Not having a proxy block on other proxy updates is an improvement, but will try to add some numbers here to demonstrate

If we are still using workerpools, then we aren't going to wake up all the N proxies at once anyway. This is not desirable, which is why we have the workerpool implementation to limit how many goroutines are actually dedicated to proxy config computation. I am interested in understanding what level of improvement we are talking about, and whether a rearchitecture is warranted in this regard.

  • It should not be the job of the message broker to decide what messages the user may be interested in. By passing all messages we let the proxies make smarter decisions, and improve our package level abstractions.

I disagree. The job of the broker was to exactly do this - streamline event propagation. What smarter decisions do the proxies need to take other than computing the config based on events?

But that's the whole reason for packages. To create separation of concerns. One package should not have insight into other packages, or how other packages may use this package. This fundamentally creates weak levels of abstraction.

I don't see anything wrong with the broker doing what it was meant to do - streamline event processing and filtering, such that the proxies don't need to be aware of event types they don't need to know about (e.g. k8s specific ones). This doesn't mean one package has insights into another. The broker is simply a middleware.

  • It's a two-for-one on removing head of line blocking, where a single slow-to-update proxy forces all other proxies to wait.

Previously, we had a dedicated set of workers that handle the proxy configuration updates for all proxies. The HOL blocking would happen more so due to serialized computation of updates. Would parallelizing this suffice to reduce this problem? Could you share some numbers on how much improvement are we talking about?

It's the parallelization + the fact that there is no longer any blocking code when reading from the message broker. The TryLock accomplishes this. Previously there were several places we might block.

  1. RateLimitedAdd from informer -> message broker (this caused Event handling retains memory under moderate load #4623)
  2. Adding to the pubsub from the message broker, if another proxy was performing an update
  3. Reading from the cert rotation pub sub in the proxy update routine, if it's waiting on another update to finish. This can block future calls to IssueCertificate (for all certificates) if it gets backed up. This also causes blocking on the proxy update pubsub.
  4. Reading from the proxy update pubsub. Also causes blocking on the cert rotation pubsub.

With no blocking we will alleviate memory pressure, GC, and wait times.

Reducing/avoiding blocking seems like a good optimization. Will have to review the code to understand the new model.

  • Currently, proxies wait up to 10 seconds to receive an update due to the sliding window algorithm. We can flip that, where proxies get updated at the beginning of the window, instead of the end.

What's the need to flip this? Is there an additional benefit not described here?

I'd say a proxy not waiting for 10s to get an update, and instead getting an update immediately is a much better user experience.

The 10s wait only happens if there are subsequent events received within that 10s window, not every time. I disagree this is a better user experience, because in the case where a stream of events is not received in the 10s interval, they are still dispatched after the proxyUpdateSlidingWindow (2s) window. 10s is the max window wait time before a proxy will be updated; it's not the best/average case scenario.

Yup, so 2s best case, 10s worst case, this improves both best and worst case.

The goal is to batch all events received within the 2s and 10s window, which the current code accomplishes. Why would we want to fire an event first and then batch the remaining? Is there a particular scenario you are trying to solve here?

Could you describe how this fixes the memory retention issue?

With no blocking we should drop messages very quickly, without backing up parts of the stack (informer's unbounded and non-shrinking ring buffer and workerqueue's unbounded queue).

So are you suggesting the events not be rate limited anymore?

Yup! This is a simple comp-sci-style big-O problem. We're already doing the deserialization (O(NM) operations for N events and average M bytes per k8s event). We have proof that the rate limiter is detrimental to our memory performance, and that removing it fixed it. This adds a constant factor in that we pass each event around a further step onto the pubsub, but it's then dropped. So overall it becomes c · O(NM), which is still O(NM).

The reason we have the rate limiter is that when we had the XDS state machine implemented to handle the XDS request-response proto, in scale environments with a lot of churn the proto took a while to converge, resulting in undesirable CPU spikes. The rate limiter was meant to rate limit events of the same kind. If we do not suffer from the same problem at scale, then we don't need to rate limit those events. But we should proceed with caution here and test the proposed changes at scale to ensure we aren't seeing regressions.

shashankram avatar Oct 13 '22 19:10 shashankram

I am generally in favor of simplifying and solving the couple issues we have. I agree with @shashankram we should do perf testing for these changes to ensure we are indeed solving the issues.

From looking deeper into this, it seems like we have 2 issues that need to be addressed and a 3rd that is under consideration.

The two issues I see:

  • OOM because of the way the workqueue is used with rate limiting and storing complex objects on it. This is blocking and backing up the informer, which grows the informer's ring buffer, and memory isn't reclaimed. (more info in https://github.com/openservicemesh/osm/pull/5197#issuecomment-1276856789)
  • Blocking when updating the proxies (more info in https://github.com/openservicemesh/osm/pull/5197#issuecomment-1276807687)

I think we should address both of these, and it should be fairly straightforward to demonstrate improvements for both.

The 3rd consideration is the sliding window and per-proxy updates. This one seems a bit trickier to prove improvements for. Do we have testing scenarios that we ran that would demonstrate we don't get spikes in CPU at certain loads?

For me, since we are doing level based processing in the proxy updates, I am not sure the sliding window adds much as long as we queue up an additional change if it comes in while things are processing. If I read it right, I think the sliding window actually delays processing as we wait for other events? In that case some type of immediate processing makes sense, but I am not sure if this would lead to the spiky CPU, which is why perf testing around these changes makes sense.

jsturtevant avatar Oct 21 '22 16:10 jsturtevant

This PR will be closed due to a long period of inactivity. If you would like this PR to remain open then please comment or update.

github-actions[bot] avatar Dec 21 '22 00:12 github-actions[bot]

PR closed due to inactivity.

github-actions[bot] avatar Dec 28 '22 00:12 github-actions[bot]