kpt-config-sync
[WIP] Prototype autoscaling by object count
- Add .status.source.objectCount to track number of objects found after parsing from source (after hydration, if any).
- Add .status.sync.objectCount to track number of objects in the inventory ResourceGroup's .spec.resources.
- Add resource doubling for the reconciler container when the object count (greater of sync & source) is at least 1,000 (autopilot only).
- Reduce the default resources for the reconciler on autopilot.
Notes:
- Since the applier doesn't send the inventory size over events, we have to query the ResourceGroup after each inventory update event. This could be made more efficient with changes to the applier in cli-utils, but isn't really a big deal.
- Autoscaling is currently very trivial and could be fine-tuned, but this is just a proof of concept to show an MVP and validate that it works in our tests.
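To make the threshold behavior concrete, here's a minimal Go sketch of the doubling logic described in the change list above. The function name and the baseline request/limit values are illustrative assumptions, not the actual kpt-config-sync defaults:

```go
package autoscale

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// reconcilerResources sketches the prototype behavior: start from reduced
// Autopilot defaults and double the reconciler container's resources once the
// object count (the greater of the source and sync counts) reaches 1,000.
// The baseline values here are placeholders for illustration.
func reconcilerResources(sourceObjectCount, syncObjectCount int) corev1.ResourceRequirements {
	cpu := resource.MustParse("250m")
	mem := resource.MustParse("256Mi")
	if maxInt(sourceObjectCount, syncObjectCount) >= 1000 {
		// Double both CPU and memory above the threshold (autopilot only).
		cpu = resource.MustParse("500m")
		mem = resource.MustParse("512Mi")
	}
	res := corev1.ResourceList{corev1.ResourceCPU: cpu, corev1.ResourceMemory: mem}
	// The prototype keeps requests == limits, matching Autopilot's behavior.
	return corev1.ResourceRequirements{Requests: res, Limits: res}
}

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}
```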
Example memory use during TestStressMemoryUsageOCI on GKE Autopilot:
Memory usage is not solely determined by the object count. For instance, 1k objects of the same Kubernetes Kind may consume less memory than 100 objects with various kinds. Additionally, the number of unmanaged cluster objects may also contribute to memory consumption.
I feel like it may not be very accurate to have a linear relation between the resource defaults and the object count.
All of that is of course true. But tracking object count is a lot easier than tracking memory and CPU usage. This is partly why I didn't make the steps more granular/smaller and why I left some headroom above the current test usage.
Realistically, autoscaling based on actual usage is problematic for multiple reasons:
- VPA uses metrics-server, which is a single point of failure, acting as a backend for an APIService, which causes API Discovery disruptions and outages when it's rescheduled or runs out of memory.
- We could alternatively use a custom provider that talks to prometheus and/or Cloud Monitoring, but this requires multiple provider implementations and manually configured RBAC/IAM and maybe service account management.
- Metrics are trailing/delayed, which is sometimes too late to scale before an OOMKill
- If you're at the limit of the CPU/Memory, you can't tell how much it wants to use, only that it wants more. So scaling up once may not be enough. You may have to scale up multiple times, causing significant disruption.
- Every time you change the resources, the reconciler pod needs to be evicted and rescheduled, which increases sync latency and reduces responsiveness.
Autoscaling based on some other metric, like the number of resources under management, allows us to sidestep a lot of the issues with VPA: the object count is available before syncing (after parsing), and after syncing the object count in the inventory is easily accessible, even if the count from the source has been reduced. So we can use that to add more resources before a sync and keep the resources high until the sync has completed and the inventory has been updated.
While this is less granular and less precise, it allows much faster scale up and scale down with significantly less disruption, no single point of failure that affects other controllers, and lower sync latency overall.
Here's a screenshot of the usage and limits with this prototype while running the --test-features=reconciliation-2 e2e tests, which don't sync a lot of files.
You can see that the 256Mi reconciler memory is being rounded up to 416Mi, because of the other containers in the pod and autopilot rounding up. So I might lower the reconciler default even more so it only rounds up to 256Mi. However, this will require reducing the CPU even more as well, which might impact sync latency.
As another datapoint, here is TestStressManyDeployments on Autopilot, which uses ~300MiB memory to deploy 1,000 Deployments.
Here's a new test called TestStressManyRootSyncs that deploys the same number of 1,000 Deployments, but uses 10 RootSyncs to do it, 100 each. You can see that with the autoscaling, these RootSyncs each get 436MiB limit and use 180MiB at most. So with the custom autoscaling each RootSync reconciler is saving 256MiB+ memory, or about 1/3 of the previous default. It would be more, but Autopilot is rounding up a lot.
It looks like we could also go lower as well, maybe another 256MiB after rounding, at least with only ~1k objects on the cluster. The story might be different if there were 10k+, but maybe it's ok to require resource overrides on large clusters...
Here's the same test with 2k deployments (10x200).
Memory use went up, but not by 100%; more like 45% (180Mi -> 260Mi).
For comparison, here's the same number of deployments but more reconcilers (20x100).
Interestingly, the peak memory use is basically the same as in the 10x200 test, even with half the objects per reconciler. It also peaks on the initial deploy, not just the teardown. The drop after the apply is also more pronounced, possibly due to memory pressure triggering Go's garbage collection more aggressively. And you can see that the reconcilers that peaked later peaked higher, probably because their watch caches were being populated by more objects on the cluster, even though all the RootSyncs were created around the same time.
Here's 20x200 (4000 deployments total).
With this configuration, some of the reconcilers reach the new lower 436Mi memory limit, but the test still completed. From this we can clearly see that the total number of resources being watched is a significant contributor to memory use: compared to the 10x200 test, each reconciler uses almost double the memory.
If this trend continues, I expect that adding more deployments to the cluster will cause the reconcilers to be OOMKilled, even though they're nowhere near the 1k object trigger that bumps up the memory limit.
Edit: Here's another 20x200 run that didn't hit the limits.
At 30x200 the test fails to complete. The reconcilers are repeatedly OOMKilled before they can finish syncing.
It seems like maybe autoscaling based purely on the number of objects being managed by the current reconciler isn't going to work on larger clusters (5k+ objects in watched resource types).
So I figured out how to modify the remediator to filter watches by label and added a filter that should only watch resources with the correct cli-utils.sigs.k8s.io/inventory-id label for that reconciler. However, I also discovered that the remediator doesn't actually use the Informer library, so it doesn't have an object cache. So filtering the remediator didn't affect the memory usage much.
Here's another 20x200 with the new filter:
Turns out, the label selector I used for the previous test was filtering out all objects, but the test still passed because remediation wasn't required for the test to pass.
It looks like there aren't any existing labels that can be used for selection, because the only pre-existing label uses the same value for all reconcilers.
So instead, I added a new label, or actually, copied an annotation to also be a label: config.k8s.io/owning-inventory. Then I added label selectors for the remediator and applier/destroyer to only watch objects with the label value that matches the current reconciler. This means we can stop watching objects synced by other reconcilers and mostly remove the memory bloat from large numbers of objects on the cluster.
Now, the memory mostly scales by number of objects being synced, rather than also by the number of objects on the cluster.
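For reference, here's roughly what that label filtering looks like with a dynamic client. This is only a sketch assuming a plain watch call; the function name is mine and this is not the actual remediator code:

```go
package remediator

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/dynamic"
)

// watchOwnedObjects starts a watch that only receives events for objects whose
// config.k8s.io/owning-inventory label matches this reconciler's inventory ID,
// so objects synced by other reconcilers are filtered out server-side.
func watchOwnedObjects(ctx context.Context, c dynamic.Interface, gvr schema.GroupVersionResource, inventoryID string) (watch.Interface, error) {
	selector := labels.Set{"config.k8s.io/owning-inventory": inventoryID}.AsSelector()
	return c.Resource(gvr).Watch(ctx, metav1.ListOptions{LabelSelector: selector.String()})
}
```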
Here's 20x200:
And here's 30x200:
As you can see, they both use about the same amount of memory for each reconciler: less than 150MiB.
10x500 seems to cap out a little higher at ~170Mi
10x700 got up to ~200Mi
I'm having quota problems with larger numbers of nodes, so here's 3x990, keeping it just under the 1,000-object threshold. It obviously doesn't need even half of the limit here, maxing out at ~215MiB.
2x2000 here. Maxes out at ~275Mi
1x5000 here. Maxes out at ~525Mi
Crunched some of those numbers (with the watch filtering) and got a pretty flat trend line:
These numbers are with a very simple Deployment spec replicated many times, so it may not match actual workloads, which will have both bigger (e.g. CRDs) and smaller (e.g. Service) objects. But it seems like a decent place to start.
New custom autoscaling logic is a bit more complicated.
It uses 1Mi for every 10 objects on top of a 128Mi base, then rounds down to the nearest 256Mi increment, with a 256Mi minimum. It calculates the memory first, then derives the CPU from the memory, keeping the 1:1 ratio required by autopilot.
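For example, under that scheme 5,000 objects works out to 128Mi + 500Mi = 628Mi, which rounds down to 512Mi, while anything below 3,840 objects rounds down to the 256Mi floor (my arithmetic from the formula as described, not measured values).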
That said, since this is just for the first container, the resources may still be rounded up by autopilot on all but the latest GKE v1.29+ clusters (created no earlier than v1.26).
For now, the requests & limits are the same, but we could change this in the future for new clusters, with a cluster version check. However, I'm not sure how to check that the cluster is using cgroups v2.
Looks like I missed a critical constraint. Apparently GKE Autopilot has a minimum memory per Pod of 512 MiB, for older clusters that don't support bursting. So I won't be able to get it down to 256MiB unless it's a new cluster on rapid channel.
https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-resource-requests#compute-class-min-max
Interesting data point here. With the watch filtering and custom autoscaling, a 672MiB limit still isn't enough to keep TestStressManyDeployments from failing with 1x4000 Deployments.
One difference between TestStressManyDeployments and TestStressManyRootSyncs is that TestStressManyDeployments uses root-sink to manage the Namespace and the Deployments, which means the watches are cluster-scoped, not namespace-scoped. But hopefully that doesn't matter with the watch filter.
The OOMKill is pretty obvious from the usage drop, but the limit drop is surprising. I think it's a bug.
I think the status.sync.objectCount is being removed from the RootSync during syncing. This might be due to the periodic sync status updates during syncing, or it might be in a status update prior to syncing. If so, it's possible that removing the status.sync.objectCount triggered the reconciler-manager to lower the memory, causing the OOMKill. Needs more investigation.
I think I also found a related bug that was causing the status.sync.commit to be removed after the source cache was reset, because the code that updates status.sync.commit gets the commit from the source cache. I think I fixed that one locally, though.
Ahh. ok. design problem...
~~The status.sync.objectCount is being updated before each apply attempt, to match the objects in the new status.sync.commit. But that means the status.sync.objectCount is decremented before the objects have actually been deleted. So that triggers scale down when the objectCount was large enough to previously cause scale up.~~

~~Since updating the status.sync.objectCount when the status.sync.commit is updated makes intuitive sense, I don't really want to change that behavior. Instead, we need another field to record the inventory size until after the applier has succeeded. Maybe just a status.inventoryCount, status.inventoryObjectCount, or status.managedObjectCount?~~
Correction: the status.source is being updated before each apply attempt, not the status.sync. The status.sync is being updated asynchronously while applying.
The actual problem is that the periodic sync status update often updates the status before any inventory events have been received. Normally this is fine, because the inventory size is cached in the applier. But if the reconciler was OOMKilled, it would have restarted and cleared the cache. So the next time it tries to apply, it would wipe out the status.sync.objectCount, causing downscaling.
I think that means we need to either:
- A) delay concurrent sync status updates until after the first inventory event from the applier (slightly delayed adding/updating of the sync status and syncing condition), or
- B) look up the inventory size before starting an apply attempt (an extra API call).
A is slightly more efficient, but more complicated to coordinate. B is simpler to implement.
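For illustration, option B could look something like the sketch below, which fetches the inventory ResourceGroup and counts .spec.resources before the apply attempt. The GVR and lookup-by-name shape are assumptions for this sketch, not the actual kpt-config-sync helpers:

```go
package applier

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// inventoryObjectCount fetches the inventory ResourceGroup and returns the
// number of entries in .spec.resources, so an apply attempt can report a sync
// object count even after a reconciler restart cleared the in-memory cache.
func inventoryObjectCount(ctx context.Context, c dynamic.Interface, namespace, name string) (int, error) {
	// Assumed GVR for the ResourceGroup inventory CRD.
	gvr := schema.GroupVersionResource{Group: "kpt.dev", Version: "v1alpha1", Resource: "resourcegroups"}
	rg, err := c.Resource(gvr).Namespace(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return 0, err
	}
	resources, found, err := unstructured.NestedSlice(rg.Object, "spec", "resources")
	if err != nil || !found {
		return 0, err
	}
	return len(resources), nil
}
```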
It required a lot of changes, but I managed to get the sync status to persist between reconciler restarts (except errors).
To get it to work I refactored the Applier and Updater so the Updater is the one caching sync status (errors, object count, commit). I added another event handling layer so the Applier could punt these to the Updater without needing to keep its own record of errors or object count. Then I added an initialization stage to the Updater so when it runs the first time it pulls the current status from the RSync.
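Roughly, the shape of that refactor looks like the sketch below; the type and method names are illustrative, not the actual types in the PR:

```go
package updater

import "context"

// applyEvent is a simplified stand-in for the applier's event type.
type applyEvent struct {
	Commit      string
	ObjectCount int
	Err         error
}

// syncStatusCache sketches the split described above: the Updater owns the
// cached sync status (errors, object count, commit) and the Applier just
// forwards events to it instead of keeping its own record.
type syncStatusCache interface {
	// handleApplyEvent records commit, object count, and errors as apply events arrive.
	handleApplyEvent(e applyEvent)
	// initFromSyncStatus seeds the cache from the RSync's current sync status on the
	// first run, so the status persists across reconciler restarts.
	initFromSyncStatus(ctx context.Context, commit string, objectCount int) error
}
```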
Unfortunately, errors can't be parsed back after being formatted, so we can't keep those, but that's mostly ok because the errors were already being invalidated before each Applier.Apply attempt anyway.
When I got that working, I found that my test was still being OOMKilled, even though the memory usage was always below the limit. So I had to increase the memory per object to make sure the node had enough memory.
The new memory calculation is: Max( RoundUpToNearestMultiple( (objectCount / 5) + MIN_MEM, STEP_SIZE ), MIN_MEM ), with MIN_MEM=128 and STEP_SIZE=32.
The CPU calculation follows the same pattern: Max( Ceil( MEM_LIMIT * CPU_RATIO ), CPU_MIN ), with CPU_MIN=125 and CPU_RATIO=1000/1024.
So for example:
- if the object count is 0, the reconciler memory limit is 128MiB and the CPU limit is 125m
- if the object count is 3000, the reconciler memory limit is 736MiB and the CPU limit is 720m
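As a sanity check, here's a small Go sketch of that calculation (helper names are mine, not the actual kpt-config-sync code). Note that applying the formula literally yields 719m for 3,000 objects, so the 720m in the example above suggests the real implementation rounds the CPU slightly differently:

```go
package autoscale

import "math"

const (
	minMemMiB   = 128
	stepMiB     = 32
	minCPUMilli = 125
	cpuRatio    = 1000.0 / 1024.0 // millicores per MiB
)

// reconcilerMemoryMiB implements Max(RoundUpToNearestMultiple(count/5 + MIN_MEM, STEP_SIZE), MIN_MEM).
func reconcilerMemoryMiB(objectCount int) int {
	mem := objectCount/5 + minMemMiB
	mem = ((mem + stepMiB - 1) / stepMiB) * stepMiB // round up to a multiple of stepMiB
	if mem < minMemMiB {
		mem = minMemMiB
	}
	return mem
}

// reconcilerCPUMilli implements Max(Ceil(MEM_LIMIT * CPU_RATIO), CPU_MIN).
func reconcilerCPUMilli(memMiB int) int {
	cpu := int(math.Ceil(float64(memMiB) * cpuRatio))
	if cpu < minCPUMilli {
		cpu = minCPUMilli
	}
	return cpu
}
```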
So I modified one of the tests to deploy incrementally more Deployments, pruning them all after each step.
The test took 2.1hrs, so I probably won't keep it running in the e2e tests, but I might leave it in as skipped, to run on-demand.
This is on GKE Autopilot v1.27.11-gke.1062000 (like the rest). So no bursting.
Deployments | Mem Usage (MiB) | Mem Limit (MiB) | CPU Usage (cores) | CPU Limit (cores) |
---|---|---|---|---|
500 | 211 | 416 | 0.096 | 0.47 |
1000 | 289 | 416 | 0.21 | 0.47 |
1500 | 386 | 448 | 0.193 | 0.47 |
2000 | 495 | 672 | 0.218 | 0.72 |
2500 | 561 | 672 | 0.215 | 0.72 |
3000 | 603 | 736 | 0.341 | 0.72 |
3500 | 709 | 928 | 0.307 | 0.97 |
4000 | 755 | 960 | 0.386 | 0.97 |