kpt-config-sync
[WIP] Prototype autoscaling by object count
- Add .status.source.objectCount to track number of objects found after parsing from source (after hydration, if any).
- Add .status.sync.objectCount to track number of objects in the inventory ResourceGroup's .spec.resources.
- Add resource doubling for the reconciler container when the object count (greater of sync & source) is at least 1,000 (autopilot only).
- Reduce the default resources for the reconciler on autopilot.
Notes:
- Since the applier doesn't send the inventory size over events, we have to query the ResourceGroup after each inventory update event. This could be made more efficient with changes to the applier in cli-utils, but isn't really a big deal.
- Autoscaling is currently very trivial and could be fine-tuned, but this is just a proof of concept to show an MVP and validate that it works in our tests.
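To make the threshold behavior concrete, here's a minimal Go sketch of the doubling logic described in the change list above. The function name and the baseline request/limit values are illustrative assumptions, not the actual kpt-config-sync defaults:

```go
package autoscale

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// reconcilerResources sketches the prototype behavior: start from reduced
// Autopilot defaults and double the reconciler container's resources once the
// object count (the greater of the source and sync counts) reaches 1,000.
// The baseline values here are placeholders for illustration.
func reconcilerResources(sourceObjectCount, syncObjectCount int) corev1.ResourceRequirements {
	cpu := resource.MustParse("250m")
	mem := resource.MustParse("256Mi")
	if maxInt(sourceObjectCount, syncObjectCount) >= 1000 {
		// Double both CPU and memory above the threshold (autopilot only).
		cpu = resource.MustParse("500m")
		mem = resource.MustParse("512Mi")
	}
	res := corev1.ResourceList{corev1.ResourceCPU: cpu, corev1.ResourceMemory: mem}
	// The prototype keeps requests == limits, matching Autopilot's behavior.
	return corev1.ResourceRequirements{Requests: res, Limits: res}
}

func maxInt(a, b int) int {
	if a > b {
		return a
	}
	return b
}
```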
Example memory use during TestStressMemoryUsageOCI on GKE Autopilot:
Memory usage is not solely determined by the object count. For instance, 1k objects of the same Kubernetes Kind may consume less memory than 100 objects with various kinds. Additionally, the number of unmanaged cluster objects may also contribute to memory consumption.
I feel like it may not be very accurate to have a linear relation between the resource defaults and the object count.
All of that is of course true. But tracking object count is a lot easier than tracking memory and CPU usage. This is partly why I didn't make the steps more granular/smaller and why I left some headroom above the current test usage.
Realistically, autoscaling based on actual usage is problematic for multiple reasons:
- VPA uses metrics-server, which is a single point of failure, acting as a backend for an APIService, which causes API Discovery disruptions and outages when it's rescheduled or runs out of memory.
- We could alternatively use a custom provider that talks to prometheus and/or Cloud Monitoring, but this requires multiple provider implementations and manually configured RBAC/IAM and maybe service account management.
- Metrics are trailing/delayed, which is sometimes too late to scale before an OOMKill
- If you're at the limit of the CPU/Memory, you can't tell how much it wants to use, only that it wants more. So scaling up once may not be enough. You may have to scale up multiple times, causing significant disruption.
- Every time you change the resources, the reconciler pod needs to be evicted and rescheduled, which increases sync latency and reduces responsiveness.
Autoscaling based on some other metric, like the number of resources under management, allows us to sidestep a lot of the issues with VPA: the object count is available before syncing (after parsing), and after syncing the object count in the inventory is easily accessible, even if the count from the source has been reduced. So we can use that to add more resources before a sync and keep the resources high until the sync has completed and the inventory has been updated.
While this is less granular and less precise, it allows much faster scale up and scale down with significantly less disruption, no single point of failure that affects other controllers, and lower sync latency overall.
Here's a screenshot of the usage and limits with this prototype while running the --test-features=reconciliation-2 e2e tests, which don't sync a lot of files.
You can see that the 256Mi reconciler memory is being rounded up to 416Mi, because of the other containers in the pod and autopilot rounding up. So I might lower the reconciler default even more so it only rounds up to 256Mi. However, this will require reducing the CPU even more as well, which might impact sync latency.
As another datapoint, here is TestStressManyDeployments on Autopilot, which uses ~300MiB memory to deploy 1,000 Deployments.
Here's a new test called TestStressManyRootSyncs that deploys the same number of 1,000 Deployments, but uses 10 RootSyncs to do it, 100 each. You can see that with the autoscaling, these RootSyncs each get 436MiB limit and use 180MiB at most. So with the custom autoscaling each RootSync reconciler is saving 256MiB+ memory, or about 1/3 of the previous default. It would be more, but Autopilot is rounding up a lot.
It looks like we could also go lower as well, maybe another 256MiB after rounding, at least with only ~1k objects on the cluster. The story might be different if there were 10k+, but maybe it's ok to require resource overrides on large clusters...
Here's the same test with 2k deployments (10x200).
Memory use went up, but not by 100%; more like 45% (180Mi -> 260Mi).
For comparison, here's the same number of deployments but more reconcilers (20x100).
Interestingly, the peak memory use is basically the same as in the 10x200 test, even with half the objects per reconciler. It also peaks on the initial deploy, not just the teardown. The drop after the apply is also more pronounced, possibly due to memory pressure triggering Go's garbage collection more aggressively. And you can see that the reconcilers that peaked later peaked higher, probably because their watch caches were being populated by more objects on the cluster, even though all the RootSyncs were created around the same time.
Here's 20x200 (4000 deployments total).
With this configuration, some of the reconcilers reach the new lower 436Mi memory limit, but the test still completed. From this we can clearly see that the total number of resources being watched is a significant contributor to memory use: compared to the 10x200 test, each reconciler uses almost double the memory.
If this trend continues, I expect that adding more deployments to the cluster will cause the reconcilers to be OOMKilled, even though they're nowhere near the 1k object trigger that bumps up the memory limit.
Edit: Here's another 20x200 run that didn't hit the limits.
At 30x200 the test fails to complete. The reconcilers are repeatedly OOMKilled before they can finish syncing.
It seems like maybe autoscaling based purely on the number of objects being managed by the current reconciler isn't going to work on larger clusters (5k+ objects in watched resource types).
So I figured out how to modify the remediator to filter watches by label and added a filter that should only watch resources with the correct cli-utils.sigs.k8s.io/inventory-id label for that reconciler. However, I also discovered that the remediator doesn't actually use the Informer library, so it doesn't have an object cache. So filtering the remediator didn't affect the memory usage much.
Here's another 20x200 with the new filter:
Turns out, the label selector I used for the previous test was filtering out all objects, but the test still passed because remediation wasn't required for the test to pass.
It looks like there aren't any existing labels that can be used for selection, because the only pre-existing label uses the same value for all reconcilers.
So instead, I added a new label, or actually, copied an annotation to also be a label: config.k8s.io/owning-inventory. Then I added label selectors for the remediator and applier/destroyer to only watch objects with the label value that matches the current reconciler. This means we can stop watching objects synced by other reconcilers and mostly remove the memory bloat from large numbers of objects on the cluster.
Now, the memory mostly scales by number of objects being synced, rather than also by the number of objects on the cluster.
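For reference, here's roughly what that label filtering looks like with a dynamic client. This is only a sketch assuming a plain watch call; the function name is mine and this is not the actual remediator code:

```go
package remediator

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/dynamic"
)

// watchOwnedObjects starts a watch that only receives events for objects whose
// config.k8s.io/owning-inventory label matches this reconciler's inventory ID,
// so objects synced by other reconcilers are filtered out server-side.
func watchOwnedObjects(ctx context.Context, c dynamic.Interface, gvr schema.GroupVersionResource, inventoryID string) (watch.Interface, error) {
	selector := labels.Set{"config.k8s.io/owning-inventory": inventoryID}.AsSelector()
	return c.Resource(gvr).Watch(ctx, metav1.ListOptions{LabelSelector: selector.String()})
}
```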
Here's 20x200:
And here's 30x200:
As you can see, they both use about the same amount of memory for each reconciler: less than 150MiB.
10x500 seems to cap out a little higher at ~170Mi
10x700 got up to ~200Mi
I'm having quota problems with larger numbers of nodes, so here's 3x990, keeping it just under the 1,000-object threshold. It obviously doesn't need even half of the limit here, maxing out at ~215MiB.
2x2000 here. Maxes out at ~275Mi
1x5000 here. Maxes out at ~525Mi
Crunched some of those numbers (with the watch filtering) and got a pretty flat trend line:
These numbers are with a very simple Deployment spec replicated many times, so it may not match actual workloads, which will have both bigger (e.g. CRDs) and smaller (e.g. Service) objects. But it seems like a decent place to start.
New custom autoscaling logic is a bit more complicated.
It uses 1Mi for every 10 objects on top of a 128Mi base, then rounds down to the nearest 256Mi increment, with a 256Mi minimum. It calculates the memory first, then derives the CPU from the memory, keeping the 1:1 ratio required by autopilot.
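For example, under that scheme 5,000 objects works out to 128Mi + 500Mi = 628Mi, which rounds down to 512Mi, while anything below 3,840 objects rounds down to the 256Mi floor (my arithmetic from the formula as described, not measured values).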
That said, since this is just for the first container, the resources may still be rounded up by autopilot on all but the latest GKE v1.29+ clusters (created no earlier than v1.26).
For now, the requests & limits are the same, but we could change this in the future for new clusters, with a cluster version check. However, I'm not sure how to check that the cluster is using cgroups v2.
Looks like I missed a critical constraint. Apparently GKE Autopilot has a minimum memory per Pod of 512 MiB, for older clusters that don't support bursting. So I won't be able to get it down to 256MiB unless it's a new cluster on rapid channel.
https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-resource-requests#compute-class-min-max
Interesting data point here. With the watch filtering and custom autoscaling, a 672MiB limit still isn't enough to keep TestStressManyDeployments from failing with 1x4000 Deployments.
One difference between TestStressManyDeployments and TestStressManyRootSyncs is that TestStressManyDeployments uses root-sink to manage the Namespace and the Deployments, which means the watches are cluster-scoped, not namespace-scoped. But hopefully that doesn't matter with the watch filter.
The OOMKill is pretty obvious from the usage drop, but the limit drop is surprising. I think it's a bug.
I think the status.sync.objectCount is being removed from the RootSync during syncing. This might be due to the periodic sync status updates during syncing, or it might be in a status update prior to syncing. If so, it's possible that removing the status.sync.objectCount triggered the reconciler-manager to lower the memory, causing the OOMKill. Needs more investigation.
I think I also found a related bug that was causing the status.sync.commit to be removed after the source cache was reset, because the code that updates status.sync.commit gets the commit from the source cache. I think I fixed that one locally, though.
Ahh. ok. design problem...
~~The status.sync.objectCount is being updated before each apply attempt, to match the objects in the new status.sync.commit. But that means the status.sync.objectCount is decremented before the objects have actually been deleted. So that triggers scale down when the objectCount was large enough to previously cause scale up.~~

~~Since updating the status.sync.objectCount when the status.sync.commit is updated makes intuitive sense, I don't really want to change that behavior. Instead, we need another field to record the inventory size until after the applier has succeeded. Maybe just a status.inventoryCount, status.inventoryObjectCount, or status.managedObjectCount?~~
Correction: the status.source is being updated before each apply attempt, not the status.sync. The status.sync is being updated asynchronously while applying.
The actual problem is that the periodic sync status update often updates the status before any inventory events have been received. Normally this is fine, because the inventory size is cached in the applier. But if the reconciler was OOMKilled, it would have restarted and cleared the cache. So the next time it tries to apply, it would wipe out the status.sync.objectCount, causing downscaling.
I think that means we need to either:
- A) delay concurrent sync status updates until after the first inventory event from the applier (slightly delayed adding/updating of the sync status and syncing condition), or
- B) look up the inventory size before starting an apply attempt (an extra API call).
A is slightly more efficient, but more complicated to coordinate. B is simpler to implement.
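For illustration, option B could look something like the sketch below, which fetches the inventory ResourceGroup and counts .spec.resources before the apply attempt. The GVR and lookup-by-name shape are assumptions for this sketch, not the actual kpt-config-sync helpers:

```go
package applier

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// inventoryObjectCount fetches the inventory ResourceGroup and returns the
// number of entries in .spec.resources, so an apply attempt can report a sync
// object count even after a reconciler restart cleared the in-memory cache.
func inventoryObjectCount(ctx context.Context, c dynamic.Interface, namespace, name string) (int, error) {
	// Assumed GVR for the ResourceGroup inventory CRD.
	gvr := schema.GroupVersionResource{Group: "kpt.dev", Version: "v1alpha1", Resource: "resourcegroups"}
	rg, err := c.Resource(gvr).Namespace(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return 0, err
	}
	resources, found, err := unstructured.NestedSlice(rg.Object, "spec", "resources")
	if err != nil || !found {
		return 0, err
	}
	return len(resources), nil
}
```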
It required a lot of changes, but I managed to get the sync status to persist between reconciler restarts (except errors).
To get it to work I refactored the Applier and Updater so the Updater is the one caching sync status (errors, object count, commit). I added another event handling layer so the Applier could punt these to the Updater without needing to keep its own record of errors or object count. Then I added an initialization stage to the Updater so when it runs the first time it pulls the current status from the RSync.
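Roughly, the shape of that refactor looks like the sketch below; the type and method names are illustrative, not the actual types in the PR:

```go
package updater

import "context"

// applyEvent is a simplified stand-in for the applier's event type.
type applyEvent struct {
	Commit      string
	ObjectCount int
	Err         error
}

// syncStatusCache sketches the split described above: the Updater owns the
// cached sync status (errors, object count, commit) and the Applier just
// forwards events to it instead of keeping its own record.
type syncStatusCache interface {
	// handleApplyEvent records commit, object count, and errors as apply events arrive.
	handleApplyEvent(e applyEvent)
	// initFromSyncStatus seeds the cache from the RSync's current sync status on the
	// first run, so the status persists across reconciler restarts.
	initFromSyncStatus(ctx context.Context, commit string, objectCount int) error
}
```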
Unfortunately, errors can't be parsed back after being formatted, so we can't keep those, but that's mostly ok because the errors were already being invalidated before each Applier.Apply attempt anyway.
When I got that working, I found that my test was still being OOMKilled, even though the memory usage was always below the limit. So I had to increase the memory per object to make sure the node had enough memory.
The new memory calculation is: Max( RoundUpToNearestMultiple( (objectCount / 5) + MIN_MEM, STEP_SIZE ), MIN_MEM ), with MIN_MEM=128 and STEP_SIZE=32.
The CPU calculation follows the same pattern: Max( Ceil( MEM_LIMIT * CPU_RATIO ), CPU_MIN ), with CPU_MIN=125 and CPU_RATIO=1000/1024.
So for example:
- if the object count is 0, the reconciler memory limit is 128MiB and the CPU limit is 125m
- if the object count is 3000, the reconciler memory limit is 736MiB and the CPU limit is 720m
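As a sanity check, here's a small Go sketch of that calculation (helper names are mine, not the actual kpt-config-sync code). Note that applying the formula literally yields 719m for 3,000 objects, so the 720m in the example above suggests the real implementation rounds the CPU slightly differently:

```go
package autoscale

import "math"

const (
	minMemMiB   = 128
	stepMiB     = 32
	minCPUMilli = 125
	cpuRatio    = 1000.0 / 1024.0 // millicores per MiB
)

// reconcilerMemoryMiB implements Max(RoundUpToNearestMultiple(count/5 + MIN_MEM, STEP_SIZE), MIN_MEM).
func reconcilerMemoryMiB(objectCount int) int {
	mem := objectCount/5 + minMemMiB
	mem = ((mem + stepMiB - 1) / stepMiB) * stepMiB // round up to a multiple of stepMiB
	if mem < minMemMiB {
		mem = minMemMiB
	}
	return mem
}

// reconcilerCPUMilli implements Max(Ceil(MEM_LIMIT * CPU_RATIO), CPU_MIN).
func reconcilerCPUMilli(memMiB int) int {
	cpu := int(math.Ceil(float64(memMiB) * cpuRatio))
	if cpu < minCPUMilli {
		cpu = minCPUMilli
	}
	return cpu
}
```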
So I modified one of the tests to deploy incrementally more Deployments, pruning them all after each step.
The test took 2.1hrs, so I probably won't keep it running in the e2e tests, but I might leave it in as skipped, to run on-demand.
This is on GKE Autopilot v1.27.11-gke.1062000 (like the rest). So no bursting.
Deployments | Mem Usage (MiB) | Mem Limit (MiB) | CPU Usage (cores) | CPU Limit (cores) |
---|---|---|---|---|
500 | 211 | 416 | 0.096 | 0.47 |
1000 | 289 | 416 | 0.21 | 0.47 |
1500 | 386 | 448 | 0.193 | 0.47 |
2000 | 495 | 672 | 0.218 | 0.72 |
2500 | 561 | 672 | 0.215 | 0.72 |
3000 | 603 | 736 | 0.341 | 0.72 |
3500 | 709 | 928 | 0.307 | 0.97 |
4000 | 755 | 960 | 0.386 | 0.97 |