operator-lifecycle-manager
Reduce Frequency of Update Requests for Copied CSVs
Description of the change:
Please check out this doc scoping out this change: https://docs.google.com/document/d/1P4cSYEP05vDyuhBfilyuWgL5d5OOD5z7JlAyOxpPqps
In this PR we are resurrecting #3411 with the intent of fixing what that PR originally set out to fix. Follow-up work will address the subsequently revealed problem with metadata.generation/resourceVersion, as per the doc comment by @tmshort.
Motivation for the change: [from the linked doc, "How Did We Get Here"]
- Original Change for Memory Optimization: Sixteen months ago, we merged a PR in OLMv0 that converted the cache of copied ClusterServiceVersions (CSVs) to use PartialObjectMetadata types instead of caching the full CSV objects. This change was crucial for memory utilization performance gains, enabling OLM to run efficiently on MicroShift, a lightweight Kubernetes distribution designed for edge computing.
- Limited Access to Spec/Status: By using PartialObjectMetadata, we only have access to the metadata of copied CSVs, not their spec or status fields. This means the operator lacks the information needed to compare the full content of the copied CSVs with the originals.
- Removal of "Hash and Compare" Logic: The change inadvertently removed a core piece of the "hash and compare" logic. Previously, the operator used annotations containing hashes of the non-status and status fields of the original CSV to determine if a copied CSV needed updating. These annotations were not set on the copied CSVs after the change.
- Resulting in Excessive Updates: Without the ability to compare hashes, the operator began issuing updates for copied CSVs 100% of the time, regardless of whether they were in sync with the originals. This behavior introduced a high load on the Kubernetes API server, especially in environments with many namespaces and CSVs installed in AllNamespace mode. The increased load also led to higher audit log volumes, impacting users with increased logging costs.
Architectural changes:
- Reintroducing Annotations: A proposed fix PR adds back the annotations olm.operatorframework.io/nonStatusCopyHash and olm.operatorframework.io/statusCopyHash to the copied CSVs. These annotations store hashes of the non-status and status fields of the original CSV, respectively.
- Reducing Unnecessary Updates: By comparing these hashes, the operator can determine if the copied CSVs are out of sync with the originals and only issue updates when necessary. This reduces the frequency of update requests to the API server and lowers audit log volumes.
- Uncovering a New Bug: However, reintroducing the hash comparison logic will uncover a bug due to the use of PartialObjectMetadata for caching copied CSVs. Since we only have access to metadata, if a user manually modifies the spec or status of a copied CSV without changing the hash annotations, the operator cannot detect the change. The operator would incorrectly assume the copied CSV is in sync with the original, leading to potential inconsistencies.
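The hash-and-compare flow described above can be sketched in a few lines of Go. This is a self-contained illustration, not OLM's actual implementation: the toy CSV struct and the FNV-based hashOf helper are stand-ins, though the annotation keys are the real ones named in this PR.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Toy stand-in for the real ClusterServiceVersion type; illustrative only.
type CSV struct {
	Annotations map[string]string
	Spec        string
	Status      string
}

const (
	nonStatusHashAnnotation = "olm.operatorframework.io/nonStatusCopyHash"
	statusHashAnnotation    = "olm.operatorframework.io/statusCopyHash"
)

// hashOf returns a short FNV-1a hash of a printable value, mirroring
// the "deep hash" idea behind the annotations.
func hashOf(v interface{}) string {
	h := fnv.New64a()
	fmt.Fprintf(h, "%#v", v)
	return fmt.Sprintf("%x", h.Sum64())
}

// needsUpdate compares the original CSV's current hashes against the
// hashes recorded on the copy; an update is issued only on mismatch.
func needsUpdate(original, copied CSV) (specOutOfSync, statusOutOfSync bool) {
	specOutOfSync = copied.Annotations[nonStatusHashAnnotation] != hashOf(original.Spec)
	statusOutOfSync = copied.Annotations[statusHashAnnotation] != hashOf(original.Status)
	return specOutOfSync, statusOutOfSync
}

func main() {
	orig := CSV{Spec: "spec-v1", Status: "healthy"}
	copied := CSV{Annotations: map[string]string{
		nonStatusHashAnnotation: hashOf("spec-v1"),
		statusHashAnnotation:    hashOf("healthy"),
	}}
	spec, status := needsUpdate(orig, copied)
	fmt.Println(spec, status) // prints "false false": copy is in sync, no update issued
}
```

Note how the bug described above falls out of this scheme: if a user edits the copy's spec directly without touching the annotations, needsUpdate still reports "in sync", because only metadata is available to compare.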
Testing remarks:
Aside from the expected changes around the inability to track copied-CSV changes made by a user, we should be careful to test the following:
- Cannot Revert Memory Optimization: Reverting the changes that introduced PartialObjectMetadata caching is not feasible. The memory optimization is critical for running OLM on MicroShift and supporting edge computing scenarios where resources are constrained.
Reviewer Checklist
- [ ] Implementation matches the proposed design, or proposal is updated to match implementation
- [ ] Sufficient unit test coverage
- [ ] Sufficient end-to-end test coverage
- [ ] Bug fixes are accompanied by regression test(s)
- [ ] e2e tests and flake fixes are accompanied by evidence of flake testing, e.g. executing the test 100(0) times
- [ ] tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
- [ ] Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
- [ ] Docs updated or added to /doc
- [ ] Commit messages sensible and descriptive
- [ ] Tests marked as [FLAKE] are truly flaky and have an issue
- [ ] Code is properly formatted
In general, I think it lgtm. I'm curious why we need the annotations at all. Would it be sufficient to do something like:
generate the desired copied csv from the parent csv
get the existing copied csv
if not found => create desired copied csv and return
else => compute desired and existing non-status hashes
  if desiredNonStatusHash != existingNonStatusHash => update copied csv and return
  else, compute desired and existing status hashes
    if desiredStatusHash != existingStatusHash => update status and return
    else, return
I think the answer might be because the sync function is being driven off parent/non-copied csv events. Even so, we should be able to use the general approach outlined above (minus the returns). wdyt? I guess the only major difference is swapping memory (storing the hashes in the annotations) for computation (calculating both desired and current hashes). Maybe that's cleaner and adds fewer implementation concerns to the copied csv?
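The pure-compute approach outlined in the comment above can be sketched as a small Go function. This is a hypothetical illustration under the reviewer's assumptions, not OLM code: the CSV struct, the getExisting callback, and the returned action strings are all made up for the example; hashing stands in for the "compute desired and existing hashes" steps.

```go
package main

import (
	"errors"
	"fmt"
	"hash/fnv"
)

// Toy CSV; Spec and Status stand in for the real non-status/status fields.
type CSV struct {
	Spec, Status string
}

var errNotFound = errors.New("not found")

func hashOf(v string) string {
	h := fnv.New64a()
	fmt.Fprintf(h, "%s", v)
	return fmt.Sprintf("%x", h.Sum64())
}

// syncCopy implements the pure-compute flow: create if the copy is
// missing, update on non-status hash drift, else update status on
// status hash drift, else do nothing. It returns the action taken.
func syncCopy(parent CSV, getExisting func() (CSV, error)) string {
	desired := parent
	existing, err := getExisting()
	if errors.Is(err, errNotFound) {
		return "create"
	}
	if hashOf(desired.Spec) != hashOf(existing.Spec) {
		return "update"
	}
	if hashOf(desired.Status) != hashOf(existing.Status) {
		return "update-status"
	}
	return "no-op"
}

func main() {
	parent := CSV{Spec: "s1", Status: "ok"}
	fmt.Println(syncCopy(parent, func() (CSV, error) { return CSV{}, errNotFound }))                  // prints "create"
	fmt.Println(syncCopy(parent, func() (CSV, error) { return CSV{Spec: "s1", Status: "ok"}, nil })) // prints "no-op"
}
```

The trade-off the thread goes on to discuss is visible here: getExisting must return the full copied CSV, which the metadata-only cache cannot provide without a live GET.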
I've not looked at the exact changes in the PR, but having been involved when this bug first rolled around, the reason I determined we couldn't do this is that we only ever cached the copied CSV metadata and not the whole copied CSV.
IIRC the bug was originally concerned with the numerous update calls, which were happening because we always decided we needed to update the CSVs.
At the time, it was also determined that we could not change what we cached, because edge-computing solutions like MicroShift needed the reduction in memory footprint to run OLM.
Because of this, you would always have to make a GET call to the kube-apiserver for every copied CSV you need to evaluate. This could lead to at worst making twice the API calls we were already making -- a GET followed by an UPDATE as opposed to always issuing an UPDATE. At best, this was exactly the same -- always making a GET request.
If the constraints of not being able to change how the data is cached are still in play, you'll end up with the same problem. The bug was originally filed as "OLM makes too many calls to the Kubernetes API server, inflating the audit logs and thus our log ingestion/monitoring bill". Switching from always issuing an UPDATE to always issuing a GET and possibly an UPDATE would make this worse than just not fixing the bug.
> Because of this, you would always have to make a GET call to the kube-apiserver for every copied CSV you need to evaluate. This could lead to at worst making twice the API calls we were already making -- a GET followed by an UPDATE as opposed to always issuing an UPDATE. At best, this was exactly the same -- always making a GET request.
Gotcha. I missed that we weren't doing a live GET call but rather pulling it from the cache. That all makes sense. Thank you =D
It would be cool if we could add a regression test of some sort to measure the number of API server requests or log size or something that helps us verify/measure the effectiveness of the fix (and make sure we don't fall into the same trap again). Though, only if there is a relatively easy way to do it. Otherwise, could we post up some manual verification? i.e. pre-fix logs are big, post-fix logs are ok. Or maybe grabbing some API server stats or something...? wdyt?
@per replying to your comment:
> Would it be sufficient to do something like: generate the desired copied csv from the parent csv
Why annotations vs "just re-hash everything on the fly"
- Performance & churn: Marshaling the entire CSV spec+status and hashing it on every parent-CSV event could get expensive (especially in big clusters with many CSVs) and can lead to unnecessary Update calls (and status flaps) if you miss a subtle drift.
- Persistence across restarts: We need to remember what we last applied, even if the controller pod restarts or skips an event. Annotations live on the resource itself, so we don't need an external cache or to recompute "last applied" from scratch.
- Robust "drift-guard": By also storing observed generation & resourceVersion, we catch manual edits made directly to the copied CSV (spec drift or status drift) even if no new parent-CSV event fires. Without annotations you'd have to separately watch the copied CSVs, trigger extra reconciles, and still figure out whether you should stomp on user edits or not.
- Simpler reconciliation logic: Instead of a big "compute desired, compute existing, diff non-status, diff status, decide what to do" every single time, the annotations let you do a cheap string compare in most cases ("is desiredHash == lastAppliedHash?") and skip all the heavy lifting.
You're absolutely right that a pure-compute approach is conceptually possible, but to get the same guarantees (no missed updates, no unnecessary writes, survive restarts, catch manual drift) you'd end up re-implementing a lot of what annotations give you "for free." That was the driving motivation here: swap a tiny bit of metadata on the resource for a much simpler, more robust controller loop.
> Original Change for Memory Optimization: Sixteen months ago, we merged https://github.com/operator-framework/operator-lifecycle-manager/pull/3003 in OLMv0 that converted the cache of copied ClusterServiceVersions (CSVs) to use PartialObjectMetadata types instead of caching the full CSV objects.
The type of the Go objects in the cache is distinct from the API resource it's built from -- you could use a CSV ListerWatcher and configure a TransformFunc (https://pkg.go.dev/k8s.io/client-go/tools/cache#TransformFunc) on the informer that transforms a CSV into whatever compact representation you need.
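A hedged sketch of that suggestion: client-go's cache.TransformFunc has the signature func(interface{}) (interface{}, error) and is applied to objects before they land in the informer's store. The FullCSV/CompactCSV types below are toy stand-ins invented for the example (a real transform would operate on *v1alpha1.ClusterServiceVersion and be wired in via the informer's SetTransform); the point is only that the cached shape can keep the hash annotations while dropping the bulky spec and status.

```go
package main

import "fmt"

// Toy stand-ins: a full CSV and the compact representation we cache.
type FullCSV struct {
	Name, Namespace string
	Spec, Status    string // large payloads we don't want resident in memory
	NonStatusHash   string // hypothetical mirror of the hash annotations
	StatusHash      string
}

type CompactCSV struct {
	Name, Namespace string
	NonStatusHash   string // keep just enough to drive hash-and-compare
	StatusHash      string
}

// TransformFunc mirrors client-go's cache.TransformFunc signature.
type TransformFunc func(obj interface{}) (interface{}, error)

// compactCSVTransform drops spec/status, keeping identity plus the
// hash values needed for the comparison logic.
func compactCSVTransform(obj interface{}) (interface{}, error) {
	csv, ok := obj.(FullCSV)
	if !ok {
		return nil, fmt.Errorf("unexpected type %T", obj)
	}
	return CompactCSV{
		Name:          csv.Name,
		Namespace:     csv.Namespace,
		NonStatusHash: csv.NonStatusHash,
		StatusHash:    csv.StatusHash,
	}, nil
}

func main() {
	var transform TransformFunc = compactCSVTransform
	out, err := transform(FullCSV{Name: "etcd", Namespace: "ns1", Spec: "big", NonStatusHash: "abc"})
	fmt.Println(out, err)
}
```

This would sit between the PartialObjectMetadata approach (smallest footprint, no spec/status at all) and full caching: memory stays bounded, but the cache retains exactly the fields the hash-and-compare logic needs.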
PR needs rebase.
Closing in favor of #3597.