operator-lifecycle-manager
(bugfix): reduce frequency of update requests for copied CSVs
Description of the change: Reduces the frequency of update requests for copied CSVs by:
- Adding the `olm.operatorframework.io/nonStatusCopyHash` annotation to the copied CSVs, populated with a hash of all the non-status fields of the original CSV.
- Adding the `olm.operatorframework.io/statusCopyHash` annotation to the copied CSVs, populated with a hash of all the status fields of the original CSV.
- Updating the above annotations as necessary on the copied CSVs when changes are made to the original CSV.
This appears to have been the desired behavior of this function, as evidenced by: https://github.com/operator-framework/operator-lifecycle-manager/blob/461018e2e232865d97830b3accad9f1e8a58c7bc/pkg/controller/operators/olm/operatorgroup.go#L829-L830
The problem with the previous implementation was that these annotations were never set on the copied CSVs.
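The annotation scheme described above can be illustrated with a minimal, stdlib-only Go sketch. The `CSV` struct and the helper names here are hypothetical simplifications, not the real `ClusterServiceVersion` types or OLM's actual hashing helpers; the point is only the mechanism of stamping hashes at write time and skipping the update when they still match:

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// CSV is a hypothetical stand-in for a ClusterServiceVersion.
type CSV struct {
	Annotations map[string]string `json:"annotations,omitempty"`
	Spec        map[string]string `json:"spec,omitempty"`
	Status      map[string]string `json:"status,omitempty"`
}

const (
	nonStatusCopyHash = "olm.operatorframework.io/nonStatusCopyHash"
	statusCopyHash    = "olm.operatorframework.io/statusCopyHash"
)

// hashOf deterministically hashes a JSON-serializable value
// (encoding/json sorts map keys, so equal maps hash equally).
func hashOf(v interface{}) string {
	b, _ := json.Marshal(v)
	return fmt.Sprintf("%x", sha256.Sum256(b))
}

// annotateCopy stamps the copy with hashes of the original's non-status
// and status fields, so a later pass can cheaply decide whether an
// UPDATE is needed.
func annotateCopy(original, copy *CSV) {
	if copy.Annotations == nil {
		copy.Annotations = map[string]string{}
	}
	copy.Annotations[nonStatusCopyHash] = hashOf(original.Spec)
	copy.Annotations[statusCopyHash] = hashOf(original.Status)
}

// needsUpdate reports whether the copy is out of sync with the original.
func needsUpdate(original, copy *CSV) bool {
	return copy.Annotations[nonStatusCopyHash] != hashOf(original.Spec) ||
		copy.Annotations[statusCopyHash] != hashOf(original.Status)
}

func main() {
	orig := &CSV{Spec: map[string]string{"version": "1.0.0"}}
	cp := &CSV{}
	annotateCopy(orig, cp)
	fmt.Println(needsUpdate(orig, cp)) // false: hashes match, skip the UPDATE

	orig.Spec["version"] = "1.0.1"
	fmt.Println(needsUpdate(orig, cp)) // true: original changed, rewrite the copy
}
```

The bug described above was that the annotations were never written, so `needsUpdate` (or its real-world equivalent) could never find a matching hash and every sync issued an UPDATE.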
Motivation for the change:
- Reduce the frequency of `UPDATE` requests made to the Kubernetes API Server for copied CSVs
Architectural changes:
Testing remarks:
Reviewer Checklist
- [ ] Implementation matches the proposed design, or proposal is updated to match implementation
- [ ] Sufficient unit test coverage
- [ ] Sufficient end-to-end test coverage
- [ ] Bug fixes are accompanied by regression test(s)
- [ ] e2e tests and flake fixes are accompanied evidence of flake testing, e.g. executing the test 100(0) times
- [ ] tech debt/todo is accompanied by issue link(s) in comments in the surrounding code
- [ ] Tests are comprehensible, e.g. Ginkgo DSL is being used appropriately
- [ ] Docs updated or added to `/doc`
- [ ] Commit messages sensible and descriptive
- [ ] Tests marked as `[FLAKE]` are truly flaky and have an issue
- [ ] Code is properly formatted
Those annotations were set on the CSVs in the informer's cache when the hash comparison was introduced (https://github.com/operator-framework/operator-lifecycle-manager/commit/cf618d3e89a2ba4ade3c50cbb03ff37795a3e5fb). The observation was that CSV copies needed to be written if and only if the copy was out of sync with the original, so caching a pruned object with a hash was much cheaper than caching complete objects while still being able to stomp CSV copies at the appropriate times. This seems to have regressed as part of https://github.com/operator-framework/operator-lifecycle-manager/pull/3003.
Edit: Breaking the hash comparison would have masked this with the unnecessary update requests, but using a PartialObjectMetadata informer for CSV copies means that olm-operator doesn't have enough information to determine whether or not a copy is in sync with the original. Even if a hash of the original is included in an annotation when olm-operator writes a copy, if someone else makes a change to a copy's spec, that won't be observable by olm-operator.
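That limitation can be made concrete with a small sketch (the types are hypothetical simplifications of what a metadata-only informer caches): the sync check can only compare hashes stored in annotations, so a direct edit to the copy's spec, which touches neither the annotations nor anything in the metadata cache, is invisible to the comparison.

```go
package main

import "fmt"

// PartialMeta models what a metadata-only informer caches: annotations
// and other metadata, but no spec or status.
type PartialMeta struct {
	Annotations map[string]string
}

// inSyncByAnnotations is all olm-operator can compute from a
// metadata-only cache: compare the hash of the current original against
// the hash recorded on the copy when olm-operator last wrote it.
func inSyncByAnnotations(copyMeta PartialMeta, originalHash string) bool {
	return copyMeta.Annotations["olm.operatorframework.io/nonStatusCopyHash"] == originalHash
}

func main() {
	// olm-operator wrote the copy and stamped the original's hash.
	cached := PartialMeta{Annotations: map[string]string{
		"olm.operatorframework.io/nonStatusCopyHash": "abc123",
	}}

	// Someone else then mutates the copy's spec. The annotation is
	// untouched, and the spec isn't in the metadata cache at all, so
	// the check still reports "in sync" and the drift is never repaired.
	fmt.Println(inSyncByAnnotations(cached, "abc123")) // true, despite the drift
}
```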
@benluddy I think we want to push ahead with this PR for a couple reasons:
- reverting #3003 would be difficult at this late date because it had follow-on work AND we need the memory reduction.
- there have not been any OLM users noticing the introduced problems from #3003
- this PR stops sending updates to the API server once the annotations indicate the copy is in sync, which does help out users who have noticed the increased activity.
> there have not been any OLM users noticing the introduced problems from https://github.com/operator-framework/operator-lifecycle-manager/pull/3003
You should find with this PR that you can issue a write to a copied CSV that puts it out of sync with the original CSV, and this control loop will not bring the copy back into sync with the original.
Edit: I am a little confused by this response. Isn't this PR an attempt to fix a problem that was both reported by a user and introduced by the linked PR?
@benluddy I think what @bentito is trying to say is that we have received no complaints from users about the bug you called out, specifically that writes to a copied CSV that don't change the annotations would not be reverted.
My understanding of the current plan, and I may be a tad out of the loop due to hurricane preparations, is:
- Merge this PR to fix a critical bug causing users higher logging costs and load on the API server
- If bug reports come in regarding copied CSV writes not being reverted, investigate how to resolve that
I agree that reverting #3003 is the right approach to fix both the critical bug and the bug you called out that was introduced by the use of the PartialObjectMetadata informer. There are folks with concerns about reverting that change because we needed the performance gain to enable folks running OLM on the edge. Reverting #3003 and identifying another way to achieve enough of the performance gain for the edge use cases is probably still the best thing to do, but the priority is being placed on stopping the constant updates ASAP, and the consensus of the current maintainers seems to be that this PR is what allows us to achieve that and gives us some breathing room to re-assess.
I know it isn't a great answer, but hopefully this gives some insight as to the decisions the team has made re: continuing with this PR.
@everettraven I don't think Ben is suggesting that we revert https://github.com/operator-framework/operator-lifecycle-manager/pull/3003, I think he is suggesting that we don't introduce another regression on top of the previous one as a fix. With this change, I think we are putting ourselves in a situation where we are going to ignore spec changes in the copied csvs, which I'm pretty sure are considered in resolution.
@kevinrizza The regression is technically caused by #3003. It would just surface with this PR, because we add the annotations back that prevent updating 100% of the time.
That being said, maybe I'm missing that there is a different way to do this?
Thinking about this a bit more, maybe we can remove the hash annotation and always calculate the hash of the copied CSV and compare that?
EDIT: This isn't possible without reverting #3003 because the lister returns a PartialObjectMeta which doesn't contain spec or status information
EDIT EDIT: This could be possible, likely with significant refactoring. I haven't had time to dig into this more. @bentito is taking over this work since I'll be impacted by Hurricane Milton.
Seems likely that some of the observed memory reductions over caching pruned CSVs might have come from reducing the high watermark transient allocations from decoding a big list response (unlike CSVs, or any CRD-backed resource, a PartialObjectMetaList can be encoded as Protobuf and not only JSON).
> EDIT: This isn't possible without reverting https://github.com/operator-framework/operator-lifecycle-manager/pull/3003 because the lister returns a PartialObjectMeta which doesn't contain spec or status information
Right, without the full object, the copy loop can't see all the fields that might have diverged from the original. The metadata.generation field is automatically incremented for all CRs on updates that change non-metadata fields (https://github.com/kubernetes/kubernetes/blob/d9c46d8ecb1ede9be30545c9803e17682fcc4b50/staging/src/k8s.io/apiextensions-apiserver/pkg/registry/customresource/strategy.go#L185) so you might be able to keep track of the last observed generation per copy with only metadata available. The only problem would be that the status subresource is enabled on the CSV CRD, so status updates don't cause metadata.generation to be incremented.
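The generation-tracking idea could look roughly like this. This is a stdlib-only sketch with hypothetical names, assuming olm-operator records the `metadata.generation` produced by its own writes; the caveat from above is reflected in the comments, since status updates through the status subresource do not bump the generation:

```go
package main

import "fmt"

// ObservedMeta models the fields a metadata-only informer does see.
type ObservedMeta struct {
	Name       string
	Generation int64
}

// lastWritten records, per copy, the metadata.generation that resulted
// from olm-operator's own most recent write.
var lastWritten = map[string]int64{}

// copyDrifted reports whether someone other than olm-operator changed
// the copy's spec: the API server increments metadata.generation on any
// update that changes non-metadata fields. Status updates do NOT bump
// the generation when the status subresource is enabled, so status
// drift on the copy would remain invisible to this check.
func copyDrifted(m ObservedMeta) bool {
	return m.Generation != lastWritten[m.Name]
}

func main() {
	// olm-operator writes the copy and records the resulting generation.
	lastWritten["my-csv"] = 2

	fmt.Println(copyDrifted(ObservedMeta{Name: "my-csv", Generation: 2})) // false: no one else touched the spec
	// A third party edits the copy's spec; the API server bumps generation.
	fmt.Println(copyDrifted(ObservedMeta{Name: "my-csv", Generation: 3})) // true: spec drift detected from metadata alone
}
```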