OCPBUGS-32158: Add CEL validation for RegistrySources in Image API
HyperShift embeds Image config API directly, so any webhook validation will not be triggered. This adds CEL validation to the CRD directly to mitigate that.
ref: https://issues.redhat.com/browse/OCPBUGS-32158
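For context, the rule being added is the mutual exclusion between AllowedRegistries and BlockedRegistries. A minimal sketch of the idea, assuming a simplified stand-in type and a hypothetical marker (neither is the exact patch):

```go
package main

import "fmt"

// RegistrySources is a simplified stand-in for the config/v1 Image API type
// (the real one lives in openshift/api config/v1/types_image.go). The PR
// encodes the mutual-exclusion rule as a CEL XValidation on the CRD, roughly
// like this (hypothetical, not verbatim) kubebuilder marker:
//
//	+kubebuilder:validation:XValidation:rule="has(self.blockedRegistries) ? !has(self.allowedRegistries) : true",message="Only one of BlockedRegistries or AllowedRegistries may be set"
type RegistrySources struct {
	AllowedRegistries []string
	BlockedRegistries []string
}

// validate mirrors in plain Go what the CEL rule would reject at admission time.
func validate(rs RegistrySources) error {
	if len(rs.AllowedRegistries) > 0 && len(rs.BlockedRegistries) > 0 {
		return fmt.Errorf("only one of AllowedRegistries or BlockedRegistries may be specified")
	}
	return nil
}

func main() {
	both := RegistrySources{
		AllowedRegistries: []string{"quay.io"},
		BlockedRegistries: []string{"docker.io"},
	}
	// Both lists set: rejected.
	fmt.Println(validate(both))
	// Only one list set: allowed.
	fmt.Println(validate(RegistrySources{BlockedRegistries: []string{"docker.io"}}))
}
```

The difference from the existing controller-side checks is that a CEL rule rejects the object at admission time, before it is persisted.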
@muraee: This pull request references Jira Issue OCPBUGS-32158, which is invalid:
- expected the bug to target the "4.16.0" version, but no target version was set
Comment /jira refresh
to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.
The bug has been updated to refer to the pull request using the external bug tracker.
In response to this:
HyperShift embeds Image config API directly, so any webhook validation will not be triggered. This adds CEL validation to the CRD directly to mitigate that.
ref: https://issues.redhat.com/browse/OCPBUGS-32158
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Hello @muraee! Some important instructions when contributing to openshift/api: API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.
/jira refresh
@muraee: This pull request references Jira Issue OCPBUGS-32158, which is valid. The bug has been moved to the POST state.
3 validation(s) were run on this bug
- bug is open, matching expected state (open)
- bug target version (4.16.0) matches configured target version for branch (4.16.0)
- bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.
In response to this:
/jira refresh
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/retest-required
/retest-required
/retest-required
@muraee: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-gcp | ad2927ae55c1142eb2f3a02afd74e579f801279c | link | false | /test e2e-gcp |
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest-required
/lgtm
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: csrwng, muraee Once this PR has been reviewed and has the lgtm label, please assign spadgett for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
cc @deads2k
looks good to me from a registry perspective 👍🏼
New changes are detected. LGTM label has been removed.
cc @JoelSpeed @deads2k, could you take a look please?
Can you please link to where this is validated in a webhook for core OCP?
When was this CRD introduced into HyperShift, could there be existing resources that become broken by the addition of this CEL?
Some code generation issues here as well, will need to rerun the CRD generation
@JoelSpeed I'm not sure where/if the webhook exists; I was just saying that if it exists, it won't run for HyperShift.
This is validated in the MCO code directly, see: https://github.com/openshift/machine-config-operator/blob/eeea7495bc40d0f73dd9c2ba030e678ec598d8b1/pkg/controller/container-runtime-config/helpers.go#L504-L506
and it's also mentioned in the field description: https://github.com/openshift/api/blob/85dc560939ef05322ce0b204174e9a02c5376c19/config/v1/types_image.go#L178
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: csrwng, muraee Once this PR has been reviewed and has the lgtm label, please assign bparees for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
When was this CRD introduced into HyperShift
It's been there since the beginning; I can see it in our API in release-4.12.
could there be existing resources that become broken by the addition of this CEL?
I don't think they can exist, as this is already not allowed by MCO
/retest-required
It doesn't seem to be a webhook... I found two places where this validation is enforced:
- https://github.com/openshift/machine-config-operator/blob/eeea7495bc40d0f73dd9c2ba030e678ec598d8b1/pkg/controller/container-runtime-config/helpers.go#L504-L506
- https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/controller/build/build_controller.go#L2226-L2228
Although there might be more that I missed 🤔
I don't think they can exist, as this is already not allowed by MCO
How does MCO prevent it? If it's not via an admission time check (webhook?) then the resources absolutely can exist. They may break MCO but they could exist in the etcd DB. Ratcheting validation may save you, but you'll need the management cluster to be on 4.18 for that, which I expect won't be happening any time soon?
It doesn't seem to be a webhook... I found two places where this validation is enforced:
https://github.com/openshift/machine-config-operator/blob/eeea7495bc40d0f73dd9c2ba030e678ec598d8b1/pkg/controller/container-runtime-config/helpers.go#L504-L506 https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/controller/build/build_controller.go#L2226-L2228 Although there might be more that I missed 🤔
Would any of this cause the MCO to go degraded and become not-upgradeable?
@muraee: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-upgrade-minor | ad2927ae55c1142eb2f3a02afd74e579f801279c | link | true | /test e2e-upgrade-minor |
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
/retest-required
Would any of this cause the MCO to go degraded and become not-upgradeable?
I don't know, but I asked here: https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1727348206792439
Tested briefly on a 4.18 cluster. It looks like it won't degrade anything in the MCO directly, but I'm pretty sure it will degrade when you upgrade the cluster (i.e. when a new controller rolls out).
What happens now is that the controller syncs continuously error out on the broken image.config:
Error syncing image config openshift-config: could not Create/Update MachineConfig: could not update policy json with new changes: invalid images config: only one of AllowedRegistries or BlockedRegistries may be specified
but nothing is syncing that to the pool/operator objects.
When the MCO upgrades, it will roll out a new controller, and the MCO will expect all MachineConfigs managed by the controllers (in this case, the registries configuration is what image.config eventually populates) to be updated to the new controller version. The containerruntimeconfigcontroller won't be able to do so for the image.config changes, so the MCO will eventually degrade and complain that there's a controller version mismatch for a machineconfig that it manages.
Interestingly, it does look like openshift-samples degrades:
openshift-samples 4.18.0-0.nightly-2024-09-23-182657 True True True 30h Samples installation in error at 4.18.0-0.nightly-2024-09-23-182657: &errors.errorString{s:"global openshift image configuration prevents the creation of imagestreams using the registry "}
but I'm pretty sure it will degrade when you upgrade the cluster
Did you verify this? It sounds to me like this should be a pre-upgrade check and that the operator should report upgradeable false if it determines this scenario, or, can it not be determined prior to the upgrade being initiated because all of the existing MCs are at the correct version before this change was initiated?
@djoshy and I tested this, and indeed if the controller version changes (i.e. an upgrade), the pools become degraded and upgradeable=false gets set. Interestingly, the CO itself doesn't seem to degrade, for which I've opened a side bug: https://issues.redhat.com/browse/OCPBUGS-42546
It sounds to me like this should be a pre-upgrade check and that the operator should report upgradeable false if it determines this scenario, or, can it not be determined prior to the upgrade being initiated because all of the existing MCs are at the correct version before this change was initiated?
The current setup is that until a new controller rolls out, the render controller isn't able to determine this. That said, the containerruntimeconfigcontroller should either be able to degrade the pool or the CO when the failure initially happens, so we don't get to this stage I think. If we're in consensus there, I can create a card to track adding that failure mode.
For this validation specifically, I think having it at the CEL level would be better than the current MCO behaviour.
Ok, I think we should proceed with the CEL, but I'd like to try and test the ratcheting scenario. Ratcheting validation was added in 1.30, so now that OCP is based on 1.31, we should be able to leverage it safely (we delayed a release because CRDs are upgraded prior to API servers in some cases).
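For reference, ratcheting applies to rules expressed in the CRD schema's x-kubernetes-validations. A hedged sketch of what the generated schema fragment could look like (field names assumed from the API discussion above, not copied from the actual generated CRD):

```yaml
# Sketch only. Under CRDValidationRatcheting (beta in Kubernetes 1.30), an
# update that leaves an already-invalid but unchanged field untouched is not
# rejected by this rule, so pre-existing broken resources stay updatable
# elsewhere instead of being bricked by the new validation.
openAPIV3Schema:
  properties:
    spec:
      properties:
        registrySources:
          type: object
          x-kubernetes-validations:
            - rule: "has(self.blockedRegistries) ? !has(self.allowedRegistries) : true"
              message: "Only one of BlockedRegistries or AllowedRegistries may be set"
```

This is why the management cluster's Kubernetes version matters for the HyperShift case: without ratcheting, the CEL rule would also block unrelated updates to any existing object that already sets both lists.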