api icon indicating copy to clipboard operation
api copied to clipboard

OCPBUGS-32158: Add CEL validation for RegistrySources in Image API

Open muraee opened this issue 10 months ago • 24 comments

HyperShift embeds Image config API directly, so any webhook validation will not be triggered. This adds CEL validation to the CRD directly to mitigate that.

ref: https://issues.redhat.com/browse/OCPBUGS-32158

muraee avatar Apr 17 '24 15:04 muraee

@muraee: This pull request references Jira Issue OCPBUGS-32158, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

HyperShift embeds Image config API directly, so any webhook validation will not be triggered. This adds CEL validation to the CRD directly to mitigate that.

ref: https://issues.redhat.com/browse/OCPBUGS-32158

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Apr 17 '24 15:04 openshift-ci-robot

Hello @muraee! Some important instructions when contributing to openshift/api: API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

openshift-ci[bot] avatar Apr 17 '24 15:04 openshift-ci[bot]

/jira refresh

muraee avatar Apr 17 '24 15:04 muraee

@muraee: This pull request references Jira Issue OCPBUGS-32158, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Apr 17 '24 15:04 openshift-ci-robot

/retest-required

muraee avatar Apr 18 '24 08:04 muraee

/retest-required

muraee avatar Apr 24 '24 14:04 muraee

/retest-required

muraee avatar Apr 26 '24 15:04 muraee

@muraee: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp ad2927ae55c1142eb2f3a02afd74e579f801279c link false /test e2e-gcp

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci[bot] avatar Apr 26 '24 18:04 openshift-ci[bot]

/retest-required

muraee avatar Apr 30 '24 14:04 muraee

/lgtm

csrwng avatar Apr 30 '24 14:04 csrwng

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: csrwng, muraee Once this PR has been reviewed and has the lgtm label, please assign spadgett for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci[bot] avatar Apr 30 '24 14:04 openshift-ci[bot]

cc @deads2k

muraee avatar May 10 '24 14:05 muraee

looks good to me from a registry perspective 👍🏼

flavianmissi avatar May 14 '24 15:05 flavianmissi

New changes are detected. LGTM label has been removed.

openshift-ci[bot] avatar Aug 29 '24 13:08 openshift-ci[bot]

cc @JoelSpeed @deads2k could take a look please.

muraee avatar Aug 29 '24 13:08 muraee

Can you please link to where this is validated in a webhook for core OCP?

When was this CRD introduced into HyperShift, could there be existing resources that become broken by the addition of this CEL?

JoelSpeed avatar Sep 02 '24 08:09 JoelSpeed

Some code generation issues here as well, will need to rerun the CRD generation

JoelSpeed avatar Sep 02 '24 09:09 JoelSpeed

@JoelSpeed not sure where/if the webhook exist, I was just saying, if it exists then it won't run for Hypershift.

This is validated in the MCO code directly, see: https://github.com/openshift/machine-config-operator/blob/eeea7495bc40d0f73dd9c2ba030e678ec598d8b1/pkg/controller/container-runtime-config/helpers.go#L504-L506

and its also mentioned on the field description https://github.com/openshift/api/blob/85dc560939ef05322ce0b204174e9a02c5376c19/config/v1/types_image.go#L178

muraee avatar Sep 02 '24 09:09 muraee

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: csrwng, muraee Once this PR has been reviewed and has the lgtm label, please assign bparees for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci[bot] avatar Sep 02 '24 10:09 openshift-ci[bot]

When was this CRD introduced into HyperShift

It's been there since the beginning, I can see it in our API in release-4.12

could there be existing resources that become broken by the addition of this CEL?

I don't think they can exist, as this is already not allowed by MCO

muraee avatar Sep 02 '24 10:09 muraee

/retest-required

muraee avatar Sep 02 '24 14:09 muraee

It doesn't seem to be a webhook... I found two places where this validation is enforced:

  • https://github.com/openshift/machine-config-operator/blob/eeea7495bc40d0f73dd9c2ba030e678ec598d8b1/pkg/controller/container-runtime-config/helpers.go#L504-L506
  • https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/controller/build/build_controller.go#L2226-L2228

Although there might be more that I missed 🤔

flavianmissi avatar Sep 02 '24 14:09 flavianmissi

I don't think they can exist, as this is already not allowed by MCO

How does MCO prevent it? If it's not via an admission time check (webhook?) then the resources absolutely can exist. They may break MCO but they could exist in the etcd DB. Ratcheting validation may save you, but you'll need the management cluster to be on 4.18 for that, which I expect won't be happening any time soon?

JoelSpeed avatar Sep 16 '24 08:09 JoelSpeed

It doesn't seem to be a webhook... I found two places where this validation is enforced:

https://github.com/openshift/machine-config-operator/blob/eeea7495bc40d0f73dd9c2ba030e678ec598d8b1/pkg/controller/container-runtime-config/helpers.go#L504-L506 https://github.com/openshift/openshift-controller-manager/blob/master/pkg/build/controller/build/build_controller.go#L2226-L2228 Although there might be more that I missed 🤔

Would any of this cause the MCO to do degraded and become not-upgradeable?

JoelSpeed avatar Sep 16 '24 08:09 JoelSpeed

@muraee: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-upgrade-minor ad2927ae55c1142eb2f3a02afd74e579f801279c link true /test e2e-upgrade-minor

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci[bot] avatar Sep 17 '24 16:09 openshift-ci[bot]

/retest-required

muraee avatar Sep 24 '24 15:09 muraee

Would any of this cause the MCO to do degraded and become not-upgradeable?

I don't know, but I asked here: https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1727348206792439

flavianmissi avatar Sep 26 '24 10:09 flavianmissi

Tested briefly on a 4.18 cluster. It looks like it won't degrade anything in the MCO directly, but I'm pretty sure it will degrade when you upgrade the cluster (i.e. when a new controller rolls out.)

What happens is now the controller syncs will continuously error out on the broken image.config Error syncing image config openshift-config: could not Create/Update MachineConfig: could not update policy json with new changes: invalid images config: only one of AllowedRegistries or BlockedRegistries may be specified but nothing is syncing that to the pool/operator objects.

When the MCO upgrades, it will roll out a new controller, and the MCO will expect all MachineConfigs managed by the controllers (in this case, the registries configuration is what image.config eventually populates) to be updated to the new controller version. The containerruntimeconfigcontroller won't be able to do so for the image.config changes, so the MCO will eventually degrade and complain that there's a controller version mismatch for a machineconfig that it manages.

Interestingly it does look like openshift-samples degrade:

openshift-samples                          4.18.0-0.nightly-2024-09-23-182657   True        True          True       30h     Samples installation in error at 4.18.0-0.nightly-2024-09-23-182657: &errors.errorString{s:"global openshift image configuration prevents the creation of imagestreams using the registry "}

yuqi-zhang avatar Sep 26 '24 19:09 yuqi-zhang

but I'm pretty sure it will degrade when you upgrade the cluster

Did you verify this? It sounds to me like this should be a pre-upgrade check and that the operator should report upgradeable false if it determines this scenario, or, can it not be determined prior to the upgrade being initiated because all of the existing MCs are at the correct version before this change was initiated?

JoelSpeed avatar Sep 27 '24 10:09 JoelSpeed

@djoshy and I tested this, and indeed if the controller version changes (i.e. an upgrade), the pools become degraded and upgradeable=false gets set. Interestingly the CO itself doesn't seem to degrade, which I've opened a side bug here: https://issues.redhat.com/browse/OCPBUGS-42546

It sounds to me like this should be a pre-upgrade check and that the operator should report upgradeable false if it determines this scenario, or, can it not be determined prior to the upgrade being initiated because all of the existing MCs are at the correct version before this change was initiated?

The current setup is that until a new controller rolls out, the render controller isn't able to determine this. That said, the containerruntimeconfigcontroller should either be able to degrade the pool or the CO when the failure initially happens, so we don't get to this stage I think. If we're in consensus there, I can create a card to track adding that failure mode.

For this validation specifically I think having it at the CEL level would be better than the MCO behaviour currently

yuqi-zhang avatar Sep 27 '24 14:09 yuqi-zhang

Ok, I think we should proceed with the CEL, but I'd like to try and test the ratcheting scenario. Ratcheting validation was added in 1.30, so now that OCP is based on 1.31, we should be able to leverage it safely (delayed a release because CRDs are upgraded prior to API servers in some cases)

JoelSpeed avatar Oct 08 '24 16:10 JoelSpeed