
Promote OnClusterBuild featuregate to default

Open · yuqi-zhang opened this pull request 9 months ago • 14 comments

Final step to GA On Cluster Layering, after https://github.com/openshift/machine-config-operator/pull/4756 and https://github.com/openshift/api/pull/2134

See also updated enhancement: https://github.com/openshift/enhancements/pull/1515
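
For reviewers and QE checking a cluster once this lands: with the gate in the Default feature set, it should show up as enabled in the cluster FeatureGate status even without TechPreviewNoUpgrade. A minimal sketch of that check using openshift client-go (kubeconfig handling is simplified and the snippet is illustrative, not part of this PR):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"

	configclient "github.com/openshift/client-go/config/clientset/versioned"
)

func main() {
	// Build a client from the default kubeconfig location (simplified).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// The cluster-scoped FeatureGate singleton is always named "cluster".
	fg, err := client.ConfigV1().FeatureGates().Get(context.TODO(), "cluster", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// status.featureGates lists the enabled/disabled gates per payload version.
	for _, details := range fg.Status.FeatureGates {
		for _, enabled := range details.Enabled {
			if string(enabled.Name) == "OnClusterBuild" {
				fmt.Printf("OnClusterBuild is enabled for version %s\n", details.Version)
			}
		}
	}
}
```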

yuqi-zhang avatar Feb 04 '25 00:02 yuqi-zhang

Hello @yuqi-zhang! Some important instructions when contributing to openshift/api: API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

openshift-ci[bot] avatar Feb 04 '25 00:02 openshift-ci[bot]

/hold

Currently for testing

yuqi-zhang avatar Feb 05 '25 00:02 yuqi-zhang

Given the absence of automated testing reporting into Component Readiness, there is a desire to make an exception for this feature and promote it within 4.19 based on QE testing alone.

Could we please get a write-up on this PR detailing the existing testing that QE has been completing, with links to historical pass rates where available?

I would also appreciate it if we could link out to any trackers for the future automated regression testing that's in the pipeline.

JoelSpeed avatar Mar 20 '25 12:03 JoelSpeed

The verify failure in the schema checker is an issue with the alpha API; we will override that issue.

/test e2e-aws-ovn-hypershift
/test minor-e2e-upgrade-minor

JoelSpeed avatar Mar 20 '25 12:03 JoelSpeed

@yuqi-zhang @cheesesashimi The failures in the minor upgrade look like they are legitimate; can you please investigate?

JoelSpeed avatar Mar 20 '25 16:03 JoelSpeed

Rebased on master just in case. It looks like in the failed upgrade the MCO hasn't started upgrading yet (?), and the operator logs are failing on:

E0320 14:53:14.386733       1 reflector.go:158] "Unhandled Error" err="github.com/openshift/machine-config-operator/pkg/operator/operator.go:390: Failed to watch *v1alpha1.MachineOSConfig: failed to list *v1alpha1.MachineOSConfig: the server could not find the requested resource (get machineosconfigs.machineconfiguration.openshift.io)"

Checking if that was caused by outdated CRDs.
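
One quick way to confirm whether the v1alpha1 resource is actually being served at that point is to ask the discovery API directly; a rough sketch (kubeconfig handling simplified, group/version string taken from the error above):

```go
package main

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig location (simplified).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// A "could not find the requested resource" error from this call would
	// match the reflector failure in the operator logs above.
	resources, err := dc.ServerResourcesForGroupVersion("machineconfiguration.openshift.io/v1alpha1")
	if err != nil {
		fmt.Printf("group/version not served: %v\n", err)
		return
	}
	for _, r := range resources.APIResources {
		if r.Name == "machineosconfigs" {
			fmt.Println("machineosconfigs is served; the CRD is present")
			return
		}
	}
	fmt.Println("group/version is served, but machineosconfigs was not found")
}
```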

The kube-rbac-proxy failures should also be old, since the pods should have the proper SCC labels now

yuqi-zhang avatar Mar 21 '25 01:03 yuqi-zhang

/retest

JoelSpeed avatar Mar 21 '25 10:03 JoelSpeed

ci/prow/minor-e2e-upgrade-minor seems to be testing a 4.17->4.18 upgrade. Is that expected? I would have expected it to test 4.18->4.19.

yuqi-zhang avatar Mar 21 '25 14:03 yuqi-zhang

  • Jobs running MCO test cases in a cluster with OCL enabled in master and worker pools

periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-nightly-aws-ipi-longrun-mco-tp-ocl-p3-f7

periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-nightly-aws-ipi-longrun-mco-tp-ocl-p2-f7

periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-nightly-aws-ipi-longrun-mco-tp-ocl-p1-f7

These jobs are currently impacted by this issue: OCPBUGS-53408 In OCL. error: Old and new refs are equal

  • Job running openshift e2e tests in a cluster with OCL configured in master and worker pools

periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-nightly-aws-ipi-tp-ocl-f7

An OCL cluster does not behave exactly the same as a non-OCL cluster, so some test cases may fail in an OCL cluster because of that, especially the ones that scale up new nodes (OCL needs an extra reboot) or configure registries.conf (OCL clusters reboot the nodes, but non-OCL clusters don't).

  • Job running an upgrade in a cluster with OCL configured in master and worker pools

periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-nightly-4.19-upgrade-from-stable-4.19-aws-ipi-ocl-fips-f60

Same case as above: since an OCL cluster doesn't behave exactly like a non-OCL cluster, some cases will never pass, especially the test cases involving scaling up new nodes (OCL needs an extra reboot to apply the new image) and those configuring the registries.conf file (OCL clusters reboot the nodes in this scenario, but non-OCL clusters do not).

  • Jobs running all MCO test cases in clusters with Techpreview. Specific OCL tests included

These jobs have been reconfigured to run daily

IPI on GCP, AMD64, TP

periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-nightly-gcp-ipi-longduration-tp-mco-p1-f9

periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-nightly-gcp-ipi-longduration-tp-mco-p2-f9

periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-nightly-gcp-ipi-longduration-tp-mco-p3-f9

IPI on AWS, ARM64, PROXY, FIPS, TP

These jobs are currently impacted by OCPBUGS-49894 In OCL. Disabling OCL process is not working in clusters with proxy

Unfortunately this issue breaks the cluster and we cannot recover it, so once our tests hit this issue, all subsequent tests report failures or cannot be executed.

periodic-ci-openshift-openshift-tests-private-release-4.19-arm64-nightly-aws-ipi-longrun-mco-tp-proxy-fips-p1-f28

periodic-ci-openshift-openshift-tests-private-release-4.19-arm64-nightly-aws-ipi-longrun-mco-tp-proxy-fips-p2-f28

periodic-ci-openshift-openshift-tests-private-release-4.19-arm64-nightly-aws-ipi-longrun-mco-tp-proxy-fips-p3-f28

I have just launched all the jobs with the latest nightly build.

sergiordlr avatar Mar 21 '25 15:03 sergiordlr

Thanks for the writeup, this is useful context!

ci/prow/minor-e2e-upgrade-minor seems to be testing a 4.17->4.18 upgrade. Is that expected? I would have expected it to test 4.18->4.19.

No, that's an error; this hasn't been updated since branching. We really need a way to automate this, and I will get on it.

These jobs are currently impacted by this issue: OCPBUGS-53408 In OCL. error: Old and new refs are equal

Based on feedback from some promotions in the previous release, we would ideally see a week of clean, daily CI runs prior to merging. We should aim to prioritize fixing this ASAP (and the other issue mentioned later).

An OCL cluster does not behave exactly the same as a non-OCL cluster, so some test cases may fail in an OCL cluster because of that, especially the ones that scale up new nodes (OCL needs an extra reboot) or configure registries.conf (OCL clusters reboot the nodes, but non-OCL clusters don't).

How do you ensure you understand this signal when there are so many false positives? Is work being done to mitigate the false positives and resolve the differences between the OCL and non-OCL cases?

Jobs running all MCO test cases in clusters with Techpreview. Specific OCL tests included. These jobs have been reconfigured to run daily

Once promoted, we will also need the stable cluster configuration to run daily; has this also been adjusted?

JoelSpeed avatar Mar 24 '25 10:03 JoelSpeed

/test minor-e2e-upgrade-minor

JoelSpeed avatar Mar 24 '25 14:03 JoelSpeed

E2E minor is now doing the correct thing; let's see if it works this time!

JoelSpeed avatar Mar 24 '25 14:03 JoelSpeed

/retest

JoelSpeed avatar Apr 07 '25 14:04 JoelSpeed

Adding some notes here from conversations so that we capture them within the promotion process:

@cheesesashimi has gathered pass rates for the last week and put them together in a sheet.

Looking at this, and based on our conversation in Slack, we can see that our top 5 tests are showing a minimum 85% pass rate, and that we have on average 5 runs per platform right now.

For reference, we would normally be asking for at least 5 tests, and for all tests to attain a 95% pass rate across 6 platforms with 14 runs over 2 weeks to promote any regular feature.
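
For anyone reproducing the comparison against those criteria, the arithmetic is just passes over total runs per test and platform; a minimal sketch (the test names and numbers below are invented for illustration, not taken from the sheet):

```go
package main

import "fmt"

// runResult is a hypothetical record of one CI run of one test on one platform.
type runResult struct {
	Test     string
	Platform string
	Passed   bool
}

func main() {
	// Invented sample data; the real numbers live in the pass-rate sheet.
	runs := []runResult{
		{"TestOCLBuildAndRollout", "aws", true},
		{"TestOCLBuildAndRollout", "aws", true},
		{"TestOCLBuildAndRollout", "aws", false},
		{"TestOCLBuildAndRollout", "gcp", true},
		{"TestOCLBuildAndRollout", "gcp", true},
	}

	type key struct{ test, platform string }
	passes := map[key]int{}
	totals := map[key]int{}
	for _, r := range runs {
		k := key{r.Test, r.Platform}
		totals[k]++
		if r.Passed {
			passes[k]++
		}
	}

	// The usual promotion bar is a 95% pass rate; flag anything below it.
	const bar = 0.95
	for k, total := range totals {
		rate := float64(passes[k]) / float64(total)
		status := "ok"
		if rate < bar {
			status = "below bar"
		}
		fmt.Printf("%s on %s: %.0f%% over %d runs (%s)\n", k.test, k.platform, rate*100, total, status)
	}
}
```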

It would be good to understand what is causing the tests to fail, are there issues with the feature itself, or can they be attributed to other failures?

JoelSpeed avatar Apr 08 '25 16:04 JoelSpeed

/skip
/retest-required

JoelSpeed avatar Apr 14 '25 09:04 JoelSpeed

Rebased just for good measure

yuqi-zhang avatar Apr 16 '25 15:04 yuqi-zhang

/retest

yuqi-zhang avatar Apr 17 '25 11:04 yuqi-zhang

/retest

yuqi-zhang avatar Apr 17 '25 20:04 yuqi-zhang

/retest

cheesesashimi avatar Apr 18 '25 15:04 cheesesashimi

/test e2e-aws-serial-techpreview e2e-upgrade minor-e2e-upgrade-minor e2e-aws-serial e2e-aws-ovn e2e-aws-ovn-techpreview e2e-upgrade-out-of-change

I believe all of the CI issues have cleared up, so let's see if we can get some of those to pass.

sdodson avatar Apr 18 '25 19:04 sdodson

The verify-crd-schema failure seems to be related to the fact that we renamed a file to machineconfiguration/v1alpha1/zz_generated.crd-manifests/0000_80_machine-config_01_machineosbuilds.crd.yaml. According to the error message output by the script, overriding this job failure is appropriate in this scenario:

This verifier checks all files that have changed. In some cases you may have changed or renamed a file that already contained api violations, but you are not introducing a new violation. In such cases it is appropriate to /override the failing CI job.

cheesesashimi avatar Apr 18 '25 19:04 cheesesashimi

/override ci/prow/verify-crd-schema

As explained above, this happened through a rename of an existing file. These are from existing problems that we cannot rectify without widespread consideration for the impacts

JoelSpeed avatar Apr 21 '25 12:04 JoelSpeed

@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/verify-crd-schema

In response to this:

/override ci/prow/verify-crd-schema

As explained above, this happened through a rename of an existing file. These are from existing problems that we cannot rectify without widespread consideration for the impacts

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci[bot] avatar Apr 21 '25 12:04 openshift-ci[bot]

/test e2e-aws-serial-techpreview e2e-upgrade minor-e2e-upgrade-minor e2e-aws-serial e2e-aws-ovn e2e-aws-ovn-techpreview e2e-upgrade-out-of-change

Try rerunning these since the failures did not appear related to this PR.

cheesesashimi avatar Apr 21 '25 17:04 cheesesashimi

/test e2e-aws-serial-techpreview e2e-upgrade minor-e2e-upgrade-minor e2e-aws-serial e2e-aws-ovn e2e-aws-ovn-techpreview e2e-upgrade-out-of-change

cheesesashimi avatar Apr 22 '25 12:04 cheesesashimi

The serial suite is pretty broken at the moment. Since the suite passed on Friday, modulo a deprovision timeout, I think we can override this one.

/override ci/prow/e2e-aws-serial-techpreview

JoelSpeed avatar Apr 22 '25 16:04 JoelSpeed

@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/e2e-aws-serial-techpreview

In response to this:

The serial suite is pretty broken at the moment. Since the suite passed on Friday, modulo a deprovision timeout, I think we can override this one.

/override ci/prow/e2e-aws-serial-techpreview

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci[bot] avatar Apr 22 '25 16:04 openshift-ci[bot]

/override ci/prow/e2e-aws-serial

This passed yesterday, and the base of main has only moved for a tooling change since, so I'm confident that this wouldn't have re-run if we hadn't merged that tooling change

JoelSpeed avatar Apr 22 '25 17:04 JoelSpeed

@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/e2e-aws-serial

In response to this:

/override ci/prow/e2e-aws-serial

This passed yesterday, and the base of main has only moved for a tooling change since, so I'm confident that this wouldn't have re-run if we hadn't merged that tooling change

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci[bot] avatar Apr 22 '25 17:04 openshift-ci[bot]

/override ci/prow/minor-e2e-upgrade-minor

This passed on Friday, but the tooling change meant it had to be re-run. I'm confident the tooling changes haven't impacted this result and so the value from Friday should still be valid

JoelSpeed avatar Apr 22 '25 17:04 JoelSpeed