Promote OnClusterBuild featuregate to default
Final step to GA On Cluster Layering, after https://github.com/openshift/machine-config-operator/pull/4756 and https://github.com/openshift/api/pull/2134
See also updated enhancement: https://github.com/openshift/enhancements/pull/1515
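For context, the sketch below is a minimal illustration of what "promote to default" means in practice: the gate goes from being enabled only in TechPreviewNoUpgrade to also being enabled in the Default feature set. It is not the actual openshift/api registration code; everything apart from the gate and feature-set names is assumed for illustration.

```go
// Minimal illustrative model only, not the openshift/api implementation.
package main

import "fmt"

// FeatureSet stands in for the cluster-level feature sets discussed in this PR.
type FeatureSet string

const (
	Default              FeatureSet = "Default"
	TechPreviewNoUpgrade FeatureSet = "TechPreviewNoUpgrade"
)

// enabledIn records which feature sets turn a given gate on.
var enabledIn = map[string][]FeatureSet{
	// Before this PR, OnClusterBuild is only on in TechPreviewNoUpgrade
	// clusters. Promoting it "to default" means also enabling it in the
	// Default feature set, so ordinary clusters get the feature as well.
	"OnClusterBuild": {TechPreviewNoUpgrade, Default},
}

func main() {
	for gate, sets := range enabledIn {
		fmt.Printf("%s enabled in: %v\n", gate, sets)
	}
}
```

In the real repository this is expressed through the feature-gate registrations rather than a standalone program, but the effect on which clusters see OnClusterBuild enabled is the same.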
Hello @yuqi-zhang! Some important instructions when contributing to openshift/api: API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.
/hold
Holding for now while testing is ongoing.
Given the absence of automated testing reporting into component readiness, there is a desire to make an exception for this feature and promote it within 4.19 based on QE testing alone.
Could we please get a write-up on this PR detailing the existing testing that QE have been completing, with links to historical pass rates where available?
I would also appreciate it if we can link out to any trackers for the future automated regression testing that's in the pipeline.
The verify failure in the schema checker is an issue with the alpha API; we will override it.
/test e2e-aws-ovn-hypershift
/test minor-e2e-upgrade-minor
@yuqi-zhang @cheesesashimi The failures in the minor upgrade look legitimate; can you please investigate?
Rebased on master just in case. It looks like in the failed upgrade the MCO hasn't started upgrading yet (?), and the operator logs are failing with:
E0320 14:53:14.386733 1 reflector.go:158] "Unhandled Error" err="github.com/openshift/machine-config-operator/pkg/operator/operator.go:390: Failed to watch *v1alpha1.MachineOSConfig: failed to list *v1alpha1.MachineOSConfig: the server could not find the requested resource (get machineosconfigs.machineconfiguration.openshift.io)"
Checking whether that was caused by outdated CRDs.
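For reference, one quick way to confirm that is to ask API discovery whether the machineconfiguration.openshift.io/v1alpha1 group/version from the error is actually being served before the operator's informers try to watch it. The sketch below is illustrative only (plain client-go discovery calls, assuming KUBECONFIG points at the affected cluster), not MCO code:

```go
// Illustrative check only, not MCO code. It asks API discovery whether the
// machineconfiguration.openshift.io/v1alpha1 group/version (from the error
// above) is served, which is what the failing reflector depends on.
package main

import (
	"fmt"
	"os"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes KUBECONFIG points at the cluster under investigation.
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		panic(err)
	}

	const gv = "machineconfiguration.openshift.io/v1alpha1"
	resources, err := dc.ServerResourcesForGroupVersion(gv)
	if err != nil {
		// A not-found error here matches the operator log: the CRD serving
		// this group/version is missing or outdated.
		fmt.Printf("%s is not served: %v\n", gv, err)
		return
	}
	for _, r := range resources.APIResources {
		fmt.Printf("%s serves %s\n", gv, r.Name)
	}
}
```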
The kube-rbac-proxy failures should also be old, since the pods should have the proper SCC labels now
/retest
ci/prow/minor-e2e-upgrade-minor seems to be testing 4.17->4.18 upgrade, is that expected? I would have expected this to test 4.18->4.19
- Jobs running MCO test cases in a cluster with OCL enabled in master and worker pools
These jobs are currently impacted by OCPBUGS-53408 ("In OCL. error: Old and new refs are equal").
- Job running openshift e2e tests in a cluster with OCL configured in master and worker pools
periodic-ci-openshift-openshift-tests-private-release-4.19-amd64-nightly-aws-ipi-tp-ocl-f7
An OCL cluster does not behave exactly the same as a non-OCL cluster, so some test cases may fail in an OCL cluster because of that, especially the ones that scale up new nodes (OCL needs an extra reboot) or configure registries.conf (OCL clusters reboot the nodes, but non-OCL clusters don't).
- Job running an upgrade in a cluster with OCL configured in master and worker pools
Same as above: since an OCL cluster doesn't behave exactly like a non-OCL cluster, some cases will never pass, especially the test cases involving scaling up new nodes (OCL needs an extra reboot to apply the new image) and those configuring the registries.conf file (OCL clusters reboot the nodes in this scenario, but non-OCL clusters do not). A sketch of how a test can account for these extra reboots follows this write-up.
- Jobs running all MCO test cases in clusters with TechPreview enabled; specific OCL tests included
These jobs have been reconfigured to run daily
IPI on GCP, AMD64, TP
IPI on AWS, ARM64, PROXY, FIPS, TP
These jobs are currently impacted by OCPBUGS-49894 ("In OCL. Disabling OCL process is not working in clusters with proxy").
Unfortunately this issue breaks the cluster and we cannot recover it, so once our tests hit this issue, all subsequent tests report failures or cannot be executed at all.
I have just launched all the jobs with the latest nightly build.
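To make the reboot differences mentioned above concrete, here is a small illustrative sketch (not code from these jobs) of how a test helper might adjust its expected reboot count depending on whether OCL is enabled; the scenario names and baseline counts are assumptions for illustration.

```go
// Illustrative helper only, not code from the QE jobs above. It captures the
// behavioural difference described in the write-up: OCL pools take an extra
// reboot when scaling up nodes (to apply the built image) and when
// registries.conf changes, while non-OCL pools do not reboot for the latter.
package main

import "fmt"

// Scenario names are assumptions made for this sketch.
type Scenario string

const (
	ScaleUpNode          Scenario = "scale-up-node"
	ChangeRegistriesConf Scenario = "change-registries.conf"
)

// expectedReboots starts from a baseline reboot count for the non-OCL case
// (the exact baseline depends on the scenario and is assumed by the caller)
// and adds the extra OCL reboot described in the write-up.
func expectedReboots(baseline int, s Scenario, oclEnabled bool) int {
	if !oclEnabled {
		return baseline
	}
	switch s {
	case ScaleUpNode, ChangeRegistriesConf:
		return baseline + 1
	}
	return baseline
}

func main() {
	// Example: non-OCL clusters apply registries.conf without rebooting
	// (baseline 0), while an OCL cluster reboots once for the same change.
	fmt.Println("non-OCL registries.conf reboots:", expectedReboots(0, ChangeRegistriesConf, false))
	fmt.Println("OCL registries.conf reboots:", expectedReboots(0, ChangeRegistriesConf, true))
	fmt.Println("OCL scale-up reboots (baseline 1):", expectedReboots(1, ScaleUpNode, true))
}
```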
Thanks for the writeup, this is useful context!
ci/prow/minor-e2e-upgrade-minor seems to be testing 4.17->4.18 upgrade, is that expected? I would have expected this to test 4.18->4.19
No, that's an error: this hasn't been updated since branching. We really need a way to automate this; I will get on it.
These jobs are currently impacted by OCPBUGS-53408 ("In OCL. error: Old and new refs are equal").
Based on feedback from some promotions in the previous release, we would ideally see a week of clean, daily runs of the CI prior to merging. We should aim to prioritize fixing this ASAP (and the other issue mentioned later).
An OCL cluster does not behave exactly the same as a non-OCL cluster, so some test cases may fail in an OCL cluster because of that, especially the ones that scale up new nodes (OCL needs an extra reboot) or configure registries.conf (OCL clusters reboot the nodes, but non-OCL clusters don't).
How do you ensure you understand this signal when there are so many false positives? Is work being done to mitigate the false positives and resolve the differences between the OCL and non-OCL cases?
Jobs running all MCO test cases in clusters with TechPreview enabled; specific OCL tests included. These jobs have been reconfigured to run daily
Once promoted, we will also need the stable cluster configuration to run daily, has this also been adjusted?
/test minor-e2e-upgrade-minor
E2E minor is now doing the correct thing, let's see if it works this time!
/retest
Adding some notes here from conversations so that we capture them within the promotion process:
@cheesesashimi has gathered pass rates for the last week and put them together in a sheet.
Looking at this, and based on our conversation in Slack, we can see that our top 5 tests are showing a minimum 85% pass rate, and that we have on average 5 runs per platform right now.
For reference, we would normally be asking for at least 5 tests, and for all tests to attain a 95% pass rate across 6 platforms with 14 runs over 2 weeks to promote any regular feature.
It would be good to understand what is causing the tests to fail: are there issues with the feature itself, or can they be attributed to other failures?
/skip
/retest-required
Rebased just for good measure
/retest
/retest
/retest
/test e2e-aws-serial-techpreview e2e-upgrade minor-e2e-upgrade-minor e2e-aws-serial e2e-aws-ovn e2e-aws-ovn-techpreview e2e-upgrade-out-of-change
I believe all of the CI issues have cleared up, so let's see if we can get some of those to pass.
The verify-crd-schema failure seems to be related to the fact that we renamed a file to machineconfiguration/v1alpha1/zz_generated.crd-manifests/0000_80_machine-config_01_machineosbuilds.crd.yaml. According to the error message output by the script, overriding this job failure is appropriate in this scenario:
This verifier checks all files that have changed. In some cases you may have changed or renamed a file that already contained api violations, but you are not introducing a new violation. In such cases it is appropriate to /override the failing CI job.
/override ci/prow/verify-crd-schema
As explained above, this happened through a rename of an existing file. These are pre-existing problems that we cannot rectify without wider consideration of the impacts.
@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/verify-crd-schema
In response to this:
/override ci/prow/verify-crd-schema
As explained above, this happened through a rename of an existing file. These are pre-existing problems that we cannot rectify without wider consideration of the impacts.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/test e2e-aws-serial-techpreview e2e-upgrade minor-e2e-upgrade-minor e2e-aws-serial e2e-aws-ovn e2e-aws-ovn-techpreview e2e-upgrade-out-of-change
Try rerunning these since the failures did not appear related to this PR.
/test e2e-aws-serial-techpreview e2e-upgrade minor-e2e-upgrade-minor e2e-aws-serial e2e-aws-ovn e2e-aws-ovn-techpreview e2e-upgrade-out-of-change
The serial suite is pretty broken at the moment. Since the suite passed on Friday, modulo a deprovision timeout, I think we can override this one.
/override ci/prow/e2e-aws-serial-techpreview
@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/e2e-aws-serial-techpreview
In response to this:
The serial suite is pretty broken at the moment. Since the suite passed on Friday, modulo a deprovision timeout, I think we can override this one.
/override ci/prow/e2e-aws-serial-techpreview
/override ci/prow/e2e-aws-serial
This passed yesterday, and the base of main has only moved for a tooling change since, so I'm confident that this wouldn't have re-run if we hadn't merged that tooling change
@JoelSpeed: Overrode contexts on behalf of JoelSpeed: ci/prow/e2e-aws-serial
In response to this:
/override ci/prow/e2e-aws-serial
This passed yesterday, and the base of main has only moved for a tooling change since, so I'm confident that this wouldn't have re-run if we hadn't merged that tooling change
/override ci/prow/minor-e2e-upgrade-minor
This passed on Friday, but the tooling change meant it had to be re-run. I'm confident the tooling changes haven't impacted this result and so the value from Friday should still be valid