CORS-3594: Setting CAPG as the default infra provider

Open barbacbd opened this issue 1 year ago • 19 comments

CAPG should be used as the default infra provider for GCP installs.

barbacbd avatar Jul 10 '24 18:07 barbacbd

@barbacbd: This pull request references CORS-3594 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.17.0" version, but no target version was set.

In response to this:

CAPG should be used as the default infra provider for GCP installs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Jul 10 '24 18:07 openshift-ci-robot

Hello @barbacbd! Some important instructions when contributing to openshift/api: API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

openshift-ci[bot] avatar Jul 10 '24 18:07 openshift-ci[bot]

/label platform/google

barbacbd avatar Jul 10 '24 18:07 barbacbd

/cc @patrickdillon
/cc @r4f4
/cc @bfournie

barbacbd avatar Jul 10 '24 18:07 barbacbd

@barbacbd: The label(s) platform/google cannot be applied, because the repository doesn't have them.

In response to this:

/label platform/google

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci[bot] avatar Jul 10 '24 18:07 openshift-ci[bot]

/retest-required

barbacbd avatar Jul 11 '24 17:07 barbacbd

/test verify

Now that https://github.com/openshift/api/pull/1909 has merged, it might be ready. If not, try again in an hour.

deads2k avatar Jul 15 '24 17:07 deads2k

#1909 fixed the tests, which are now failing with:

 INSUFFICIENT CI testing for "ClusterAPIInstallGCP".
F0715 17:49:34.051041  169158 root.go:64] Error running codegen: error: "install should succeed: infrastructure" only passed 71%, need at least 95% for "ClusterAPIInstallGCP" on {gcp amd64 ha} 

The figure 71% seems off to me. That is, I don't think the infrastructure provisioning success rate is that low. I'm not sure where the discrepancy is coming from.

I'm reviewing the GCP Tech preview installs here: https://sippy.dptools.openshift.org/sippy-ng/jobs/4.17/runs?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-techpreview%22%7D%5D%7D&pageSize=100&sort=desc&sortField=timestamp
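
For anyone who wants to adjust that query: the filters parameter in the link above decodes to a small JSON object with a single items entry (columnField, operatorValue, value). A minimal Go sketch that rebuilds the same URL is below. The helper jobRunsURL and the struct names are hypothetical, made up for illustration, and are not part of Sippy.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// filterItem mirrors one entry of the "items" array in the decoded
// filters parameter. The struct itself is an illustrative assumption.
type filterItem struct {
	ColumnField   string `json:"columnField"`
	OperatorValue string `json:"operatorValue"`
	Value         string `json:"value"`
}

// filterSet is the top-level JSON object carried by the filters parameter.
type filterSet struct {
	Items []filterItem `json:"items"`
}

// jobRunsURL is a hypothetical helper that builds a job-runs link for one
// exact job name, using the same query parameters as the link above.
func jobRunsURL(release, jobName string) (string, error) {
	raw, err := json.Marshal(filterSet{Items: []filterItem{{
		ColumnField:   "name",
		OperatorValue: "equals",
		Value:         jobName,
	}}})
	if err != nil {
		return "", err
	}

	q := url.Values{}
	q.Set("filters", string(raw))
	q.Set("pageSize", "100")
	q.Set("sort", "desc")
	q.Set("sortField", "timestamp")

	u := url.URL{
		Scheme:   "https",
		Host:     "sippy.dptools.openshift.org",
		Path:     "/sippy-ng/jobs/" + release + "/runs",
		RawQuery: q.Encode(), // Encode sorts keys and percent-encodes the JSON
	}
	return u.String(), nil
}

func main() {
	u, err := jobRunsURL("4.17", "periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-techpreview")
	if err != nil {
		panic(err)
	}
	fmt.Println(u)
}
```

Swapping the job name or release in the call is enough to point the same view at a different periodic job.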

Reviewing these failures, the significant one I see is the credentials request failure which recurs multiple times, including this example: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-techpreview/1806798263932686336

That issue was not related to ClusterAPIInstallGCP and was fixed in: https://issues.redhat.com/browse/OCPBUGS-36294

The only issue I see related to ClusterAPIInstallGCP is:

level=error msg=failed to fetch Cluster: failed to generate asset "Cluster": failed to create cluster: failed during pre-provisioning: failed to add worker roles: failed to set project IAM policy: googleapi: Error 409: There were concurrent policy changes. Please retry the whole read-modify-write with exponential backoff. The request's ETag '\007\006\033\255\347+\335\210' did not match the current policy's ETag '\007\006\033\255\347>%\332'., aborted
Installer 

from: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.17-e2e-gcp-ovn-techpreview/1805429311365189632

That is something we'll want to fix and would potentially be fixed by https://issues.redhat.com/browse/CORS-3567
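
To illustrate what the 409 message is asking for ("retry the whole read-modify-write with exponential backoff"), here is a minimal Go sketch against an in-memory stand-in. Everything in it (policyStore, addMemberWithRetry, the integer etag) is hypothetical and only models the pattern; it is not the installer's code and not the Google API client.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errConflict stands in for an HTTP 409 caused by an ETag mismatch.
var errConflict = errors.New("conflict: etag mismatch")

// policy is a toy IAM policy: a version tag plus a member list.
type policy struct {
	etag    int
	members []string
}

// policyStore is a hypothetical in-memory policy endpoint. conflictsLeft
// simulates that many concurrent writers slipping in before our write.
type policyStore struct {
	current       policy
	conflictsLeft int
}

// get returns a copy of the current policy (the "read" step).
func (s *policyStore) get() policy {
	p := s.current
	p.members = append([]string(nil), s.current.members...)
	return p
}

// set stores p only if its etag still matches (the "write" step).
func (s *policyStore) set(p policy) error {
	if s.conflictsLeft > 0 {
		// Simulate another writer changing the policy first.
		s.conflictsLeft--
		s.current.etag++
		s.current.members = append(s.current.members, "other-writer")
	}
	if p.etag != s.current.etag {
		return errConflict
	}
	p.etag++
	s.current = p
	return nil
}

// addMemberWithRetry repeats the whole read-modify-write cycle on conflict,
// doubling the wait between attempts.
func addMemberWithRetry(s *policyStore, member string, maxAttempts int, baseDelay time.Duration) error {
	delay := baseDelay
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		p := s.get()                          // read a fresh copy every time
		p.members = append(p.members, member) // modify
		err := s.set(p)                       // write
		if err == nil {
			return nil
		}
		if !errors.Is(err, errConflict) {
			return err // only conflicts are retried
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, errConflict)
}

func main() {
	s := &policyStore{conflictsLeft: 2}
	err := addMemberWithRetry(s, "serviceAccount:worker", 5, 10*time.Millisecond)
	fmt.Println(err, s.current.members)
}
```

The key point is that the read happens inside the loop, so each retry picks up the other writers' changes instead of overwriting them. A real implementation would likely also add jitter to the delay.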

patrickdillon avatar Jul 15 '24 18:07 patrickdillon

/test verify

stbenjam avatar Jul 15 '24 18:07 stbenjam

The figure 71% seems off to me. That is, I don't think the infrastructure provisioning success rate is that low. I'm not sure where the discrepancy is coming from.

I'm still trying to figure out where the 71% number came from, but techpreview GCP infra success is low. The default Sippy view is "Working", which counts flake + success. For this check we're using success only.

Sippy is currently saying 89%. (There's a toggle in the toolbar to switch between Working and Passing.)

https://sippy.dptools.openshift.org/sippy-ng/tests/4.17/details?filters=%257B%2522items%2522%253A%255B%257B%2522columnField%2522%253A%2522current_runs%2522%252C%2522operatorValue%2522%253A%2522%253E%253D%2522%252C%2522value%2522%253A%25227%2522%257D%252C%257B%2522columnField%2522%253A%2522variants%2522%252C%2522not%2522%253Afalse%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522Platform%253Agcp%2522%257D%252C%257B%2522columnField%2522%253A%2522variants%2522%252C%2522not%2522%253Afalse%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522FeatureSet%253Atechpreview%2522%257D%252C%257B%2522columnField%2522%253A%2522variants%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522Topology%253Aha%2522%257D%252C%257B%2522id%2522%253A99%252C%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522infrastructure%2522%257D%252C%257B%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522install%2520should%2522%257D%255D%252C%2522linkOperator%2522%253A%2522and%2522%257D&period=default&sort=asc&sortField=current_pass_percentage&view=Passing
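
To make the Working vs. Passing distinction concrete, here is a small Go sketch of the two percentages and a threshold check like the 95% requirement in the verify error. The type and function names are hypothetical and the run counts are made-up numbers for illustration, not data from Sippy or from these jobs.

```go
package main

import "fmt"

// runCounts holds outcome counts for one test over some window.
// The field names are illustrative assumptions.
type runCounts struct {
	successes int
	flakes    int
	failures  int
}

func (c runCounts) total() int {
	return c.successes + c.flakes + c.failures
}

// percent returns 100*n/total, or 0 when there are no runs.
func percent(n, total int) float64 {
	if total == 0 {
		return 0
	}
	return 100 * float64(n) / float64(total)
}

// passingPercent counts successes only ("Passing" view).
func passingPercent(c runCounts) float64 {
	return percent(c.successes, c.total())
}

// workingPercent counts flake + success ("Working" view).
func workingPercent(c runCounts) float64 {
	return percent(c.successes+c.flakes, c.total())
}

// meetsThreshold applies the success-only rate against a required percentage.
func meetsThreshold(c runCounts, required float64) bool {
	return passingPercent(c) >= required
}

func main() {
	// Made-up example: 20 runs, 17 clean passes, 2 flakes, 1 failure.
	c := runCounts{successes: 17, flakes: 2, failures: 1}
	fmt.Printf("passing=%.0f%% working=%.0f%% meets 95%%: %v\n",
		passingPercent(c), workingPercent(c), meetsThreshold(c, 95))
}
```

With these made-up counts the same test shows 95% in the Working view but only 85% in the Passing view, so it would fail a success-only 95% gate.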

stbenjam avatar Jul 15 '24 18:07 stbenjam

Non-techpreview GCP definitely has much higher infra success https://sippy.dptools.openshift.org/sippy-ng/tests/4.17/details?filters=%257B%2522items%2522%253A%255B%257B%2522columnField%2522%253A%2522current_runs%2522%252C%2522operatorValue%2522%253A%2522%253E%253D%2522%252C%2522value%2522%253A%25227%2522%257D%252C%257B%2522columnField%2522%253A%2522variants%2522%252C%2522not%2522%253Afalse%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522Platform%253Agcp%2522%257D%252C%257B%2522columnField%2522%253A%2522variants%2522%252C%2522not%2522%253Atrue%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522FeatureSet%253Atechpreview%2522%257D%252C%257B%2522columnField%2522%253A%2522variants%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522Topology%253Aha%2522%257D%252C%257B%2522id%2522%253A99%252C%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522infrastructure%2522%257D%252C%257B%2522columnField%2522%253A%2522name%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522install%2520should%2522%257D%252C%257B%2522columnField%2522%253A%2522variants%2522%252C%2522operatorValue%2522%253A%2522contains%2522%252C%2522value%2522%253A%2522Architecture%253Aamd64%2522%257D%255D%252C%2522linkOperator%2522%253A%2522and%2522%257D&period=default&sort=asc&sortField=current_pass_percentage&view=Passing

stbenjam avatar Jul 15 '24 18:07 stbenjam

@2uasimojo: This PR was included in a payload test run from openshift/installer#8723 trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-hive-master-periodic-e2e-gcp-weekly

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e013e7f0-4453-11ef-8e9d-1b4fee3fe2e1-0

openshift-ci[bot] avatar Jul 17 '24 15:07 openshift-ci[bot]

/test verify

patrickdillon avatar Jul 29 '24 19:07 patrickdillon

Just reran verify and it looks like our bug fixes are paying off and we're trending in the right direction (86%, up from 71%):

 F0729 19:20:08.510204  169977 root.go:64] Error running codegen: error: "install should succeed: infrastructure" only passed 86%, need at least 95% for "ClusterAPIInstallGCP" on {gcp amd64 ha} 

patrickdillon avatar Jul 29 '24 19:07 patrickdillon

/test verify

patrickdillon avatar Jul 31 '24 14:07 patrickdillon

/lgtm

bfournie avatar Jul 31 '24 14:07 bfournie

I ran ~20 GCP techpreview jobs yesterday using gangway. Looking at the infrastructure test links that @stbenjam posted above, I believe we are now seeing a success rate of ~98%:

GCP TechPreview Infrastructure

This actually seems to be higher than the non-tech preview tests, which are at around 96-97%:

Non-tech preview

In other words, despite the verify test failures, this is looking good to me with regard to CI testing.

patrickdillon avatar Jul 31 '24 14:07 patrickdillon

/test verify

patrickdillon avatar Aug 01 '24 18:08 patrickdillon

/lgtm

r4f4 avatar Aug 02 '24 11:08 r4f4

/retest-required
/skip

patrickdillon avatar Aug 02 '24 15:08 patrickdillon

/lgtm

JoelSpeed avatar Aug 02 '24 15:08 JoelSpeed

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: barbacbd, bfournie, JoelSpeed, r4f4

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] avatar Aug 02 '24 15:08 openshift-ci[bot]

@barbacbd: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-azure | a912e2ab1441dda7183d1af14b4a0252a118934a | link | false | /test e2e-azure

openshift-ci[bot] avatar Aug 02 '24 20:08 openshift-ci[bot]

[ART PR BUILD NOTIFIER]

Distgit: ose-cluster-config-api
This PR has been included in build ose-cluster-config-api-container-v4.18.0-202408022143.p0.g346347b.assembly.stream.el9. All builds following this will include this PR.

openshift-bot avatar Aug 02 '24 22:08 openshift-bot