OCPBUGS-59176: fix several failing tests in gcp-custom-dns job

Open lihongan opened this issue 4 months ago • 21 comments

Some e2e tests fail in the "gcp-custom-dns" job for the feature gate "GCPClusterHostedDNSInstall", which is promoted to GA in 4.20. In a "custom-dns" cluster, OpenShift starts static CoreDNS pods to provide DNS resolution for the API, Internal API and Ingress services that are essential for cluster creation. After cluster deployment completes, the customer updates their external DNS solution with the same assigned LB IP addresses that were used to configure the internal CoreDNS instance.

The failing tests, such as the http2 and grpc tests, use dedicated ingresscontrollers, and the gateway also has a separate LB and dnsrecord, so the default wildcard created by the new static CoreDNS does not work for those tests.

To fix the failing tests, we can force requests to use the LoadBalancer IP address directly and bypass DNS resolution.

Also update the http2/grpc shard ingresscontroller to NOT use a domain like "e2e-test-xxx.apps.baseDomain", to avoid overlapping with the default wildcard "*.apps.baseDomain".
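
To illustrate the overlap, here is a small sketch that checks whether a shard domain falls under a wildcard. The function name underWildcard and the example.com domains are hypothetical, and the check assumes the wildcard covers every name below its suffix.

```go
package main

import (
	"fmt"
	"strings"
)

// underWildcard reports whether host falls under a wildcard such as
// "*.apps.example.com", i.e. whether it is a name below "apps.example.com".
// (Hypothetical helper, for illustration only.)
func underWildcard(host, wildcard string) bool {
	if !strings.HasPrefix(wildcard, "*.") {
		return false
	}
	// Keep the leading dot so "apps.example.com" itself does not match.
	suffix := strings.ToLower(wildcard[1:])
	host = strings.ToLower(strings.TrimSuffix(host, "."))
	return strings.HasSuffix(host, suffix)
}

func main() {
	wildcard := "*.apps.example.com"
	// Overlaps with the default wildcard.
	fmt.Println(underWildcard("e2e-test-abc.apps.example.com", wildcard))
	// A shard domain outside "apps." does not overlap.
	fmt.Println(underWildcard("e2e-test-abc.example.com", wildcard))
}
```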

lihongan avatar Aug 19 '25 10:08 lihongan

@lihongan: This pull request references Jira Issue OCPBUGS-59176, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.20.0) matches configured target version for branch (4.20.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @lihongan

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Aug 19 '25 10:08 openshift-ci-robot

@openshift-ci-robot: GitHub didn't allow me to request PR reviews from the following users: lihongan.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci[bot] avatar Aug 19 '25 10:08 openshift-ci[bot]

/hold

The gRPC DialContext is not fixed yet.

lihongan avatar Aug 19 '25 10:08 lihongan

/assign

As a continuation of https://github.com/openshift/origin/pull/29985.

alebedev87 avatar Aug 19 '25 15:08 alebedev87

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lihongan

Once this PR has been reviewed and has the lgtm label, please ask for approval from alebedev87. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci[bot] avatar Aug 20 '25 10:08 openshift-ci[bot]

/unhold

The gRPC dialer is updated as well, so it can send requests to the LB directly if DNS does not work.

lihongan avatar Aug 20 '25 10:08 lihongan

Job Failure Risk Analysis for sha: ae965e978cffcf821bd60582102548e3fff18c35

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-aws-disruptive Medium
[sig-arch] events should not repeat pathologically for ns/openshift-kube-apiserver-operator
Potential external regression detected for High Risk Test analysis
---
[sig-node] static pods should start after being created
Potential external regression detected for High Risk Test analysis
---
[bz-Etcd] clusteroperator/etcd should not change condition/Available
Potential external regression detected for High Risk Test analysis
---
[sig-cli][OCPFeatureGate:UpgradeStatus] oc amd upgrade status never fails
Potential external regression detected for High Risk Test analysis

openshift-trt[bot] avatar Aug 20 '25 15:08 openshift-trt[bot]

/retest-required

lihongan avatar Aug 25 '25 02:08 lihongan

/retest-required

lihongan avatar Aug 26 '25 01:08 lihongan

/retest-required

lihongan avatar Aug 26 '25 06:08 lihongan

/test e2e-gcp-ovn-techpreview-serial-2of2

sdodson avatar Aug 26 '25 20:08 sdodson

/payload-job-with-prs periodic-ci-openshift-release-master-nightly-4.20-e2e-gcp-custom-dns-techpreview https://github.com/openshift/release/pull/68515

lihongan avatar Aug 27 '25 05:08 lihongan

@lihongan: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

openshift-ci[bot] avatar Aug 27 '25 05:08 openshift-ci[bot]

/payload-job-with-prs periodic-ci-openshift-release-master-nightly-4.20-e2e-gcp-custom-dns-techpreview openshift/release#68515

alebedev87 avatar Aug 27 '25 13:08 alebedev87

@alebedev87: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.20-e2e-gcp-custom-dns-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7bc283a0-8346-11f0-940f-c2b3445376f7-0

openshift-ci[bot] avatar Aug 27 '25 13:08 openshift-ci[bot]

/payload-job-with-prs periodic-ci-openshift-release-master-nightly-4.20-e2e-gcp-custom-dns-techpreview openshift/release#68515

Thank you, Andrew. It looks like the job you triggered failed at install; let me retest.

lihongan avatar Aug 28 '25 02:08 lihongan

@lihongan: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.20-e2e-gcp-custom-dns-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c0f7d490-83b2-11f0-9c8f-1999f28ec0d9-0

openshift-ci[bot] avatar Aug 28 '25 02:08 openshift-ci[bot]

PR needs rebase.

openshift-merge-robot avatar Oct 17 '25 12:10 openshift-merge-robot

@lihongan: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-metal-ipi-ovn-kube-apiserver-rollout | 88d6ced25b6b44592780cc5014c0bee8765866bb | link | false | /test e2e-metal-ipi-ovn-kube-apiserver-rollout |
| ci/prow/e2e-metal-ipi-virtualmedia | 88d6ced25b6b44592780cc5014c0bee8765866bb | link | false | /test e2e-metal-ipi-virtualmedia |
| ci/prow/e2e-aws-disruptive | 88d6ced25b6b44592780cc5014c0bee8765866bb | link | false | /test e2e-aws-disruptive |
| ci/prow/e2e-metal-ipi-serial-1of2 | 88d6ced25b6b44592780cc5014c0bee8765866bb | link | false | /test e2e-metal-ipi-serial-1of2 |
| ci/prow/e2e-metal-ipi-serial-2of2 | 88d6ced25b6b44592780cc5014c0bee8765866bb | link | false | /test e2e-metal-ipi-serial-2of2 |
| ci/prow/e2e-metal-ipi-serial-ovn-ipv6-2of2 | 88d6ced25b6b44592780cc5014c0bee8765866bb | link | false | /test e2e-metal-ipi-serial-ovn-ipv6-2of2 |
| ci/prow/e2e-aws-proxy | 88d6ced25b6b44592780cc5014c0bee8765866bb | link | false | /test e2e-aws-proxy |
| ci/prow/e2e-aws-ovn-single-node-upgrade | 88d6ced25b6b44592780cc5014c0bee8765866bb | link | false | /test e2e-aws-ovn-single-node-upgrade |
| ci/prow/e2e-aws-ovn-kube-apiserver-rollout | 88d6ced25b6b44592780cc5014c0bee8765866bb | link | false | /test e2e-aws-ovn-kube-apiserver-rollout |
| ci/prow/e2e-metal-ipi-ovn-dualstack-local-gateway | 88d6ced25b6b44592780cc5014c0bee8765866bb | link | false | /test e2e-metal-ipi-ovn-dualstack-local-gateway |
| ci/prow/e2e-openstack-ovn | 88d6ced25b6b44592780cc5014c0bee8765866bb | link | false | /test e2e-openstack-ovn |
| ci/prow/e2e-metal-ipi-serial-ovn-ipv6-1of2 | 88d6ced25b6b44592780cc5014c0bee8765866bb | link | false | /test e2e-metal-ipi-serial-ovn-ipv6-1of2 |
| ci/prow/e2e-metal-ipi-ovn | 88d6ced25b6b44592780cc5014c0bee8765866bb | link | false | /test e2e-metal-ipi-ovn |
| ci/prow/e2e-metal-ipi-ovn-dualstack | 88d6ced25b6b44592780cc5014c0bee8765866bb | link | false | /test e2e-metal-ipi-ovn-dualstack |
| ci/prow/e2e-gcp-csi | 88d6ced25b6b44592780cc5014c0bee8765866bb | link | true | /test e2e-gcp-csi |
| ci/prow/e2e-aws-csi | 88d6ced25b6b44592780cc5014c0bee8765866bb | link | true | /test e2e-aws-csi |
| ci/prow/go-verify-deps | 88d6ced25b6b44592780cc5014c0bee8765866bb | link | true | /test go-verify-deps |
| ci/prow/e2e-aws-ovn-microshift | 88d6ced25b6b44592780cc5014c0bee8765866bb | link | true | /test e2e-aws-ovn-microshift |
| ci/prow/e2e-aws-ovn-microshift-serial | 88d6ced25b6b44592780cc5014c0bee8765866bb | link | true | /test e2e-aws-ovn-microshift-serial |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | 88d6ced25b6b44592780cc5014c0bee8765866bb | link | true | /test e2e-metal-ipi-ovn-ipv6 |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci[bot] avatar Nov 18 '25 12:11 openshift-ci[bot]

Job Failure Risk Analysis for sha: 88d6ced25b6b44592780cc5014c0bee8765866bb

Job Name Failure Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-single-node-upgrade Medium
Job run should complete before timeout
This test has passed 91.46% of 5282 runs on release 4.21 [Overall] in the last week.
pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-ipv6 IncompleteTests
Tests for this run (2) are below the historical average (2444): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

openshift-trt[bot] avatar Nov 18 '25 13:11 openshift-trt[bot]