origin OCPBUGS-59176: fix several failing tests in gcp-custom-dns job

Some e2e tests fail in the "gcp-custom-dns" job, which covers the "GCPClusterHostedDNSInstall" feature gate promoted to GA in 4.20. In a "custom-dns" cluster, OpenShift starts static CoreDNS pods to provide DNS resolution for the API, Internal API and Ingress services that are essential for cluster creation. After cluster deployment completes, the customer updates their external DNS solution with the same LB IP addresses that were used to configure the internal CoreDNS instance.

The failing tests, such as the http2 and grpc tests, use dedicated ingresscontrollers, and the gateway also has its own LB and dnsrecord, so the default wildcard record created by the new static CoreDNS does not work for those tests.

To fix the failing tests, we can force the requests to use the LoadBalancer IP address directly, bypassing DNS resolution.
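The idea of sending a request straight to the LoadBalancer address can be sketched with the Go standard library as below. This is a minimal illustration, not the code from this PR: `newClientViaLB` and `hostSeenByServer` are hypothetical names, and a local `httptest` server stands in for the load balancer. The client overrides `DialContext` so the connection always goes to the given address, while the hostname from the URL still travels in the Host header.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"time"
)

// newClientViaLB (hypothetical helper) returns an HTTP client that ignores the
// address derived from the URL host and always connects to lbAddr ("ip:port").
// The URL host is still sent in the Host header, so a router can match the route.
func newClientViaLB(lbAddr string) *http.Client {
	dialer := &net.Dialer{Timeout: 10 * time.Second}
	return &http.Client{
		Timeout: 30 * time.Second,
		Transport: &http.Transport{
			// The address computed from the URL is discarded, so the URL
			// hostname is never resolved through DNS.
			DialContext: func(ctx context.Context, network, _ string) (net.Conn, error) {
				return dialer.DialContext(ctx, network, lbAddr)
			},
		},
	}
}

// hostSeenByServer starts a local stand-in for the load balancer, requests
// http://<host>/ through it, and returns the Host header the server received.
func hostSeenByServer(host string) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, r.Host)
	}))
	defer srv.Close()

	client := newClientViaLB(srv.Listener.Addr().String())
	resp, err := client.Get("http://" + host + "/")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	// The hostname below is not resolvable; the request still succeeds
	// because the dialer connects to the stand-in LB address directly.
	got, err := hostSeenByServer("route.example.invalid")
	if err != nil {
		panic(err)
	}
	fmt.Println(got)
}
```

In a real e2e test the address would presumably come from the LoadBalancer service status of the dedicated ingresscontroller; that lookup is omitted here.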

Also update the http2/grpc shard ingresscontroller to NOT use a domain like "e2e-test-xxx.apps.<baseDomain>", to avoid overlapping with the default wildcard "*.apps.<baseDomain>".
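To illustrate the overlap, here is a small sketch of a check for whether a name falls under a wildcard record. The function `coveredByWildcard` and the `example.com` domains are hypothetical and only stand in for the real base domain; the PR itself may choose shard domains differently.

```go
package main

import (
	"fmt"
	"strings"
)

// coveredByWildcard (hypothetical helper) reports whether name sits below the
// parent of a wildcard record such as "*.apps.example.com".
func coveredByWildcard(name, wildcard string) bool {
	// "*.apps.example.com" -> ".apps.example.com"
	parent := strings.ToLower(strings.TrimPrefix(wildcard, "*"))
	// Normalize case and drop a trailing root dot, if any.
	name = strings.ToLower(strings.TrimSuffix(name, "."))
	return strings.HasSuffix(name, parent)
}

func main() {
	wildcard := "*.apps.example.com"
	// A shard domain under "apps" is answered by the default wildcard.
	fmt.Println(coveredByWildcard("e2e-test-xxx.apps.example.com", wildcard))
	// A shard domain outside "apps" is not.
	fmt.Println(coveredByWildcard("e2e-test-xxx.example.com", wildcard))
}
```

A shard domain that the default wildcard covers would resolve to the default ingress LB rather than the shard's own LB, which is the overlap the change avoids.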

Aug 19 '25 10:08 lihongan

@lihongan: This pull request references Jira Issue OCPBUGS-59176, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.20.0) matches configured target version for branch (4.20.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @lihongan

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Aug 19 '25 10:08 openshift-ci-robot

@openshift-ci-robot: GitHub didn't allow me to request PR reviews from the following users: lihongan.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Aug 19 '25 10:08 openshift-ci[bot]

/hold

The gRPC DialContext is not fixed yet.

Aug 19 '25 10:08 lihongan

/assign

As a continuation of https://github.com/openshift/origin/pull/29985.

Aug 19 '25 15:08 alebedev87

Xref: PR for skips on custom dns techpreview job.

Aug 19 '25 16:08 alebedev87

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lihongan
Once this PR has been reviewed and has the lgtm label, please ask for approval from alebedev87. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

test/extended/router/OWNERS

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

Aug 20 '25 10:08 openshift-ci[bot]

/unhold

The gRPC dialer is updated as well, so that it can send requests to the LB directly if DNS does not work.
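A dialer with that fallback behavior might look roughly like the sketch below. It uses only the standard library and is an assumption about the shape of the change, not the PR's actual code: `dialWithLBFallback` and `fallbackAddr` are hypothetical names. The returned function has the type that grpc-go's `grpc.WithContextDialer` accepts, `func(context.Context, string) (net.Conn, error)`.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
)

// fallbackAddr (hypothetical helper) keeps the port of addr ("host:port")
// but swaps the host for the load balancer IP.
func fallbackAddr(addr, lbIP string) (string, error) {
	_, port, err := net.SplitHostPort(addr)
	if err != nil {
		return "", err
	}
	return net.JoinHostPort(lbIP, port), nil
}

// dialWithLBFallback (hypothetical helper) returns a dialer that first tries
// addr as given and, only when the failure is a DNS error, retries against
// lbIP on the same port.
func dialWithLBFallback(lbIP string) func(ctx context.Context, addr string) (net.Conn, error) {
	return func(ctx context.Context, addr string) (net.Conn, error) {
		var d net.Dialer
		conn, err := d.DialContext(ctx, "tcp", addr)
		if err == nil {
			return conn, nil
		}
		// Errors that are not DNS failures are returned unchanged.
		var dnsErr *net.DNSError
		if !errors.As(err, &dnsErr) {
			return nil, err
		}
		direct, splitErr := fallbackAddr(addr, lbIP)
		if splitErr != nil {
			return nil, splitErr
		}
		return d.DialContext(ctx, "tcp", direct)
	}
}

func main() {
	// 203.0.113.10 is a documentation-range IP used purely for illustration.
	direct, err := fallbackAddr("grpc.example.invalid:443", "203.0.113.10")
	if err != nil {
		panic(err)
	}
	fmt.Println(direct)
	_ = dialWithLBFallback("203.0.113.10") // would be passed to grpc.WithContextDialer
}
```

When TLS is in use, the server name for certificate verification still has to be the route hostname rather than the IP; that part is outside this sketch.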

Aug 20 '25 10:08 lihongan

Job Failure Risk Analysis for sha: ae965e978cffcf821bd60582102548e3fff18c35

Job Name: pull-ci-openshift-origin-main-e2e-aws-disruptive
Failure Risk: Medium

[sig-arch] events should not repeat pathologically for ns/openshift-kube-apiserver-operator
Potential external regression detected for High Risk Test analysis
---
[sig-node] static pods should start after being created
Potential external regression detected for High Risk Test analysis
---
[bz-Etcd] clusteroperator/etcd should not change condition/Available
Potential external regression detected for High Risk Test analysis
---
[sig-cli][OCPFeatureGate:UpgradeStatus] oc amd upgrade status never fails
Potential external regression detected for High Risk Test analysis

Aug 20 '25 15:08 openshift-trt[bot]

/retest-required

Aug 25 '25 02:08 lihongan

/retest-required

Aug 26 '25 01:08 lihongan

/retest-required

Aug 26 '25 06:08 lihongan

/test e2e-gcp-ovn-techpreview-serial-2of2

Aug 26 '25 20:08 sdodson

/payload-job-with-prs periodic-ci-openshift-release-master-nightly-4.20-e2e-gcp-custom-dns-techpreview https://github.com/openshift/release/pull/68515

Aug 27 '25 05:08 lihongan

@lihongan: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

Aug 27 '25 05:08 openshift-ci[bot]

/payload-job-with-prs periodic-ci-openshift-release-master-nightly-4.20-e2e-gcp-custom-dns-techpreview openshift/release#68515

Aug 27 '25 13:08 alebedev87

@alebedev87: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-nightly-4.20-e2e-gcp-custom-dns-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7bc283a0-8346-11f0-940f-c2b3445376f7-0

Aug 27 '25 13:08 openshift-ci[bot]

/payload-job-with-prs periodic-ci-openshift-release-master-nightly-4.20-e2e-gcp-custom-dns-techpreview openshift/release#68515

Thank you, Andrew. It looks like the job you triggered failed at install. Let me retest.

Aug 28 '25 02:08 lihongan

@lihongan: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-nightly-4.20-e2e-gcp-custom-dns-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c0f7d490-83b2-11f0-9c8f-1999f28ec0d9-0

Aug 28 '25 02:08 openshift-ci[bot]

PR needs rebase.

Oct 17 '25 12:10 openshift-merge-robot

@lihongan: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-metal-ipi-ovn-kube-apiserver-rollout	88d6ced25b6b44592780cc5014c0bee8765866bb	link	false	`/test e2e-metal-ipi-ovn-kube-apiserver-rollout`
ci/prow/e2e-metal-ipi-virtualmedia	88d6ced25b6b44592780cc5014c0bee8765866bb	link	false	`/test e2e-metal-ipi-virtualmedia`
ci/prow/e2e-aws-disruptive	88d6ced25b6b44592780cc5014c0bee8765866bb	link	false	`/test e2e-aws-disruptive`
ci/prow/e2e-metal-ipi-serial-1of2	88d6ced25b6b44592780cc5014c0bee8765866bb	link	false	`/test e2e-metal-ipi-serial-1of2`
ci/prow/e2e-metal-ipi-serial-2of2	88d6ced25b6b44592780cc5014c0bee8765866bb	link	false	`/test e2e-metal-ipi-serial-2of2`
ci/prow/e2e-metal-ipi-serial-ovn-ipv6-2of2	88d6ced25b6b44592780cc5014c0bee8765866bb	link	false	`/test e2e-metal-ipi-serial-ovn-ipv6-2of2`
ci/prow/e2e-aws-proxy	88d6ced25b6b44592780cc5014c0bee8765866bb	link	false	`/test e2e-aws-proxy`
ci/prow/e2e-aws-ovn-single-node-upgrade	88d6ced25b6b44592780cc5014c0bee8765866bb	link	false	`/test e2e-aws-ovn-single-node-upgrade`
ci/prow/e2e-aws-ovn-kube-apiserver-rollout	88d6ced25b6b44592780cc5014c0bee8765866bb	link	false	`/test e2e-aws-ovn-kube-apiserver-rollout`
ci/prow/e2e-metal-ipi-ovn-dualstack-local-gateway	88d6ced25b6b44592780cc5014c0bee8765866bb	link	false	`/test e2e-metal-ipi-ovn-dualstack-local-gateway`
ci/prow/e2e-openstack-ovn	88d6ced25b6b44592780cc5014c0bee8765866bb	link	false	`/test e2e-openstack-ovn`
ci/prow/e2e-metal-ipi-serial-ovn-ipv6-1of2	88d6ced25b6b44592780cc5014c0bee8765866bb	link	false	`/test e2e-metal-ipi-serial-ovn-ipv6-1of2`
ci/prow/e2e-metal-ipi-ovn	88d6ced25b6b44592780cc5014c0bee8765866bb	link	false	`/test e2e-metal-ipi-ovn`
ci/prow/e2e-metal-ipi-ovn-dualstack	88d6ced25b6b44592780cc5014c0bee8765866bb	link	false	`/test e2e-metal-ipi-ovn-dualstack`
ci/prow/e2e-gcp-csi	88d6ced25b6b44592780cc5014c0bee8765866bb	link	true	`/test e2e-gcp-csi`
ci/prow/e2e-aws-csi	88d6ced25b6b44592780cc5014c0bee8765866bb	link	true	`/test e2e-aws-csi`
ci/prow/go-verify-deps	88d6ced25b6b44592780cc5014c0bee8765866bb	link	true	`/test go-verify-deps`
ci/prow/e2e-aws-ovn-microshift	88d6ced25b6b44592780cc5014c0bee8765866bb	link	true	`/test e2e-aws-ovn-microshift`
ci/prow/e2e-aws-ovn-microshift-serial	88d6ced25b6b44592780cc5014c0bee8765866bb	link	true	`/test e2e-aws-ovn-microshift-serial`
ci/prow/e2e-metal-ipi-ovn-ipv6	88d6ced25b6b44592780cc5014c0bee8765866bb	link	true	`/test e2e-metal-ipi-ovn-ipv6`

Full PR test history. Your PR dashboard.

Nov 18 '25 12:11 openshift-ci[bot]

Job Failure Risk Analysis for sha: 88d6ced25b6b44592780cc5014c0bee8765866bb

Job Name	Failure Risk
pull-ci-openshift-origin-main-e2e-aws-ovn-single-node-upgrade	Medium Job run should complete before timeout This test has passed 91.46% of 5282 runs on release 4.21 [Overall] in the last week.
pull-ci-openshift-origin-main-e2e-metal-ipi-ovn-ipv6	IncompleteTests Tests for this run (2) are below the historical average (2444): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

Nov 18 '25 13:11 openshift-trt[bot]