
OCPBUGS-26601: Re-enable test/extended/router/http2 tests on AWS

Open frobware opened this pull request 1 year ago • 28 comments

It's been a long time since we disabled these tests on AWS. I have been running the http2 tests on AWS all week and I haven't run into the issue once. Let's re-enable the http2 x AWS tests for better coverage.

This PR also addresses an intermittent issue encountered in AWS environments during the router's h2spec conformance tests. The challenge involved slower hostname resolution within the cluster, resulting in frequent timeouts. Notably, AWS exhibited slower resolution times compared to Azure or GCP, hinting at potential differences in DNS handling.

The solution implemented in this PR resolves the hostname on the test host before initiating the h2spec tests within the cluster. This adjustment has produced a marked improvement in test execution speed: the h2spec test now completes in approximately 85 seconds, down from a previous average of over 376 seconds (more than six minutes).
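
A minimal sketch of the idea (not the actual diff): the in-cluster h2spec run is gated on the route hostname becoming resolvable from the test host first. The helper name below is hypothetical; `wait.PollImmediate` and `net.LookupHost` are simply the standard Go/Kubernetes primitives such a check would typically use.

```go
package http2

import (
	"net"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForHostResolution is a hypothetical helper sketching the approach:
// block until the route's hostname resolves from the test host, so the
// in-cluster h2spec run doesn't spend its own budget waiting on DNS.
func waitForHostResolution(host string, interval, timeout time.Duration) error {
	return wait.PollImmediate(interval, timeout, func() (bool, error) {
		addrs, err := net.LookupHost(host)
		if err != nil || len(addrs) == 0 {
			// Not resolvable yet; keep polling instead of failing the test.
			return false, nil
		}
		return true, nil
	})
}
```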

While the difference in resolution times suggests environmental variations, particularly in AWS, it's important to note that this PR does not definitively attribute the issue to negative caching. Instead, it prioritises the substantial improvement achieved through the new approach. As a precaution, the polling interval and overall test timeout have been adjusted to 2 seconds and 10 minutes, respectively, to enhance test success rates across diverse cloud environments.
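
For illustration only, a hypothetical wrapper showing how those adjusted values might be wired into the completion poll, reusing the time and wait imports from the sketch above; runH2specWithTimeout and h2specFinished are placeholders, not the actual test code.

```go
// runH2specWithTimeout is a hypothetical wrapper applying the adjusted
// values from this PR: poll every 2 seconds and allow up to 10 minutes
// overall before giving up. h2specFinished stands in for whatever status
// check the real test performs against the h2spec run.
func runH2specWithTimeout(h2specFinished func() bool) error {
	const (
		pollInterval = 2 * time.Second
		testTimeout  = 10 * time.Minute
	)
	return wait.PollImmediate(pollInterval, testTimeout, func() (bool, error) {
		return h2specFinished(), nil
	})
}
```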

This PR is a practical win for test efficiency, while leaving the underlying environmental differences open for further investigation if needed.

Original bug: https://bugzilla.redhat.com/show_bug.cgi?id=1912413

frobware avatar Jan 10 '24 15:01 frobware

@frobware: This pull request references Jira Issue OCPBUGS-26601, which is invalid:

  • expected the bug to target the "4.16.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

It's been a long time since we disabled these tests on AWS. I have been running the http2 tests on AWS all week and I haven't run into the issue once. Let's re-enable the http2 x AWS tests for better coverage.

Original bug: https://bugzilla.redhat.com/show_bug.cgi?id=1912413

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Jan 10 '24 15:01 openshift-ci-robot

/jira refresh

frobware avatar Jan 10 '24 15:01 frobware

@frobware: This pull request references Jira Issue OCPBUGS-26601, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @lihongan

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Jan 10 '24 15:01 openshift-ci-robot

See https://github.com/openshift/origin/pull/26089

/approve

/lgtm

candita avatar Jan 10 '24 16:01 candita

/retest-required

Remaining retests: 0 against base HEAD e913a6484ff96d09171d9fd609096331b4b1cbfe and 2 for PR HEAD 09eb1f04ebab889fa1428b8218195528e7e276db in total

openshift-ci-robot avatar Jan 10 '24 16:01 openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 52f2f6b6e66a07134657845c0d4ee4d557e80af7 and 1 for PR HEAD 09eb1f04ebab889fa1428b8218195528e7e276db in total

openshift-ci-robot avatar Jan 11 '24 00:01 openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 663c840b1147e4eaf1e0576fc4e1c5d391e5f3ab and 0 for PR HEAD 09eb1f04ebab889fa1428b8218195528e7e276db in total

openshift-ci-robot avatar Jan 11 '24 05:01 openshift-ci-robot

/hold

Revision 09eb1f04ebab889fa1428b8218195528e7e276db was retested 3 times: holding

openshift-ci-robot avatar Jan 11 '24 08:01 openshift-ci-robot

/retest

frobware avatar Jan 11 '24 09:01 frobware

@frobware: This pull request references Jira Issue OCPBUGS-26601, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @lihongan

In response to this:

It's been a long time since we disabled these tests on AWS. I have been running the http2 tests on AWS all week and I haven't run into the issue once. Let's re-enable the http2 x AWS tests for better coverage.

This PR also addresses an intermittent issue encountered in AWS environments during the router's h2spec conformance tests. The challenge involved slower hostname resolution within the cluster, resulting in frequent timeouts. Notably, AWS exhibited slower resolution times compared to Azure or GCP, hinting at potential differences in DNS handling.

The solution implemented in this PR resolves the hostname on the test host before initiating the h2spec tests within the cluster. This adjustment has produced a marked improvement in test execution speed: the h2spec test now completes in approximately 85 seconds, down from a previous average of over 376 seconds (more than six minutes).

While the difference in resolution times suggests environmental variations, particularly in AWS, it's important to note that this PR does not definitively attribute the issue to negative caching. Instead, it prioritises the substantial improvement achieved through the new approach. As a precaution, the polling interval and overall test timeout have been adjusted to 2 seconds and 10 minutes, respectively, to enhance test success rates across diverse cloud environments.

This PR is a practical win for test efficiency, while leaving the underlying environmental differences open for further investigation if needed.

Original bug: https://bugzilla.redhat.com/show_bug.cgi?id=1912413

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Jan 11 '24 15:01 openshift-ci-robot

/test e2e-aws-ovn-upi

lihongan avatar Jan 12 '24 03:01 lihongan

@lihongan: The specified target(s) for /test were not found. The following commands are available to trigger required jobs:

  • /test e2e-aws-jenkins
  • /test e2e-aws-ovn-fips
  • /test e2e-aws-ovn-image-registry
  • /test e2e-aws-ovn-serial
  • /test e2e-gcp-ovn
  • /test e2e-gcp-ovn-builds
  • /test e2e-gcp-ovn-image-ecosystem
  • /test e2e-gcp-ovn-upgrade
  • /test e2e-metal-ipi-ovn-ipv6
  • /test images
  • /test lint
  • /test unit
  • /test verify
  • /test verify-deps

The following commands are available to trigger optional jobs:

  • /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade-rollback
  • /test e2e-agnostic-ovn-cmd
  • /test e2e-aws
  • /test e2e-aws-csi
  • /test e2e-aws-disruptive
  • /test e2e-aws-etcd-recovery
  • /test e2e-aws-multitenant
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-cgroupsv2
  • /test e2e-aws-ovn-etcd-scaling
  • /test e2e-aws-ovn-kubevirt
  • /test e2e-aws-ovn-single-node
  • /test e2e-aws-ovn-single-node-serial
  • /test e2e-aws-ovn-single-node-upgrade
  • /test e2e-aws-ovn-upgrade
  • /test e2e-aws-proxy
  • /test e2e-azure
  • /test e2e-azure-ovn-etcd-scaling
  • /test e2e-baremetalds-kubevirt
  • /test e2e-gcp-csi
  • /test e2e-gcp-disruptive
  • /test e2e-gcp-fips-serial
  • /test e2e-gcp-ovn-etcd-scaling
  • /test e2e-gcp-ovn-rt-upgrade
  • /test e2e-gcp-ovn-techpreview
  • /test e2e-gcp-ovn-techpreview-serial
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-sdn
  • /test e2e-metal-ipi-serial
  • /test e2e-metal-ipi-serial-ovn-ipv6
  • /test e2e-metal-ipi-virtualmedia
  • /test e2e-openstack-ovn
  • /test e2e-openstack-serial
  • /test e2e-vsphere
  • /test e2e-vsphere-ovn-etcd-scaling
  • /test okd-e2e-gcp

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-origin-master-e2e-agnostic-ovn-cmd
  • pull-ci-openshift-origin-master-e2e-aws-csi
  • pull-ci-openshift-origin-master-e2e-aws-ovn-cgroupsv2
  • pull-ci-openshift-origin-master-e2e-aws-ovn-fips
  • pull-ci-openshift-origin-master-e2e-aws-ovn-serial
  • pull-ci-openshift-origin-master-e2e-aws-ovn-single-node
  • pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial
  • pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade
  • pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade
  • pull-ci-openshift-origin-master-e2e-gcp-csi
  • pull-ci-openshift-origin-master-e2e-gcp-ovn
  • pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade
  • pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade
  • pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6
  • pull-ci-openshift-origin-master-e2e-metal-ipi-sdn
  • pull-ci-openshift-origin-master-e2e-openstack-ovn
  • pull-ci-openshift-origin-master-images
  • pull-ci-openshift-origin-master-lint
  • pull-ci-openshift-origin-master-unit
  • pull-ci-openshift-origin-master-verify
  • pull-ci-openshift-origin-master-verify-deps

In response to this:

/test e2e-aws-ovn-upi

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci[bot] avatar Jan 12 '24 03:01 openshift-ci[bot]

This probably requires https://github.com/openshift/cloud-provider-aws/pull/57

candita avatar Jan 12 '24 17:01 candita

/retest-required

candita avatar Feb 15 '24 19:02 candita

/lgtm

candita avatar Feb 15 '24 20:02 candita

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: candita, frobware

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci[bot] avatar Feb 15 '24 20:02 openshift-ci[bot]

/unhold

candita avatar Feb 16 '24 00:02 candita

/retest-required

Remaining retests: 0 against base HEAD 7812f3cbbadfff5e7570f48b2dce3331ab24b729 and 2 for PR HEAD 00ea63b861570609c0bd7b02254c303772ea5b33 in total

openshift-ci-robot avatar Feb 16 '24 01:02 openshift-ci-robot

/hold

I think the consensus was that this PR still requires https://github.com/openshift/cloud-provider-aws/pull/57.

frobware avatar Feb 16 '24 10:02 frobware

Slack discussion: https://redhat-internal.slack.com/archives/CBWMXQJKD/p1704908895477469.

frobware avatar May 09 '24 08:05 frobware

/hold

I think the consensus was that this PR still requires openshift/cloud-provider-aws#57.

openshift/cloud-provider-aws#57 has merged.

/test all

frobware avatar May 09 '24 15:05 frobware

/retest

frobware avatar May 10 '24 11:05 frobware

/jira refresh

The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity.

openshift-bot avatar May 17 '24 09:05 openshift-bot

@openshift-bot: This pull request references Jira Issue OCPBUGS-26601, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @lihongan

In response to this:

/jira refresh

The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar May 17 '24 09:05 openshift-ci-robot

/jira refresh

The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity.

openshift-bot avatar May 17 '24 15:05 openshift-bot

@openshift-bot: This pull request references Jira Issue OCPBUGS-26601, which is invalid:

  • expected the bug to target either version "4.17." or "openshift-4.17.", but it targets "4.16.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/jira refresh

The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target different OCP), recalculating validity.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar May 17 '24 15:05 openshift-ci-robot

/test all

frobware avatar May 22 '24 07:05 frobware

/test all

frobware avatar Jun 10 '24 08:06 frobware

failed to initialize the cluster: Cluster operators authentication, console, control-plane-machine-set, image-registry, ingress, machine-api, monitoring are not available

/test e2e-gcp-ovn-upgrade

frobware avatar Jun 11 '24 10:06 frobware

/hold cancel

From https://redhat-internal.slack.com/archives/CBWMXQJKD/p1715681553819319?thread_ts=1704908895.477469&cid=CBWMXQJKD

we still don't have an accepted nightly build that includes the PR, but I ran a flexy job and installed an AWS UPI cluster with 4.16.0-0.nightly-2024-05-13-102953 (rejected) today. I believe the issue is fixed: the ingresscontroller as well as the LB service can be deleted within about 1'20'', and no k8s rules are leaking in the Security Groups.

cc @lihongan

frobware avatar Jun 11 '24 12:06 frobware