origin

ETCD-565: add manual etcd signer cert rotation e2e test

Open tjungblu opened this issue 1 year ago • 13 comments

This PR adds a suite of tests related to rotation of etcd certificates.

tjungblu avatar Apr 03 '24 09:04 tjungblu
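
For readers unfamiliar with what a signer rotation test has to establish, here is a minimal sketch. It is not code from this PR. It uses only the Go standard library, and the helper names (`newSigner`, `issueLeaf`, `trustedBy`) and the common names are hypothetical. It assumes the usual rotation pattern, in which a certificate issued by the old signer stays valid only while the old signer remains in the trust bundle.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newSigner creates a self-signed CA certificate and its private key.
func newSigner(cn string) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(time.Now().UnixNano()),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             time.Now().Add(-time.Minute),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign | x509.KeyUsageDigitalSignature,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// issueLeaf signs a server certificate with the given signer.
func issueLeaf(cn string, signer *x509.Certificate, signerKey *ecdsa.PrivateKey) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(time.Now().UnixNano()),
		Subject:      pkix.Name{CommonName: cn},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, signer, &key.PublicKey, signerKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// trustedBy reports whether leaf chains to at least one of the given signers.
func trustedBy(leaf *x509.Certificate, signers ...*x509.Certificate) bool {
	pool := x509.NewCertPool()
	for _, s := range signers {
		pool.AddCert(s)
	}
	_, err := leaf.Verify(x509.VerifyOptions{
		Roots:     pool,
		KeyUsages: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	})
	return err == nil
}

func main() {
	oldCA, oldKey, err := newSigner("example-signer")
	if err != nil {
		panic(err)
	}
	leaf, err := issueLeaf("example-server", oldCA, oldKey)
	if err != nil {
		panic(err)
	}
	// "Rotation": a fresh signer with the same subject but a new key.
	newCA, _, err := newSigner("example-signer")
	if err != nil {
		panic(err)
	}
	fmt.Println("bundle with old and new signer trusts old leaf:", trustedBy(leaf, oldCA, newCA))
	fmt.Println("new signer alone trusts old leaf:", trustedBy(leaf, newCA))
}
```

The point of the sketch is the difference between the two checks: a rotation test generally needs to confirm both that existing certificates keep working during the transition and that they are eventually reissued under the new signer.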

@tjungblu: This pull request references ETCD-565 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Apr 03 '24 10:04 openshift-ci-robot

/test ?

tjungblu avatar Apr 04 '24 12:04 tjungblu

@tjungblu: The following commands are available to trigger required jobs:

  • /test e2e-aws-jenkins
  • /test e2e-aws-ovn-fips
  • /test e2e-aws-ovn-image-registry
  • /test e2e-aws-ovn-serial
  • /test e2e-gcp-ovn
  • /test e2e-gcp-ovn-builds
  • /test e2e-gcp-ovn-image-ecosystem
  • /test e2e-gcp-ovn-upgrade
  • /test e2e-metal-ipi-ovn-ipv6
  • /test images
  • /test lint
  • /test unit
  • /test verify
  • /test verify-deps

The following commands are available to trigger optional jobs:

  • /test 4.12-upgrade-from-stable-4.11-e2e-aws-ovn-upgrade-rollback
  • /test e2e-agnostic-ovn-cmd
  • /test e2e-aws
  • /test e2e-aws-csi
  • /test e2e-aws-disruptive
  • /test e2e-aws-etcd-recovery
  • /test e2e-aws-multitenant
  • /test e2e-aws-ovn
  • /test e2e-aws-ovn-cgroupsv2
  • /test e2e-aws-ovn-etcd-scaling
  • /test e2e-aws-ovn-kubevirt
  • /test e2e-aws-ovn-single-node
  • /test e2e-aws-ovn-single-node-serial
  • /test e2e-aws-ovn-single-node-upgrade
  • /test e2e-aws-ovn-upgrade
  • /test e2e-aws-ovn-upi
  • /test e2e-aws-proxy
  • /test e2e-azure
  • /test e2e-azure-ovn-etcd-scaling
  • /test e2e-baremetalds-kubevirt
  • /test e2e-gcp-csi
  • /test e2e-gcp-disruptive
  • /test e2e-gcp-fips-serial
  • /test e2e-gcp-ovn-etcd-scaling
  • /test e2e-gcp-ovn-rt-upgrade
  • /test e2e-gcp-ovn-techpreview
  • /test e2e-gcp-ovn-techpreview-serial
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-ovn-dualstack-local-gateway
  • /test e2e-metal-ipi-sdn
  • /test e2e-metal-ipi-serial
  • /test e2e-metal-ipi-serial-ovn-ipv6
  • /test e2e-metal-ipi-virtualmedia
  • /test e2e-openstack-ovn
  • /test e2e-openstack-serial
  • /test e2e-vsphere
  • /test e2e-vsphere-ovn-dualstack-primaryv6
  • /test e2e-vsphere-ovn-etcd-scaling
  • /test okd-e2e-gcp

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-origin-master-e2e-agnostic-ovn-cmd
  • pull-ci-openshift-origin-master-e2e-aws-csi
  • pull-ci-openshift-origin-master-e2e-aws-ovn-cgroupsv2
  • pull-ci-openshift-origin-master-e2e-aws-ovn-fips
  • pull-ci-openshift-origin-master-e2e-aws-ovn-serial
  • pull-ci-openshift-origin-master-e2e-aws-ovn-single-node
  • pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial
  • pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade
  • pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade
  • pull-ci-openshift-origin-master-e2e-gcp-csi
  • pull-ci-openshift-origin-master-e2e-gcp-ovn
  • pull-ci-openshift-origin-master-e2e-gcp-ovn-builds
  • pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade
  • pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade
  • pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6
  • pull-ci-openshift-origin-master-e2e-metal-ipi-sdn
  • pull-ci-openshift-origin-master-e2e-openstack-ovn
  • pull-ci-openshift-origin-master-images
  • pull-ci-openshift-origin-master-lint
  • pull-ci-openshift-origin-master-unit
  • pull-ci-openshift-origin-master-verify
  • pull-ci-openshift-origin-master-verify-deps

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci[bot] avatar Apr 04 '24 12:04 openshift-ci[bot]

/test e2e-aws-disruptive
/test e2e-aws-etcd-recovery

tjungblu avatar Apr 04 '24 12:04 tjungblu

Job Failure Risk Analysis for sha: 47eef202b7733ab1abca86bcd34c627243ac5373

Job Name | Failure Risk
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6 | IncompleteTests
Tests for this run (100) are below the historical average (1099): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial | Medium
[sig-arch] events should not repeat pathologically for ns/openshift-authentication-operator
This test has passed 90.62% of 64 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.

Open Bugs
Auth operator capable of firing over 100 events in seconds on OpenShiftAPICheckFailed

openshift-trt-bot avatar Apr 04 '24 15:04 openshift-trt-bot

/retest

tjungblu avatar Apr 17 '24 14:04 tjungblu

Job Failure Risk Analysis for sha: 87b3bada7916590c754160f39fddc4e574b2c840

Job Name | Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial | High
[sig-api-machinery] disruption/cache-kube-api connection/reused should be available throughout the test
This test has passed 100.00% of 70 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.
---
[sig-api-machinery] disruption/cache-openshift-api connection/reused should be available throughout the test
This test has passed 100.00% of 70 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.
---
[sig-api-machinery] disruption/cache-oauth-api connection/new should be available throughout the test
This test has passed 100.00% of 70 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.
---
[sig-api-machinery] disruption/oauth-api connection/new should be available throughout the test
This test has passed 100.00% of 70 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.
---
Showing 4 of 12 test results
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6 | IncompleteTests
Tests for this run (98) are below the historical average (745): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

openshift-trt-bot avatar Apr 17 '24 21:04 openshift-trt-bot

/retest

soltysh avatar Apr 19 '24 14:04 soltysh

Job Failure Risk Analysis for sha: 5b1e7b863809d50a5f62cdfccaf4cadec5ff1873

Job Name | Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial | High
[sig-api-machinery] disruption/kube-api connection/reused should be available throughout the test
This test has passed 100.00% of 64 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.
---
[sig-api-machinery] disruption/openshift-api connection/reused should be available throughout the test
This test has passed 100.00% of 64 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.
---
[sig-api-machinery] disruption/oauth-api connection/new should be available throughout the test
This test has passed 100.00% of 64 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.
---
[sig-api-machinery] disruption/cache-oauth-api connection/reused should be available throughout the test
This test has passed 100.00% of 64 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.
---
Showing 4 of 12 test results
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6 | IncompleteTests
Tests for this run (18) are below the historical average (840): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)

openshift-trt-bot avatar Apr 19 '24 18:04 openshift-trt-bot

Job Failure Risk Analysis for sha: bc71c2d79959bea1ffac25f35f6b84b61dd4f794

Job Name | Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-serial | Medium
[sig-arch] events should not repeat pathologically for ns/openshift-etcd
This test has passed 80.85% of 47 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node-serial'] in the last 14 days.

openshift-trt-bot avatar Apr 29 '24 11:04 openshift-trt-bot

/test e2e-aws-etcd-recovery

tjungblu avatar May 22 '24 09:05 tjungblu

Job Failure Risk Analysis for sha: eb745771fe0e97ecfc88a7d4fc16f8351132012e

Job Name | Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade | High
[sig-apps] job-upgrade
This test has passed 100.00% of 272 runs on jobs ['periodic-ci-openshift-release-master-ci-4.17-e2e-aws-ovn-upgrade'] in the last 14 days.
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade | High
[sig-arch] events should not repeat pathologically for ns/openshift-kube-apiserver-operator
This test has passed 99.63% of 3780 runs on release 4.17 [Overall] in the last week.
---
[sig-arch] events should not repeat pathologically for ns/openshift-etcd-operator
This test has passed 99.66% of 3780 runs on release 4.17 [Overall] in the last week.
---
[bz-Node Tuning Operator] clusteroperator/node-tuning should not change condition/Available
This test has passed 99.63% of 3784 runs on release 4.17 [Overall] in the last week.

openshift-trt-bot avatar May 22 '24 12:05 openshift-trt-bot

/test e2e-aws-etcd-recovery

tjungblu avatar May 22 '24 13:05 tjungblu

/test e2e-aws-etcd-recovery

tjungblu avatar May 22 '24 13:05 tjungblu

/test e2e-aws-etcd-recovery

tjungblu avatar May 22 '24 13:05 tjungblu

/test e2e-aws-etcd-recovery

tjungblu avatar May 22 '24 15:05 tjungblu

/cherry-pick release-4.16

tjungblu avatar May 22 '24 15:05 tjungblu

@tjungblu: once the present PR merges, I will cherry-pick it on top of release-4.16 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.16

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

/test e2e-aws-etcd-recovery

tjungblu avatar May 22 '24 15:05 tjungblu

Job Failure Risk Analysis for sha: 19dbf04142214a0351460392f4941ae43dbdff30

Job Name | Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade | High
[sig-scheduling][Early] The openshift-console console pods [apigroup:console.openshift.io] should be scheduled on different nodes [Suite:openshift/conformance/parallel]
This test has passed 99.35% of 3848 runs on release 4.17 [Overall] in the last week.
---
[sig-network-edge] Verify DNS availability during and after upgrade success
This test has passed 99.56% of 1590 runs on release 4.17 [Overall] in the last week.
---
[bz-Node Tuning Operator] clusteroperator/node-tuning should not change condition/Available
This test has passed 99.56% of 3906 runs on release 4.17 [Overall] in the last week.
pull-ci-openshift-origin-master-e2e-aws-etcd-recovery | High
[sig-arch] events should not repeat pathologically for ns/openshift-operator-lifecycle-manager
This test has passed 99.66% of 3870 runs on release 4.17 [Overall] in the last week.
---
[sig-arch] events should not repeat pathologically
This test has passed 98.80% of 83 runs on release 4.17 [amd64 aws ha ovn] in the last week.

Open Bugs
lots of churn during image registry managed/removed transition
Excessive TopologyAwareHintsDisabled events due to service/dns-default with topology aware hints activated.
[4.15] "k8s.ovn.org/node-chassis-id annotation not found" event causing CI failures

openshift-trt-bot avatar May 22 '24 18:05 openshift-trt-bot

/test e2e-aws-etcd-recovery

tjungblu avatar May 23 '24 07:05 tjungblu

While not strictly disruptive, we can put the cert rotation tests in the recovery suite for now.

The tests themselves look good to me. 👍 for covering the dynamic cert recreation.

The only question is that the test time seems surprisingly fast. [image]

Is ~2 mins all it takes for a revision rollout these days?

/approve

Holding in case @soltysh has a follow-up to his earlier review.

hasbro17 avatar May 23 '24 07:05 hasbro17

Hmm, maybe a race condition? 2m also seems too fast to me.

tjungblu avatar May 23 '24 07:05 tjungblu
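
One way a rollout wait could finish too early is sketched below. The thread does not confirm that this is what happened here. If a test only checks that every node is at the latest revision, the check passes immediately when the operator has not yet created the new revision. The type `rolloutStatus`, its fields, and the function names are hypothetical and stand in for whatever status the real test reads.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// rolloutStatus is a hypothetical snapshot of a revisioned rollout.
type rolloutStatus struct {
	LatestRevision int   // newest revision the operator has created
	NodeRevisions  []int // revision currently running on each node
}

// rolledOutPast reports whether every node runs the latest revision AND that
// revision is newer than the one observed before the change was triggered.
func rolledOutPast(s rolloutStatus, startRevision int) bool {
	if s.LatestRevision <= startRevision || len(s.NodeRevisions) == 0 {
		return false // no new revision yet: "all nodes up to date" would be a false positive
	}
	for _, r := range s.NodeRevisions {
		if r != s.LatestRevision {
			return false
		}
	}
	return true
}

// waitForRollout polls get until rolledOutPast holds or ctx is done.
func waitForRollout(ctx context.Context, interval time.Duration, startRevision int, get func() rolloutStatus) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if rolledOutPast(get(), startRevision) {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("waiting for rollout: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

func main() {
	polls := 0
	// Simulated status source: unchanged at first, then a partial rollout, then complete.
	get := func() rolloutStatus {
		polls++
		switch {
		case polls < 3:
			return rolloutStatus{LatestRevision: 7, NodeRevisions: []int{7, 7, 7}}
		case polls < 5:
			return rolloutStatus{LatestRevision: 8, NodeRevisions: []int{8, 7, 7}}
		default:
			return rolloutStatus{LatestRevision: 8, NodeRevisions: []int{8, 8, 8}}
		}
	}
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	err := waitForRollout(ctx, 10*time.Millisecond, 7, get)
	fmt.Println("error:", err, "polls:", polls)
}
```

Recording the starting revision before triggering the rotation, and requiring the rollout to move past it, is what keeps the wait from returning on the first poll.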

/hold cancel

thank you both!

tjungblu avatar May 23 '24 10:05 tjungblu

/lgtm

soltysh avatar May 23 '24 10:05 soltysh

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hasbro17, soltysh, tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] avatar May 23 '24 11:05 openshift-ci[bot]

/retest-required

Remaining retests: 0 against base HEAD f9a573a84d345a55791ff4630b1b2bc2a7233f15 and 2 for PR HEAD 57736c8b21544eb2426b2cc2afeda215e22cc92e in total

openshift-ci-robot avatar May 23 '24 11:05 openshift-ci-robot

/retest-required

tjungblu avatar May 23 '24 17:05 tjungblu

Job Failure Risk Analysis for sha: 57736c8b21544eb2426b2cc2afeda215e22cc92e

Job Name | Failure Risk
pull-ci-openshift-origin-master-e2e-gcp-ovn | Medium
[sig-storage] Multi-AZ Cluster Volumes should schedule pods in the same zones as statically provisioned PVs [Suite:openshift/conformance/parallel] [Suite:k8s]
This test has passed 91.60% of 714 runs on release 4.17 [Overall] in the last week.
---
[sig-storage] PersistentVolumes GCEPD [Feature:StorageProvider] should test that deleting a PVC before the pod does not cause pod deletion to fail on PD detach [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]
This test has passed 93.29% of 715 runs on release 4.17 [Overall] in the last week.
---
[sig-storage] PersistentVolumes GCEPD [Feature:StorageProvider] should test that deleting the PV before the pod does not cause pod deletion to fail on PD detach [Skipped:NoOptionalCapabilities] [Suite:openshift/conformance/parallel] [Suite:k8s]
This test has passed 93.42% of 714 runs on release 4.17 [Overall] in the last week.

Open Bugs
4.17 ci failures: persistentvolumes "gce-" is forbidden ... GCE PD ...disk is not found

openshift-trt-bot avatar May 23 '24 20:05 openshift-trt-bot