ETCD-683: Add etcd-backup-server DaemonSet e2e test
resolves https://issues.redhat.com/browse/ETCD-683
cc @openshift/openshift-team-etcd
@Elbehery: This pull request references ETCD-683 which is a valid jira issue.
In response to this:
resolves https://issues.redhat.com/browse/ETCD-683
cc @openshift/openshift-team-etcd
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Elbehery
Once this PR has been reviewed and has the lgtm label, please assign dennisperiquet for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/payload-job periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview
@Elbehery: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
- periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7a823ff0-88d6-11ef-9b8b-6a69b0c897db-0
I have been trying to run this against an OCP 4.18 cluster with https://github.com/openshift/cluster-etcd-operator/pull/1354 applied on top.
The test does not execute; it does not even apply the Backup CR.
Below are the logs:
All monitor tests started.
started: 0/1/1 "[sig-etcd][OCPFeatureGate:AutomatedEtcdBackup][Suite:openshift/etcd/recovery] etcd @Mustafa - is able to apply automated backup daemonSet no-config configuration [Timeout:70m][apigroup:config.openshift.io]"
I1013 09:58:28.471570 15821 reflector.go:305] Starting reflector *v1.Pod (10m0s) from k8s.io/[email protected]/tools/cache/reflector.go:243
I1013 09:58:28.471631 15821 reflector.go:341] Listing and watching *v1.Pod from k8s.io/[email protected]/tools/cache/reflector.go:243
I1013 09:58:28.677468 15821 reflector.go:368] Caches populated for *v1.Pod from k8s.io/[email protected]/tools/cache/reflector.go:243
passed: (1m23s) 2024-10-13T07:59:51 "[sig-etcd][OCPFeatureGate:AutomatedEtcdBackup][Suite:openshift/etcd/recovery] etcd @Mustafa - is able to apply automated backup daemonSet no-config configuration [Timeout:70m][apigroup:config.openshift.io]"
Shutting down the monitor
Collecting data.
This test is almost identical to https://github.com/openshift/origin/blob/master/test/extended/etcd/etcd_backup_noconfig.go
I tried running https://github.com/openshift/origin/blob/master/test/extended/etcd/etcd_backup_noconfig.go and it works.
However, with this new test, the same steps are not being executed by openshift-tests.
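For illustration only, here is a minimal sketch of the step that appears to be skipped: applying the cluster-scoped Backup CR the way etcd_backup_noconfig.go does, using the dynamic client. This is not the actual test code; the GVR, the spec field names, and the schedule below are assumptions based on the config.openshift.io/v1alpha1 Backup API behind the AutomatedEtcdBackup feature gate.

```go
// Sketch only: apply a "no-config" style Backup CR (no PVC), so the operator
// falls back to the node-local default backup path on each master node.
// GVR and spec fields are assumptions, not copied from the test.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Assumed GVR for the automated-backup config CR.
	backupGVR := schema.GroupVersionResource{
		Group:    "config.openshift.io",
		Version:  "v1alpha1",
		Resource: "backups",
	}

	// Field names and values are illustrative assumptions.
	backup := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "config.openshift.io/v1alpha1",
		"kind":       "Backup",
		"metadata":   map[string]interface{}{"name": "default"},
		"spec": map[string]interface{}{
			"etcd": map[string]interface{}{
				"schedule": "*/1 * * * *",
				"timeZone": "UTC",
				"retentionPolicy": map[string]interface{}{
					"retentionType":   "RetentionNumber",
					"retentionNumber": map[string]interface{}{"maxNumberOfBackups": 3},
				},
			},
		},
	}}

	created, err := client.Resource(backupGVR).Create(context.TODO(), backup, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("applied Backup CR %q\n", created.GetName())
}
```

If openshift-tests never reaches a step like this, no backup is ever taken, which could explain why the run above "passes" in under two minutes.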
Job Failure Risk Analysis for sha: c71be68efa5bdd93fa0b0f3ca71a3656d87cd048
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout | Low [Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io] This test has passed 41.67% of 12 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-kube-apiserver-rollout'] in the last 14 days. |
Tested against OCP 4.18.0-0.ci-2024-10-13-203438 with openshift/cluster-etcd-operator#1354 applied on top.
The backups are identical on all master nodes after the test finished successfully, and the test cleanup was successful as well.
melbeher@melbeher-mac origin % oc debug node/ip-10-0-90-242.ec2.internal
Starting pod/ip-10-0-90-242ec2internal-debug-6ncp7 ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.90.242
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# ls -l /var/lib/etcd-auto-backup/
total 0
drwxr-xr-x. 2 root root 96 Oct 14 10:36 2024-10-14_103600
drwxr-xr-x. 2 root root 96 Oct 14 10:37 2024-10-14_103700
drwxr-xr-x. 2 root root 96 Oct 14 10:38 2024-10-14_103800
melbeher@melbeher-mac origin % oc debug node/ip-10-0-100-248.ec2.internal
Starting pod/ip-10-0-100-248ec2internal-debug-btnhn ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.100.248
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# ls -l /var/lib/etcd-auto-backup/
total 0
drwxr-xr-x. 2 root root 96 Oct 14 10:36 2024-10-14_103600
drwxr-xr-x. 2 root root 96 Oct 14 10:37 2024-10-14_103700
drwxr-xr-x. 2 root root 96 Oct 14 10:38 2024-10-14_103800
melbeher@melbeher-mac origin % oc debug node/ip-10-0-27-149.ec2.internal
Starting pod/ip-10-0-27-149ec2internal-debug-nkfss ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.27.149
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1#
sh-5.1# ls -l /var/lib/etcd-auto-backup/
total 0
drwxr-xr-x. 2 root root 96 Oct 14 10:36 2024-10-14_103600
drwxr-xr-x. 2 root root 96 Oct 14 10:37 2024-10-14_103700
drwxr-xr-x. 2 root root 96 Oct 14 10:38 2024-10-14_103800
oc get all -n openshift-etcd
Warning: apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+
NAME READY STATUS RESTARTS AGE
pod/etcd-guard-ip-10-0-100-248.ec2.internal 1/1 Running 0 109m
pod/etcd-guard-ip-10-0-27-149.ec2.internal 1/1 Running 0 120m
pod/etcd-guard-ip-10-0-90-242.ec2.internal 1/1 Running 0 115m
pod/etcd-ip-10-0-100-248.ec2.internal 6/6 Running 6 145m
pod/etcd-ip-10-0-27-149.ec2.internal 6/6 Running 6 144m
pod/etcd-ip-10-0-90-242.ec2.internal 6/6 Running 6 142m
pod/revision-pruner-7-ip-10-0-100-248.ec2.internal 0/1 Completed 0 111m
pod/revision-pruner-7-ip-10-0-27-149.ec2.internal 0/1 Completed 0 121m
pod/revision-pruner-7-ip-10-0-90-242.ec2.internal 0/1 Completed 0 116m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/etcd ClusterIP 172.30.173.110 <none> 2379/TCP,9979/TCP 156m
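As a rough sketch only (not part of the PR), the manual check above could be automated by looping over the control-plane nodes and comparing the listings of the default /var/lib/etcd-auto-backup path. The helper below is hypothetical; it simply mirrors the `oc debug` commands used above.

```go
// Hypothetical helper: verify the backup directory listings are identical on
// every control-plane node, the same way it was checked by hand above.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// Get the control-plane node names.
	out, err := exec.Command("oc", "get", "nodes",
		"-l", "node-role.kubernetes.io/master",
		"-o", "jsonpath={.items[*].metadata.name}").Output()
	if err != nil {
		panic(err)
	}
	nodes := strings.Fields(string(out))

	var reference string
	for i, node := range nodes {
		// chroot into the host and list the default backup directory.
		listing, err := exec.Command("oc", "debug", "node/"+node, "--",
			"chroot", "/host", "ls", "/var/lib/etcd-auto-backup").Output()
		if err != nil {
			panic(err)
		}
		if i == 0 {
			reference = string(listing)
			continue
		}
		if string(listing) != reference {
			fmt.Printf("backup listings differ on %s:\n%s", node, listing)
			return
		}
	}
	fmt.Println("backup directories are identical on all control-plane nodes")
}
```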
/payload-job periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview
@Elbehery: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
- periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/92507070-8a1b-11ef-84e6-247f6767bad0-0
/label tide/merge-method-squash
/retest-required
/retest-required
/retest-required
Job Failure Risk Analysis for sha: ee11f2a46f4de693524ee2faa2e1b4931dfad9d4
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout | IncompleteTests Tests for this run (101) are below the historical average (1195): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems) |
| pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout | Low [Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io] This test has passed 33.33% of 12 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-kube-apiserver-rollout'] in the last 14 days. |
Job Failure Risk Analysis for sha: 19b1a75dd763bbda6ac730db1374b971949d6e9a
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade | High [sig-arch] events should not repeat pathologically for ns/openshift-machine-config-operator This test has passed 99.53% of 212 runs on release 4.18 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week. |
/retest-required
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
/remove-lifecycle stale
Job Failure Risk Analysis for sha: 04b9e3a7881916eb071550b5cf33c132ce5b7d72
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-master-e2e-aws-ovn-single-node | High [sig-arch] events should not repeat pathologically for ns/openshift-kube-apiserver This test has passed 100.00% of 51 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-single-node'] in the last 14 days. |
@Elbehery: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-aws-ovn-ipsec-serial | ee11f2a46f4de693524ee2faa2e1b4931dfad9d4 | link | false | /test e2e-aws-ovn-ipsec-serial |
| ci/prow/e2e-aws-ovn-single-node | 04b9e3a7881916eb071550b5cf33c132ce5b7d72 | link | false | /test e2e-aws-ovn-single-node |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | 04b9e3a7881916eb071550b5cf33c132ce5b7d72 | link | true | /test e2e-metal-ipi-ovn-ipv6 |
| ci/prow/e2e-vsphere-ovn-upi | 04b9e3a7881916eb071550b5cf33c132ce5b7d72 | link | true | /test e2e-vsphere-ovn-upi |
| ci/prow/e2e-aws-ovn-serial-2of2 | 04b9e3a7881916eb071550b5cf33c132ce5b7d72 | link | true | /test e2e-aws-ovn-serial-2of2 |
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@openshift-bot: Closed this PR.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.