ETCD-683: Add etcd-backup-server DaemonSet e2e test
resolves https://issues.redhat.com/browse/ETCD-683
cc @openshift/openshift-team-etcd
@Elbehery: This pull request references ETCD-683 which is a valid jira issue.
In response to this:
resolves https://issues.redhat.com/browse/ETCD-683
cc @openshift/openshift-team-etcd
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Elbehery
Once this PR has been reviewed and has the lgtm label, please assign dennisperiquet for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
/payload-job periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview
@Elbehery: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
- periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7a823ff0-88d6-11ef-9b8b-6a69b0c897db-0
I have been trying to run this against an OCP 4.18 cluster with https://github.com/openshift/cluster-etcd-operator/pull/1354 applied on top.
The test does not execute; it does not even apply the Backup CR.
Below are the logs:
All monitor tests started.
started: 0/1/1 "[sig-etcd][OCPFeatureGate:AutomatedEtcdBackup][Suite:openshift/etcd/recovery] etcd @Mustafa - is able to apply automated backup daemonSet no-config configuration [Timeout:70m][apigroup:config.openshift.io]"
I1013 09:58:28.471570 15821 reflector.go:305] Starting reflector *v1.Pod (10m0s) from k8s.io/[email protected]/tools/cache/reflector.go:243
I1013 09:58:28.471631 15821 reflector.go:341] Listing and watching *v1.Pod from k8s.io/[email protected]/tools/cache/reflector.go:243
I1013 09:58:28.677468 15821 reflector.go:368] Caches populated for *v1.Pod from k8s.io/[email protected]/tools/cache/reflector.go:243
passed: (1m23s) 2024-10-13T07:59:51 "[sig-etcd][OCPFeatureGate:AutomatedEtcdBackup][Suite:openshift/etcd/recovery] etcd @Mustafa - is able to apply automated backup daemonSet no-config configuration [Timeout:70m][apigroup:config.openshift.io]"
Shutting down the monitor
Collecting data.
This test is almost identical to https://github.com/openshift/origin/blob/master/test/extended/etcd/etcd_backup_noconfig.go
I tried running https://github.com/openshift/origin/blob/master/test/extended/etcd/etcd_backup_noconfig.go and it works.
However, with this new test, the same steps are not being executed by openshift-tests.
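For illustration only, here is a minimal sketch of the step that appears to be skipped: applying the cluster-scoped Backup CR the way etcd_backup_noconfig.go does, using the dynamic client. This is not the actual test code; the GVR, the spec field names, and the schedule below are assumptions based on the config.openshift.io/v1alpha1 Backup API behind the AutomatedEtcdBackup feature gate.

```go
// Sketch only: apply a "no-config" style Backup CR (no PVC), so the operator
// falls back to the node-local default backup path on each master node.
// GVR and spec fields are assumptions, not copied from the test.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Assumed GVR for the automated-backup config CR.
	backupGVR := schema.GroupVersionResource{
		Group:    "config.openshift.io",
		Version:  "v1alpha1",
		Resource: "backups",
	}

	// Field names and values are illustrative assumptions.
	backup := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "config.openshift.io/v1alpha1",
		"kind":       "Backup",
		"metadata":   map[string]interface{}{"name": "default"},
		"spec": map[string]interface{}{
			"etcd": map[string]interface{}{
				"schedule": "*/1 * * * *",
				"timeZone": "UTC",
				"retentionPolicy": map[string]interface{}{
					"retentionType":   "RetentionNumber",
					"retentionNumber": map[string]interface{}{"maxNumberOfBackups": 3},
				},
			},
		},
	}}

	created, err := client.Resource(backupGVR).Create(context.TODO(), backup, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("applied Backup CR %q\n", created.GetName())
}
```

If openshift-tests never reaches a step like this, no backup is ever taken, which could explain why the run above "passes" in under two minutes.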
Job Failure Risk Analysis for sha: c71be68efa5bdd93fa0b0f3ca71a3656d87cd048
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout | Low [Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io] This test has passed 41.67% of 12 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-kube-apiserver-rollout'] in the last 14 days. |
Tested against OCP 4.18.0-0.ci-2024-10-13-203438 with openshift/cluster-etcd-operator#1354 applied on top.
The backups are identical on all master nodes after the test finished successfully, and the test cleanup was successful as well.
melbeher@melbeher-mac origin % oc debug node/ip-10-0-90-242.ec2.internal
Starting pod/ip-10-0-90-242ec2internal-debug-6ncp7 ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.90.242
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# ls -l /var/lib/etcd-auto-backup/
total 0
drwxr-xr-x. 2 root root 96 Oct 14 10:36 2024-10-14_103600
drwxr-xr-x. 2 root root 96 Oct 14 10:37 2024-10-14_103700
drwxr-xr-x. 2 root root 96 Oct 14 10:38 2024-10-14_103800
melbeher@melbeher-mac origin % oc debug node/ip-10-0-100-248.ec2.internal
Starting pod/ip-10-0-100-248ec2internal-debug-btnhn ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.100.248
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# ls -l /var/lib/etcd-auto-backup/
total 0
drwxr-xr-x. 2 root root 96 Oct 14 10:36 2024-10-14_103600
drwxr-xr-x. 2 root root 96 Oct 14 10:37 2024-10-14_103700
drwxr-xr-x. 2 root root 96 Oct 14 10:38 2024-10-14_103800
melbeher@melbeher-mac origin % oc debug node/ip-10-0-27-149.ec2.internal
Starting pod/ip-10-0-27-149ec2internal-debug-nkfss ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.27.149
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1#
sh-5.1# ls -l /var/lib/etcd-auto-backup/
total 0
drwxr-xr-x. 2 root root 96 Oct 14 10:36 2024-10-14_103600
drwxr-xr-x. 2 root root 96 Oct 14 10:37 2024-10-14_103700
drwxr-xr-x. 2 root root 96 Oct 14 10:38 2024-10-14_103800
oc get all -n openshift-etcd
Warning: apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+
NAME READY STATUS RESTARTS AGE
pod/etcd-guard-ip-10-0-100-248.ec2.internal 1/1 Running 0 109m
pod/etcd-guard-ip-10-0-27-149.ec2.internal 1/1 Running 0 120m
pod/etcd-guard-ip-10-0-90-242.ec2.internal 1/1 Running 0 115m
pod/etcd-ip-10-0-100-248.ec2.internal 6/6 Running 6 145m
pod/etcd-ip-10-0-27-149.ec2.internal 6/6 Running 6 144m
pod/etcd-ip-10-0-90-242.ec2.internal 6/6 Running 6 142m
pod/revision-pruner-7-ip-10-0-100-248.ec2.internal 0/1 Completed 0 111m
pod/revision-pruner-7-ip-10-0-27-149.ec2.internal 0/1 Completed 0 121m
pod/revision-pruner-7-ip-10-0-90-242.ec2.internal 0/1 Completed 0 116m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/etcd ClusterIP 172.30.173.110 <none> 2379/TCP,9979/TCP 156m
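As a rough sketch only (not part of the PR), the manual check above could be automated by looping over the control-plane nodes and comparing the listings of the default /var/lib/etcd-auto-backup path. The helper below is hypothetical; it simply mirrors the `oc debug` commands used above.

```go
// Hypothetical helper: verify the backup directory listings are identical on
// every control-plane node, the same way it was checked by hand above.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// Get the control-plane node names.
	out, err := exec.Command("oc", "get", "nodes",
		"-l", "node-role.kubernetes.io/master",
		"-o", "jsonpath={.items[*].metadata.name}").Output()
	if err != nil {
		panic(err)
	}
	nodes := strings.Fields(string(out))

	var reference string
	for i, node := range nodes {
		// chroot into the host and list the default backup directory.
		listing, err := exec.Command("oc", "debug", "node/"+node, "--",
			"chroot", "/host", "ls", "/var/lib/etcd-auto-backup").Output()
		if err != nil {
			panic(err)
		}
		if i == 0 {
			reference = string(listing)
			continue
		}
		if string(listing) != reference {
			fmt.Printf("backup listings differ on %s:\n%s", node, listing)
			return
		}
	}
	fmt.Println("backup directories are identical on all control-plane nodes")
}
```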
/payload-job periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview
@Elbehery: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
- periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/92507070-8a1b-11ef-84e6-247f6767bad0-0
/label tide/merge-method-squash
/retest-required
/retest-required
/retest-required
Job Failure Risk Analysis for sha: ee11f2a46f4de693524ee2faa2e1b4931dfad9d4
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout | IncompleteTests Tests for this run (101) are below the historical average (1195): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems) |
| pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout | Low [Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io] This test has passed 33.33% of 12 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-kube-apiserver-rollout'] in the last 14 days. |
Job Failure Risk Analysis for sha: 19b1a75dd763bbda6ac730db1374b971949d6e9a
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade | High [sig-arch] events should not repeat pathologically for ns/openshift-machine-config-operator This test has passed 99.53% of 212 runs on release 4.18 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week. |
/retest-required
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
/remove-lifecycle stale
Job Failure Risk Analysis for sha: 04b9e3a7881916eb071550b5cf33c132ce5b7d72
| Job Name | Failure Risk |
|---|---|
| pull-ci-openshift-origin-master-e2e-aws-ovn-single-node | High [sig-arch] events should not repeat pathologically for ns/openshift-kube-apiserver This test has passed 100.00% of 51 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-single-node'] in the last 14 days. |
@Elbehery: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-aws-ovn-ipsec-serial | ee11f2a46f4de693524ee2faa2e1b4931dfad9d4 | link | false | /test e2e-aws-ovn-ipsec-serial |
| ci/prow/e2e-aws-ovn-single-node | 04b9e3a7881916eb071550b5cf33c132ce5b7d72 | link | false | /test e2e-aws-ovn-single-node |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | 04b9e3a7881916eb071550b5cf33c132ce5b7d72 | link | true | /test e2e-metal-ipi-ovn-ipv6 |
| ci/prow/e2e-vsphere-ovn-upi | 04b9e3a7881916eb071550b5cf33c132ce5b7d72 | link | true | /test e2e-vsphere-ovn-upi |
| ci/prow/e2e-aws-ovn-serial-2of2 | 04b9e3a7881916eb071550b5cf33c132ce5b7d72 | link | true | /test e2e-aws-ovn-serial-2of2 |
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
@openshift-bot: Closed this PR.
In response to this:
Rotten issues close after 30d of inactivity.
Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.