ETCD-683: Add etcd-backup-server with DaemonSet e2e test

Open Elbehery opened this issue 1 year ago • 17 comments

resolves https://issues.redhat.com/browse/ETCD-683

cc @openshift/openshift-team-etcd

Elbehery avatar Oct 12 '24 20:10 Elbehery

@Elbehery: This pull request references ETCD-683 which is a valid jira issue.

In response to this:

resolves https://issues.redhat.com/browse/ETCD-683

cc @openshift/openshift-team-etcd

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot avatar Oct 12 '24 20:10 openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Elbehery

Once this PR has been reviewed and has the lgtm label, please assign dennisperiquet for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

  • Approvers can indicate their approval by writing /approve in a comment
  • Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci[bot] avatar Oct 12 '24 20:10 openshift-ci[bot]

/payload-job periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview

Elbehery avatar Oct 12 '24 20:10 Elbehery

@Elbehery: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7a823ff0-88d6-11ef-9b8b-6a69b0c897db-0

openshift-ci[bot] avatar Oct 12 '24 20:10 openshift-ci[bot]

I have been trying to run this against an OCP 4.18 cluster with https://github.com/openshift/cluster-etcd-operator/pull/1354 applied on top.

The test body does not execute; it does not even apply the Backup CR!

Below are the logs:

All monitor tests started.
started: 0/1/1 "[sig-etcd][OCPFeatureGate:AutomatedEtcdBackup][Suite:openshift/etcd/recovery] etcd @Mustafa - is able to apply automated backup daemonSet no-config configuration [Timeout:70m][apigroup:config.openshift.io]"

  I1013 09:58:28.471570   15821 reflector.go:305] Starting reflector *v1.Pod (10m0s) from k8s.io/[email protected]/tools/cache/reflector.go:243
  I1013 09:58:28.471631   15821 reflector.go:341] Listing and watching *v1.Pod from k8s.io/[email protected]/tools/cache/reflector.go:243
  I1013 09:58:28.677468   15821 reflector.go:368] Caches populated for *v1.Pod from k8s.io/[email protected]/tools/cache/reflector.go:243
passed: (1m23s) 2024-10-13T07:59:51 "[sig-etcd][OCPFeatureGate:AutomatedEtcdBackup][Suite:openshift/etcd/recovery] etcd @Mustafa - is able to apply automated backup daemonSet no-config configuration [Timeout:70m][apigroup:config.openshift.io]"

Shutting down the monitor
Collecting data.

This test is almost identical to https://github.com/openshift/origin/blob/master/test/extended/etcd/etcd_backup_noconfig.go

I tried running https://github.com/openshift/origin/blob/master/test/extended/etcd/etcd_backup_noconfig.go and it works.

However, in the current test, the same steps are not being executed by openshift-tests.

Elbehery avatar Oct 13 '24 08:10 Elbehery

Job Failure Risk Analysis for sha: c71be68efa5bdd93fa0b0f3ca71a3656d87cd048

Job Name                                                            Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout  Low
  [Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io]
  This test has passed 41.67% of 12 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-kube-apiserver-rollout'] in the last 14 days.

openshift-trt-bot avatar Oct 13 '24 12:10 openshift-trt-bot

Tested against OCP 4.18.0-0.ci-2024-10-13-203438 with openshift/cluster-etcd-operator#1354 applied on top.

After the test finished successfully, the backups are identical on all master nodes. The test cleanup was also successful.


melbeher@melbeher-mac origin % oc debug node/ip-10-0-90-242.ec2.internal               etcd_noconfig_backup_daemonset_e2e_test
Starting pod/ip-10-0-90-242ec2internal-debug-6ncp7 ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.90.242
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# ls -l /var/lib/etcd-auto-backup/
total 0
drwxr-xr-x. 2 root root 96 Oct 14 10:36 2024-10-14_103600
drwxr-xr-x. 2 root root 96 Oct 14 10:37 2024-10-14_103700
drwxr-xr-x. 2 root root 96 Oct 14 10:38 2024-10-14_103800

melbeher@melbeher-mac origin % oc debug node/ip-10-0-100-248.ec2.internal              etcd_noconfig_backup_daemonset_e2e_test
Starting pod/ip-10-0-100-248ec2internal-debug-btnhn ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.100.248
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# ls -l /var/lib/etcd-auto-backup/
total 0
drwxr-xr-x. 2 root root 96 Oct 14 10:36 2024-10-14_103600
drwxr-xr-x. 2 root root 96 Oct 14 10:37 2024-10-14_103700
drwxr-xr-x. 2 root root 96 Oct 14 10:38 2024-10-14_103800

melbeher@melbeher-mac origin % oc debug node/ip-10-0-27-149.ec2.internal               etcd_noconfig_backup_daemonset_e2e_test
Starting pod/ip-10-0-27-149ec2internal-debug-nkfss ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.27.149
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# 
sh-5.1# ls -l /var/lib/etcd-auto-backup/
total 0
drwxr-xr-x. 2 root root 96 Oct 14 10:36 2024-10-14_103600
drwxr-xr-x. 2 root root 96 Oct 14 10:37 2024-10-14_103700
drwxr-xr-x. 2 root root 96 Oct 14 10:38 2024-10-14_103800

oc get all -n openshift-etcd                       
Warning: apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+
NAME                                                 READY   STATUS      RESTARTS   AGE
pod/etcd-guard-ip-10-0-100-248.ec2.internal          1/1     Running     0          109m
pod/etcd-guard-ip-10-0-27-149.ec2.internal           1/1     Running     0          120m
pod/etcd-guard-ip-10-0-90-242.ec2.internal           1/1     Running     0          115m
pod/etcd-ip-10-0-100-248.ec2.internal                6/6     Running     6          145m
pod/etcd-ip-10-0-27-149.ec2.internal                 6/6     Running     6          144m
pod/etcd-ip-10-0-90-242.ec2.internal                 6/6     Running     6          142m
pod/revision-pruner-7-ip-10-0-100-248.ec2.internal   0/1     Completed   0          111m
pod/revision-pruner-7-ip-10-0-27-149.ec2.internal    0/1     Completed   0          121m
pod/revision-pruner-7-ip-10-0-90-242.ec2.internal    0/1     Completed   0          116m

NAME           TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/etcd   ClusterIP   172.30.173.110   <none>        2379/TCP,9979/TCP   156m

Elbehery avatar Oct 14 '24 10:10 Elbehery

/payload-job periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview

Elbehery avatar Oct 14 '24 11:10 Elbehery

@Elbehery: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/92507070-8a1b-11ef-84e6-247f6767bad0-0

openshift-ci[bot] avatar Oct 14 '24 11:10 openshift-ci[bot]

/label tide/merge-method-squash

Elbehery avatar Oct 14 '24 11:10 Elbehery

/retest-required

Elbehery avatar Oct 14 '24 17:10 Elbehery

/retest-required

Elbehery avatar Oct 14 '24 20:10 Elbehery

/retest-required

Elbehery avatar Oct 14 '24 20:10 Elbehery

Job Failure Risk Analysis for sha: ee11f2a46f4de693524ee2faa2e1b4931dfad9d4

Job Name                                                                  Failure Risk
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout  IncompleteTests
  Tests for this run (101) are below the historical average (1195): IncompleteTests (not enough tests ran to make a reasonable risk analysis; this could be due to infra, installation, or upgrade problems)
pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout        Low
  [Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io]
  This test has passed 33.33% of 12 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-kube-apiserver-rollout'] in the last 14 days.

openshift-trt-bot avatar Oct 14 '24 20:10 openshift-trt-bot

Job Failure Risk Analysis for sha: 19b1a75dd763bbda6ac730db1374b971949d6e9a

Job Name                                                         Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade  High
  [sig-arch] events should not repeat pathologically for ns/openshift-machine-config-operator
  This test has passed 99.53% of 212 runs on release 4.18 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.

openshift-trt-bot avatar Oct 16 '24 10:10 openshift-trt-bot

/retest-required

Elbehery avatar Oct 22 '24 14:10 Elbehery

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot avatar Jan 21 '25 01:01 openshift-bot

/remove-lifecycle stale

Elbehery avatar Feb 11 '25 15:02 Elbehery

Job Failure Risk Analysis for sha: 04b9e3a7881916eb071550b5cf33c132ce5b7d72

Job Name                                                 Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node  High
  [sig-arch] events should not repeat pathologically for ns/openshift-kube-apiserver
  This test has passed 100.00% of 51 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-single-node'] in the last 14 days.

openshift-trt[bot] avatar Feb 12 '25 23:02 openshift-trt[bot]

@Elbehery: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                         Commit                                    Details  Required  Rerun command
ci/prow/e2e-aws-ovn-ipsec-serial  ee11f2a46f4de693524ee2faa2e1b4931dfad9d4  link     false     /test e2e-aws-ovn-ipsec-serial
ci/prow/e2e-aws-ovn-single-node   04b9e3a7881916eb071550b5cf33c132ce5b7d72  link     false     /test e2e-aws-ovn-single-node
ci/prow/e2e-metal-ipi-ovn-ipv6    04b9e3a7881916eb071550b5cf33c132ce5b7d72  link     true      /test e2e-metal-ipi-ovn-ipv6
ci/prow/e2e-vsphere-ovn-upi       04b9e3a7881916eb071550b5cf33c132ce5b7d72  link     true      /test e2e-vsphere-ovn-upi
ci/prow/e2e-aws-ovn-serial-2of2   04b9e3a7881916eb071550b5cf33c132ce5b7d72  link     true      /test e2e-aws-ovn-serial-2of2
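
The difference between /retest and /retest-required comes down to the Required column: only rows marked true are mandatory. As an illustration only, the Go sketch below filters the rows above down to the rerun commands for required jobs. It assumes whitespace-separated rows in the column order shown; `FailedTest`, `ParseRow`, and `RequiredReruns` are hypothetical names, and this is not how Prow itself implements the command.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// FailedTest is one row of the failure table.
type FailedTest struct {
	Name     string
	Commit   string
	Required bool
	Rerun    string
}

// ParseRow splits a whitespace-separated row of the form:
//   <test name> <commit> <details> <required> <rerun command...>
func ParseRow(row string) (FailedTest, error) {
	f := strings.Fields(row)
	if len(f) < 5 {
		return FailedTest{}, fmt.Errorf("expected at least 5 fields, got %d", len(f))
	}
	req, err := strconv.ParseBool(f[3])
	if err != nil {
		return FailedTest{}, fmt.Errorf("bad Required value %q: %w", f[3], err)
	}
	return FailedTest{
		Name:     f[0],
		Commit:   f[1],
		Required: req,
		Rerun:    strings.Join(f[4:], " "), // the rerun command may contain spaces
	}, nil
}

// RequiredReruns returns the rerun commands of the required failures only.
func RequiredReruns(rows []string) ([]string, error) {
	var cmds []string
	for _, row := range rows {
		t, err := ParseRow(row)
		if err != nil {
			return nil, err
		}
		if t.Required {
			cmds = append(cmds, t.Rerun)
		}
	}
	return cmds, nil
}

func main() {
	rows := []string{
		"ci/prow/e2e-aws-ovn-ipsec-serial ee11f2a46f4de693524ee2faa2e1b4931dfad9d4 link false /test e2e-aws-ovn-ipsec-serial",
		"ci/prow/e2e-aws-ovn-single-node 04b9e3a7881916eb071550b5cf33c132ce5b7d72 link false /test e2e-aws-ovn-single-node",
		"ci/prow/e2e-metal-ipi-ovn-ipv6 04b9e3a7881916eb071550b5cf33c132ce5b7d72 link true /test e2e-metal-ipi-ovn-ipv6",
		"ci/prow/e2e-vsphere-ovn-upi 04b9e3a7881916eb071550b5cf33c132ce5b7d72 link true /test e2e-vsphere-ovn-upi",
		"ci/prow/e2e-aws-ovn-serial-2of2 04b9e3a7881916eb071550b5cf33c132ce5b7d72 link true /test e2e-aws-ovn-serial-2of2",
	}
	cmds, err := RequiredReruns(rows)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, c := range cmds {
		fmt.Println(c)
	}
}
```

For the five rows in this comment, the sketch keeps the three jobs marked true (e2e-metal-ipi-ovn-ipv6, e2e-vsphere-ovn-upi, e2e-aws-ovn-serial-2of2) and drops the two optional ones.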

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci[bot] avatar May 02 '25 23:05 openshift-ci[bot]

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot avatar Aug 01 '25 09:08 openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot avatar Sep 01 '25 00:09 openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-bot avatar Oct 01 '25 08:10 openshift-bot

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-merge-robot avatar Oct 01 '25 08:10 openshift-merge-robot

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-ci[bot] avatar Oct 01 '25 08:10 openshift-ci[bot]