coreos/kdump: Add kdump e2e test using mco

Open gursewak1997 opened this issue 3 years ago • 22 comments

Add an e2e test for OCP CI that validates enabling kdump and generating a kernel core via a machine config. This is one of the steps to enhance the kdump feature.
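For readers unfamiliar with the approach, here is a minimal sketch of the kind of machine config such a test might apply: it reserves crash-kernel memory with a kernel argument and enables `kdump.service`. The struct types, the object name `99-worker-kdump`, the `crashkernel=256M` value, and the Ignition version are assumptions made for illustration; they are not taken from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical, trimmed-down stand-ins for the MachineConfig schema. A real
// test would use the types from the machine-config-operator API instead.
type systemdUnit struct {
	Name    string `json:"name"`
	Enabled bool   `json:"enabled"`
}

type ignitionConfig struct {
	Ignition map[string]string        `json:"ignition"`
	Systemd  map[string][]systemdUnit `json:"systemd"`
}

type machineConfigSpec struct {
	KernelArguments []string       `json:"kernelArguments"`
	Config          ignitionConfig `json:"config"`
}

type objectMeta struct {
	Name   string            `json:"name"`
	Labels map[string]string `json:"labels"`
}

type machineConfig struct {
	APIVersion string            `json:"apiVersion"`
	Kind       string            `json:"kind"`
	Metadata   objectMeta        `json:"metadata"`
	Spec       machineConfigSpec `json:"spec"`
}

// buildKdumpMachineConfig reserves crash-kernel memory through a kernel
// argument and enables kdump.service for the given pool role.
func buildKdumpMachineConfig(role, crashkernel string) machineConfig {
	return machineConfig{
		APIVersion: "machineconfiguration.openshift.io/v1",
		Kind:       "MachineConfig",
		Metadata: objectMeta{
			Name:   "99-" + role + "-kdump", // assumed name, for illustration only
			Labels: map[string]string{"machineconfiguration.openshift.io/role": role},
		},
		Spec: machineConfigSpec{
			KernelArguments: []string{crashkernel},
			Config: ignitionConfig{
				Ignition: map[string]string{"version": "3.2.0"}, // assumed Ignition spec version
				Systemd:  map[string][]systemdUnit{"units": {{Name: "kdump.service", Enabled: true}}},
			},
		},
	}
}

func main() {
	out, err := json.MarshalIndent(buildKdumpMachineConfig("worker", "crashkernel=256M"), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The program only prints the object as JSON; applying it to a cluster and waiting for the pool to roll out is left to the test itself.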

gursewak1997 avatar Jul 05 '22 22:07 gursewak1997

@gursewak1997 ci/prow/verify is failing like:

FAILURE after 27.595s: hack/verify-generated.sh:13: executing '/go/src/github.com/openshift/origin/hack/update-generated.sh' expecting success: the command returned the wrong error code
There was no output from the command.
Standard error from the command:
failed: all tests must define a [sig-XXXX] tag or have a rule "[Top Level] kdump TestKdump"
exit status 1
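To make the rule in that error concrete, here is a rough sketch of a check that a test name carries a `[sig-XXXX]` tag. The regular expression and the fixed name `[sig-coreos] kdump TestKdump` are assumptions for illustration; this is not the actual logic behind `hack/update-generated.sh`.

```go
package main

import (
	"fmt"
	"regexp"
)

// sigTag matches a bracketed sig label such as [sig-coreos].
// The exact pattern is an assumption, not the rule origin enforces.
var sigTag = regexp.MustCompile(`\[sig-[a-z0-9-]+\]`)

// hasSigTag reports whether a full test name contains a [sig-XXXX] tag.
func hasSigTag(name string) bool {
	return sigTag.MatchString(name)
}

func main() {
	names := []string{
		"[Top Level] kdump TestKdump",  // the name from the failure above
		"[sig-coreos] kdump TestKdump", // hypothetical fixed name
	}
	for _, name := range names {
		fmt.Printf("%q has sig tag: %v\n", name, hasSigTag(name))
	}
}
```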

miabbott avatar Jul 07 '22 15:07 miabbott

failed: all tests must define a [sig-XXXX] tag or have a rule "[Top Level] kdump TestKdump"
exit status 1

Yup I am going over the doc to add the relevant tags before I re-commit.

gursewak1997 avatar Jul 07 '22 15:07 gursewak1997

/retest

gursewak1997 avatar Jul 12 '22 23:07 gursewak1997

The e2e-aws-single-node* tests continue to hit the "Failed while waiting on imagestream import" problem that was affecting the broader CI fleet, but since they are not required, we can merge over red here.

miabbott avatar Jul 13 '22 13:07 miabbott

Another question due to my unfamiliarity: when and where does this new test run? Will it be part of CI jobs?

I'm struggling to find evidence that this test was run in any of the CI jobs that ran against the PR. (Furthermore, I can only see the other [sig-coreos] test running as part of e2e-gcp)

miabbott avatar Jul 13 '22 13:07 miabbott

/test help

miabbott avatar Jul 13 '22 13:07 miabbott

@miabbott: The specified target(s) for /test were not found. The following commands are available to trigger required jobs:

  • /test e2e-aws-fips
  • /test e2e-aws-image-registry
  • /test e2e-aws-jenkins
  • /test e2e-aws-serial
  • /test e2e-gcp
  • /test e2e-gcp-builds
  • /test e2e-gcp-image-ecosystem
  • /test e2e-gcp-upgrade
  • /test extended_gssapi
  • /test extended_ldap_groups
  • /test extended_networking
  • /test images
  • /test lint
  • /test verify
  • /test verify-deps

The following commands are available to trigger optional jobs:

  • /test e2e-agnostic-cmd
  • /test e2e-aws
  • /test e2e-aws-cgroupsv2
  • /test e2e-aws-csi
  • /test e2e-aws-csi-migration
  • /test e2e-aws-disruptive
  • /test e2e-aws-multitenant
  • /test e2e-aws-ovn
  • /test e2e-aws-proxy
  • /test e2e-aws-single-node
  • /test e2e-aws-single-node-serial
  • /test e2e-aws-single-node-upgrade
  • /test e2e-aws-upgrade
  • /test e2e-azure
  • /test e2e-gcp-csi
  • /test e2e-gcp-disruptive
  • /test e2e-gcp-fips-serial
  • /test e2e-gcp-ovn-rt-upgrade
  • /test e2e-metal-ipi
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-ovn-ipv6
  • /test e2e-metal-ipi-serial
  • /test e2e-metal-ipi-serial-ovn-ipv6
  • /test e2e-metal-ipi-virtualmedia
  • /test e2e-openstack
  • /test e2e-openstack-serial
  • /test e2e-vsphere
  • /test okd-e2e-gcp

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-origin-master-e2e-agnostic-cmd
  • pull-ci-openshift-origin-master-e2e-aws-cgroupsv2
  • pull-ci-openshift-origin-master-e2e-aws-csi
  • pull-ci-openshift-origin-master-e2e-aws-fips
  • pull-ci-openshift-origin-master-e2e-aws-serial
  • pull-ci-openshift-origin-master-e2e-aws-single-node
  • pull-ci-openshift-origin-master-e2e-aws-single-node-upgrade
  • pull-ci-openshift-origin-master-e2e-gcp
  • pull-ci-openshift-origin-master-e2e-gcp-builds
  • pull-ci-openshift-origin-master-e2e-gcp-csi
  • pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade
  • pull-ci-openshift-origin-master-e2e-gcp-upgrade
  • pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6
  • pull-ci-openshift-origin-master-images
  • pull-ci-openshift-origin-master-lint
  • pull-ci-openshift-origin-master-verify
  • pull-ci-openshift-origin-master-verify-deps

In response to this:

/test help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci[bot] avatar Jul 13 '22 13:07 openshift-ci[bot]

Another question due to my unfamiliarity: when and where does this new test run? Will it be part of CI jobs?

I'm struggling to find evidence that this test was run in any of the CI jobs that ran against the PR. (Furthermore, I can only see the other [sig-coreos] test running as part of e2e-gcp)

I'm still working out how and when tests run myself. Ideally, the kdump test should definitely run for this PR. I did see the test run in one of the initial CI runs, where it failed. After I reran, the kdump test didn't run and the overall CI check passed.

gursewak1997 avatar Jul 13 '22 16:07 gursewak1997

/assign travier

travier avatar Jul 18 '22 15:07 travier

See e.g. https://github.com/openshift/machine-config-operator/commit/825be33519852121fc1cc94695d1a759fb7e218b which we need to copy into privileged pods now as part of a recent security policy change

cgwalters avatar Jul 19 '22 17:07 cgwalters

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gursewak1997

Once this PR has been reviewed and has the lgtm label, please assign adambkaplan for approval by writing /assign @adambkaplan in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] avatar Jul 20 '22 02:07 openshift-ci[bot]

/retest

gursewak1997 avatar Jul 21 '22 15:07 gursewak1997

/test

stbenjam avatar Aug 02 '22 18:08 stbenjam

@stbenjam: The /test command needs one or more targets. The following commands are available to trigger required jobs:

  • /test e2e-aws-fips
  • /test e2e-aws-image-registry
  • /test e2e-aws-jenkins
  • /test e2e-aws-serial
  • /test e2e-gcp
  • /test e2e-gcp-builds
  • /test e2e-gcp-image-ecosystem
  • /test e2e-gcp-upgrade
  • /test extended_gssapi
  • /test extended_ldap_groups
  • /test extended_networking
  • /test images
  • /test lint
  • /test verify
  • /test verify-deps

The following commands are available to trigger optional jobs:

  • /test e2e-agnostic-cmd
  • /test e2e-aws
  • /test e2e-aws-cgroupsv2
  • /test e2e-aws-csi
  • /test e2e-aws-csi-migration
  • /test e2e-aws-disruptive
  • /test e2e-aws-multitenant
  • /test e2e-aws-ovn
  • /test e2e-aws-proxy
  • /test e2e-aws-single-node
  • /test e2e-aws-single-node-serial
  • /test e2e-aws-single-node-upgrade
  • /test e2e-aws-upgrade
  • /test e2e-azure
  • /test e2e-gcp-csi
  • /test e2e-gcp-disruptive
  • /test e2e-gcp-fips-serial
  • /test e2e-gcp-ovn-rt-upgrade
  • /test e2e-metal-ipi
  • /test e2e-metal-ipi-ovn-dualstack
  • /test e2e-metal-ipi-ovn-ipv6
  • /test e2e-metal-ipi-serial
  • /test e2e-metal-ipi-serial-ovn-ipv6
  • /test e2e-metal-ipi-virtualmedia
  • /test e2e-openstack
  • /test e2e-openstack-serial
  • /test e2e-vsphere
  • /test okd-e2e-gcp

Use /test all to run the following jobs that were automatically triggered:

  • pull-ci-openshift-origin-master-e2e-agnostic-cmd
  • pull-ci-openshift-origin-master-e2e-aws-cgroupsv2
  • pull-ci-openshift-origin-master-e2e-aws-csi
  • pull-ci-openshift-origin-master-e2e-aws-fips
  • pull-ci-openshift-origin-master-e2e-aws-serial
  • pull-ci-openshift-origin-master-e2e-aws-single-node
  • pull-ci-openshift-origin-master-e2e-aws-single-node-upgrade
  • pull-ci-openshift-origin-master-e2e-gcp
  • pull-ci-openshift-origin-master-e2e-gcp-builds
  • pull-ci-openshift-origin-master-e2e-gcp-csi
  • pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade
  • pull-ci-openshift-origin-master-e2e-gcp-upgrade
  • pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6
  • pull-ci-openshift-origin-master-images
  • pull-ci-openshift-origin-master-lint
  • pull-ci-openshift-origin-master-verify
  • pull-ci-openshift-origin-master-verify-deps

In response to this:

/test

openshift-ci[bot] avatar Aug 02 '22 18:08 openshift-ci[bot]

We've been bitten hard by some unreliable serial tests lately, so I'd feel better if I saw this one pass a couple of times on some different configurations.

/test e2e-aws-serial
/test e2e-gcp-fips-serial
/test e2e-gcp-fips-serial
/test e2e-metal-ipi-serial
/test e2e-metal-ipi-serial-ovn-ipv6
/test e2e-openstack-serial

stbenjam avatar Aug 02 '22 18:08 stbenjam

Note that this test actively makes a node crash, so I'm not sure how we should account for that in general.

travier avatar Aug 02 '22 18:08 travier

Note that this test actively makes a node crash, so I'm not sure how we should account for that in general.

We have some synthetic tests that hunt for segfaults and Go panics, but I don't think anything is looking for kernel panics. Does the node recover? I'm wondering if it might not, because machine-config goes degraded...

See this run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27291/pull-ci-openshift-origin-master-e2e-aws-serial/1550551517881176064

{Operator degraded (RequiredPoolsFailed): Failed to resync 4.12.0-0.ci.test-2022-07-22-185749-ci-op-59ryclwt-latest because: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool worker is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)]  Operator degraded (RequiredPoolsFailed): Failed to resync 4.12.0-0.ci.test-2022-07-22-185749-ci-op-59ryclwt-latest because: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool worker is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)]}

stbenjam avatar Aug 02 '22 18:08 stbenjam

As it's a fake crash, the node should just reboot after dumping the crash dump. The unavailability should be temporary.

travier avatar Aug 03 '22 11:08 travier

See this run: prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27291/pull-ci-openshift-origin-master-e2e-aws-serial/1550551517881176064

{Operator degraded (RequiredPoolsFailed): Failed to resync 4.12.0-0.ci.test-2022-07-22-185749-ci-op-59ryclwt-latest because: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool worker is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)]  Operator degraded (RequiredPoolsFailed): Failed to resync 4.12.0-0.ci.test-2022-07-22-185749-ci-op-59ryclwt-latest because: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool worker is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)]}

But indeed these would be the symptoms if the node did not reboot for any reason.

travier avatar Aug 03 '22 11:08 travier

/test e2e-aws-disruptive

gursewak1997 avatar Aug 08 '22 02:08 gursewak1997

I updated the tags for this test to [Slow] (since the test typically took more than 5 minutes to finish) and [Disruptive] (since it includes rebooting a node). Also, since any [Disruptive] test is assumed to qualify for the [Serial] label and need not be labelled as both, as per this doc, I dropped the [Serial] label. I am not sure which job the kdump test should run in now, because I don't see it in ci/prow/e2e-aws-disruptive or ci/prow/e2e-aws-serial. On the other hand, I have also updated the test so that no nodes are left degraded after it finishes.
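The labelling convention described here (a [Disruptive] test counts as [Serial] without carrying both labels) can be sketched as follows. The helper names and the example test name are hypothetical, and this is not how the origin test suites actually select tests for a job.

```go
package main

import (
	"fmt"
	"strings"
)

// hasLabel reports whether a test name carries a bracketed label like [Slow].
func hasLabel(name, label string) bool {
	return strings.Contains(name, "["+label+"]")
}

// isSerial treats a test as serial if it is labelled [Serial] explicitly or
// if the label is implied by [Disruptive].
func isSerial(name string) bool {
	return hasLabel(name, "Serial") || hasLabel(name, "Disruptive")
}

func main() {
	// Hypothetical full test name using the tags mentioned in this comment.
	name := "[sig-coreos][Slow][Disruptive] kdump TestKdump"
	fmt.Println("slow:", hasLabel(name, "Slow"))
	fmt.Println("disruptive:", hasLabel(name, "Disruptive"))
	fmt.Println("serial (explicit or implied):", isSerial(name))
}
```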

gursewak1997 avatar Aug 08 '22 17:08 gursewak1997

@gursewak1997: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp-fips-serial | f1fd75e7c018f7e5452a9eeb20cec0d2fcc3937e | link | false | /test e2e-gcp-fips-serial |
| ci/prow/e2e-metal-ipi-serial-ovn-ipv6 | f1fd75e7c018f7e5452a9eeb20cec0d2fcc3937e | link | false | /test e2e-metal-ipi-serial-ovn-ipv6 |
| ci/prow/e2e-openstack-serial | f1fd75e7c018f7e5452a9eeb20cec0d2fcc3937e | link | false | /test e2e-openstack-serial |
| ci/prow/e2e-metal-ipi-serial | f1fd75e7c018f7e5452a9eeb20cec0d2fcc3937e | link | false | /test e2e-metal-ipi-serial |
| ci/prow/e2e-aws-single-node-upgrade | d2235fdb6f3102c467ab88832688bd72ab6dfd98 | link | false | /test e2e-aws-single-node-upgrade |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | d2235fdb6f3102c467ab88832688bd72ab6dfd98 | link | false | /test e2e-metal-ipi-ovn-ipv6 |
| ci/prow/e2e-gcp-ovn-rt-upgrade | d2235fdb6f3102c467ab88832688bd72ab6dfd98 | link | false | /test e2e-gcp-ovn-rt-upgrade |
| ci/prow/e2e-aws-serial | d2235fdb6f3102c467ab88832688bd72ab6dfd98 | link | true | /test e2e-aws-serial |
| ci/prow/e2e-aws-single-node | d2235fdb6f3102c467ab88832688bd72ab6dfd98 | link | false | /test e2e-aws-single-node |
| ci/prow/e2e-aws-disruptive | d2235fdb6f3102c467ab88832688bd72ab6dfd98 | link | false | /test e2e-aws-disruptive |
| ci/prow/e2e-gcp-ovn | d2235fdb6f3102c467ab88832688bd72ab6dfd98 | link | true | /test e2e-gcp-ovn |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci[bot] avatar Aug 31 '22 20:08 openshift-ci[bot]

@gursewak1997: PR needs rebase.

openshift-merge-robot avatar Oct 12 '22 13:10 openshift-merge-robot

@gursewak1997: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp-fips-serial | f1fd75e7c018f7e5452a9eeb20cec0d2fcc3937e | link | false | /test e2e-gcp-fips-serial |
| ci/prow/e2e-metal-ipi-serial-ovn-ipv6 | f1fd75e7c018f7e5452a9eeb20cec0d2fcc3937e | link | false | /test e2e-metal-ipi-serial-ovn-ipv6 |
| ci/prow/e2e-openstack-serial | f1fd75e7c018f7e5452a9eeb20cec0d2fcc3937e | link | false | /test e2e-openstack-serial |
| ci/prow/e2e-metal-ipi-serial | f1fd75e7c018f7e5452a9eeb20cec0d2fcc3937e | link | false | /test e2e-metal-ipi-serial |
| ci/prow/e2e-aws-single-node-upgrade | d2235fdb6f3102c467ab88832688bd72ab6dfd98 | link | false | /test e2e-aws-single-node-upgrade |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | d2235fdb6f3102c467ab88832688bd72ab6dfd98 | link | false | /test e2e-metal-ipi-ovn-ipv6 |
| ci/prow/e2e-gcp-ovn-rt-upgrade | d2235fdb6f3102c467ab88832688bd72ab6dfd98 | link | false | /test e2e-gcp-ovn-rt-upgrade |
| ci/prow/e2e-aws-serial | d2235fdb6f3102c467ab88832688bd72ab6dfd98 | link | true | /test e2e-aws-serial |
| ci/prow/e2e-aws-single-node | d2235fdb6f3102c467ab88832688bd72ab6dfd98 | link | false | /test e2e-aws-single-node |
| ci/prow/e2e-aws-disruptive | d2235fdb6f3102c467ab88832688bd72ab6dfd98 | link | false | /test e2e-aws-disruptive |
| ci/prow/e2e-gcp-ovn | d2235fdb6f3102c467ab88832688bd72ab6dfd98 | link | true | /test e2e-gcp-ovn |
| ci/prow/unit | d2235fdb6f3102c467ab88832688bd72ab6dfd98 | link | true | /test unit |
| ci/prow/e2e-gcp-ovn-image-ecosystem | d2235fdb6f3102c467ab88832688bd72ab6dfd98 | link | true | /test e2e-gcp-ovn-image-ecosystem |
| ci/prow/e2e-gcp-ovn-builds | d2235fdb6f3102c467ab88832688bd72ab6dfd98 | link | true | /test e2e-gcp-ovn-builds |
| ci/prow/e2e-aws-ovn-image-registry | d2235fdb6f3102c467ab88832688bd72ab6dfd98 | link | true | /test e2e-aws-ovn-image-registry |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci[bot] avatar Nov 04 '22 21:11 openshift-ci[bot]

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot avatar Feb 03 '23 01:02 openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-bot avatar Mar 05 '23 08:03 openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-bot avatar Apr 05 '23 00:04 openshift-bot

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

openshift-ci[bot] avatar Apr 05 '23 00:04 openshift-ci[bot]