origin coreos/kdump: Add kdump e2e test using mco

Add an e2e test for OCP CI that validates that kdump can be enabled via machine config and that a kernel core is generated successfully. This is one of the steps to enhance the kdump feature.

Jul 05 '22 22:07 gursewak1997

@gursewak1997 ci/prow/verify is failing like:

FAILURE after 27.595s: hack/verify-generated.sh:13: executing '/go/src/github.com/openshift/origin/hack/update-generated.sh' expecting success: the command returned the wrong error code
There was no output from the command.
Standard error from the command:
failed: all tests must define a [sig-XXXX] tag or have a rule "[Top Level] kdump TestKdump"
exit status 1

Jul 07 '22 15:07 miabbott

> failed: all tests must define a [sig-XXXX] tag or have a rule "[Top Level] kdump TestKdump"
> exit status 1

Yup, I am going over the doc to add the relevant tags before I re-commit.
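
For reference, the rule behind the error above can be approximated with a small check. This is only a sketch, not the actual logic in hack/update-generated.sh: the `hasSigTag` helper and its regular expression are assumptions, and the `[sig-coreos]` name in `main` is a hypothetical example of a tagged test name.

```go
package main

import (
	"fmt"
	"regexp"
)

// sigTag matches a bracketed SIG label such as [sig-coreos] anywhere in a
// test name. The exact pattern the origin tooling uses is an assumption.
var sigTag = regexp.MustCompile(`\[sig-[a-z0-9-]+\]`)

// hasSigTag reports whether the full test name carries a [sig-...] label.
func hasSigTag(name string) bool {
	return sigTag.MatchString(name)
}

func main() {
	names := []string{
		"[Top Level] kdump TestKdump",  // the name rejected in the error above
		"[sig-coreos] kdump TestKdump", // hypothetical tagged name
	}
	for _, n := range names {
		fmt.Printf("%q tagged=%v\n", n, hasSigTag(n))
	}
}
```

A name that only has the default `[Top Level]` prefix fails the check, which matches the message printed by the verify job.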

Jul 07 '22 15:07 gursewak1997

/retest

Jul 12 '22 23:07 gursewak1997

The e2e-aws-single-node* tests continue to hit the "Failed while waiting on imagestream import" problem that was affecting the broader CI fleet, but since they are not required, we can merge over red here.

Jul 13 '22 13:07 miabbott

Another question, due to my unfamiliarity: when/where does this new test run? Will it be part of CI jobs?

I'm struggling to find evidence that this test was run in any of the CI jobs that ran against the PR. (Furthermore, I can only see the other [sig-coreos] test running as part of e2e-gcp)

Jul 13 '22 13:07 miabbott

/test help

Jul 13 '22 13:07 miabbott

@miabbott: The specified target(s) for /test were not found. The following commands are available to trigger required jobs:

/test e2e-aws-fips
/test e2e-aws-image-registry
/test e2e-aws-jenkins
/test e2e-aws-serial
/test e2e-gcp
/test e2e-gcp-builds
/test e2e-gcp-image-ecosystem
/test e2e-gcp-upgrade
/test extended_gssapi
/test extended_ldap_groups
/test extended_networking
/test images
/test lint
/test verify
/test verify-deps

The following commands are available to trigger optional jobs:

/test e2e-agnostic-cmd
/test e2e-aws
/test e2e-aws-cgroupsv2
/test e2e-aws-csi
/test e2e-aws-csi-migration
/test e2e-aws-disruptive
/test e2e-aws-multitenant
/test e2e-aws-ovn
/test e2e-aws-proxy
/test e2e-aws-single-node
/test e2e-aws-single-node-serial
/test e2e-aws-single-node-upgrade
/test e2e-aws-upgrade
/test e2e-azure
/test e2e-gcp-csi
/test e2e-gcp-disruptive
/test e2e-gcp-fips-serial
/test e2e-gcp-ovn-rt-upgrade
/test e2e-metal-ipi
/test e2e-metal-ipi-ovn-dualstack
/test e2e-metal-ipi-ovn-ipv6
/test e2e-metal-ipi-serial
/test e2e-metal-ipi-serial-ovn-ipv6
/test e2e-metal-ipi-virtualmedia
/test e2e-openstack
/test e2e-openstack-serial
/test e2e-vsphere
/test okd-e2e-gcp

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-origin-master-e2e-agnostic-cmd
pull-ci-openshift-origin-master-e2e-aws-cgroupsv2
pull-ci-openshift-origin-master-e2e-aws-csi
pull-ci-openshift-origin-master-e2e-aws-fips
pull-ci-openshift-origin-master-e2e-aws-serial
pull-ci-openshift-origin-master-e2e-aws-single-node
pull-ci-openshift-origin-master-e2e-aws-single-node-upgrade
pull-ci-openshift-origin-master-e2e-gcp
pull-ci-openshift-origin-master-e2e-gcp-builds
pull-ci-openshift-origin-master-e2e-gcp-csi
pull-ci-openshift-origin-master-e2e-gcp-ovn-rt-upgrade
pull-ci-openshift-origin-master-e2e-gcp-upgrade
pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-ipv6
pull-ci-openshift-origin-master-images
pull-ci-openshift-origin-master-lint
pull-ci-openshift-origin-master-verify
pull-ci-openshift-origin-master-verify-deps

In response to this:

/test help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Jul 13 '22 13:07 openshift-ci[bot]

> Another question, due to my unfamiliarity: when/where does this new test run? Will it be part of CI jobs?
>
> I'm struggling to find evidence that this test was run in any of the CI jobs that ran against the PR. (Furthermore, I can only see the other [sig-coreos] test running as part of e2e-gcp)

I'm still working out how/when tests run myself. Ideally, the kdump test should run for this PR. I did see the test run in one of the initial jobs, where it failed. After I reran, the kdump test didn't run and the overall CI check passed.

Jul 13 '22 16:07 gursewak1997

/assign travier

Jul 18 '22 15:07 travier

See e.g. https://github.com/openshift/machine-config-operator/commit/825be33519852121fc1cc94695d1a759fb7e218b which we need to copy into privileged pods now as part of a recent security policy change

Jul 19 '22 17:07 cgwalters

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gursewak1997

Once this PR has been reviewed and has the lgtm label, please assign adambkaplan for approval by writing /assign @adambkaplan in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

test/extended/OWNERS

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

Jul 20 '22 02:07 openshift-ci[bot]

/retest

Jul 21 '22 15:07 gursewak1997

/test

Aug 02 '22 18:08 stbenjam

@stbenjam: The /test command needs one or more targets. The available required, optional, and automatically triggered jobs are the same as those listed in the earlier response to /test help.

In response to this:

/test

Aug 02 '22 18:08 openshift-ci[bot]

We've been bitten hard by some unreliable serial tests lately; I'd feel better if I saw it pass a couple of times on some different configurations.

/test e2e-aws-serial
/test e2e-gcp-fips-serial
/test e2e-gcp-fips-serial
/test e2e-metal-ipi-serial
/test e2e-metal-ipi-serial-ovn-ipv6
/test e2e-openstack-serial

Aug 02 '22 18:08 stbenjam

Note that this test actively makes a node crash, so I'm not sure how we should account for that in general.

Aug 02 '22 18:08 travier

> Note that this test actively makes a node crash, so I'm not sure how we should account for that in general.

We have some synthetic tests that hunt for segfaults and Go panics, but I don't think anything is looking for kernel panics. Does the node recover? I'm wondering if it might not, because machine-config goes degraded...

See this run: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27291/pull-ci-openshift-origin-master-e2e-aws-serial/1550551517881176064

{Operator degraded (RequiredPoolsFailed): Failed to resync 4.12.0-0.ci.test-2022-07-22-185749-ci-op-59ryclwt-latest because: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool worker is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)]  Operator degraded (RequiredPoolsFailed): Failed to resync 4.12.0-0.ci.test-2022-07-22-185749-ci-op-59ryclwt-latest because: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool worker is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)]}

Aug 02 '22 18:08 stbenjam

As it's a fake crash, the node should just reboot after writing the crash dump. The unavailability should be temporary.

Aug 03 '22 11:08 travier

> See this run: prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27291/pull-ci-openshift-origin-master-e2e-aws-serial/1550551517881176064
>
> {Operator degraded (RequiredPoolsFailed): Failed to resync 4.12.0-0.ci.test-2022-07-22-185749-ci-op-59ryclwt-latest because: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool worker is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)]  Operator degraded (RequiredPoolsFailed): Failed to resync 4.12.0-0.ci.test-2022-07-22-185749-ci-op-59ryclwt-latest because: error during syncRequiredMachineConfigPools: [timed out waiting for the condition, error pool worker is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)]}

But indeed, these would be the symptoms if the node did not reboot for any reason.

Aug 03 '22 11:08 travier

/test e2e-aws-disruptive

Aug 08 '22 02:08 gursewak1997

I updated the tags for this test to [Slow] (since the test typically takes more than 5 minutes to finish) and [Disruptive] (since it includes rebooting a node). Also, since any [Disruptive] test is assumed to qualify for the [Serial] label and need not be labelled as both as per this doc, I dropped the [Serial] label. I am not sure which job the kdump test should run in now, because I don't see it in `ci/prow/e2e-aws-disruptive` or `ci/prow/e2e-aws-serial`. I have also updated the test so that it does not leave any degraded nodes after it finishes.

Aug 08 '22 17:08 gursewak1997

@gursewak1997: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-gcp-fips-serial	f1fd75e7c018f7e5452a9eeb20cec0d2fcc3937e	link	false	`/test e2e-gcp-fips-serial`
ci/prow/e2e-metal-ipi-serial-ovn-ipv6	f1fd75e7c018f7e5452a9eeb20cec0d2fcc3937e	link	false	`/test e2e-metal-ipi-serial-ovn-ipv6`
ci/prow/e2e-openstack-serial	f1fd75e7c018f7e5452a9eeb20cec0d2fcc3937e	link	false	`/test e2e-openstack-serial`
ci/prow/e2e-metal-ipi-serial	f1fd75e7c018f7e5452a9eeb20cec0d2fcc3937e	link	false	`/test e2e-metal-ipi-serial`
ci/prow/e2e-aws-single-node-upgrade	d2235fdb6f3102c467ab88832688bd72ab6dfd98	link	false	`/test e2e-aws-single-node-upgrade`
ci/prow/e2e-metal-ipi-ovn-ipv6	d2235fdb6f3102c467ab88832688bd72ab6dfd98	link	false	`/test e2e-metal-ipi-ovn-ipv6`
ci/prow/e2e-gcp-ovn-rt-upgrade	d2235fdb6f3102c467ab88832688bd72ab6dfd98	link	false	`/test e2e-gcp-ovn-rt-upgrade`
ci/prow/e2e-aws-serial	d2235fdb6f3102c467ab88832688bd72ab6dfd98	link	true	`/test e2e-aws-serial`
ci/prow/e2e-aws-single-node	d2235fdb6f3102c467ab88832688bd72ab6dfd98	link	false	`/test e2e-aws-single-node`
ci/prow/e2e-aws-disruptive	d2235fdb6f3102c467ab88832688bd72ab6dfd98	link	false	`/test e2e-aws-disruptive`
ci/prow/e2e-gcp-ovn	d2235fdb6f3102c467ab88832688bd72ab6dfd98	link	true	`/test e2e-gcp-ovn`

Full PR test history. Your PR dashboard.


Aug 31 '22 20:08 openshift-ci[bot]

@gursewak1997: PR needs rebase.


Oct 12 '22 13:10 openshift-merge-robot

@gursewak1997: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-gcp-fips-serial	f1fd75e7c018f7e5452a9eeb20cec0d2fcc3937e	link	false	`/test e2e-gcp-fips-serial`
ci/prow/e2e-metal-ipi-serial-ovn-ipv6	f1fd75e7c018f7e5452a9eeb20cec0d2fcc3937e	link	false	`/test e2e-metal-ipi-serial-ovn-ipv6`
ci/prow/e2e-openstack-serial	f1fd75e7c018f7e5452a9eeb20cec0d2fcc3937e	link	false	`/test e2e-openstack-serial`
ci/prow/e2e-metal-ipi-serial	f1fd75e7c018f7e5452a9eeb20cec0d2fcc3937e	link	false	`/test e2e-metal-ipi-serial`
ci/prow/e2e-aws-single-node-upgrade	d2235fdb6f3102c467ab88832688bd72ab6dfd98	link	false	`/test e2e-aws-single-node-upgrade`
ci/prow/e2e-metal-ipi-ovn-ipv6	d2235fdb6f3102c467ab88832688bd72ab6dfd98	link	false	`/test e2e-metal-ipi-ovn-ipv6`
ci/prow/e2e-gcp-ovn-rt-upgrade	d2235fdb6f3102c467ab88832688bd72ab6dfd98	link	false	`/test e2e-gcp-ovn-rt-upgrade`
ci/prow/e2e-aws-serial	d2235fdb6f3102c467ab88832688bd72ab6dfd98	link	true	`/test e2e-aws-serial`
ci/prow/e2e-aws-single-node	d2235fdb6f3102c467ab88832688bd72ab6dfd98	link	false	`/test e2e-aws-single-node`
ci/prow/e2e-aws-disruptive	d2235fdb6f3102c467ab88832688bd72ab6dfd98	link	false	`/test e2e-aws-disruptive`
ci/prow/e2e-gcp-ovn	d2235fdb6f3102c467ab88832688bd72ab6dfd98	link	true	`/test e2e-gcp-ovn`
ci/prow/unit	d2235fdb6f3102c467ab88832688bd72ab6dfd98	link	true	`/test unit`
ci/prow/e2e-gcp-ovn-image-ecosystem	d2235fdb6f3102c467ab88832688bd72ab6dfd98	link	true	`/test e2e-gcp-ovn-image-ecosystem`
ci/prow/e2e-gcp-ovn-builds	d2235fdb6f3102c467ab88832688bd72ab6dfd98	link	true	`/test e2e-gcp-ovn-builds`
ci/prow/e2e-aws-ovn-image-registry	d2235fdb6f3102c467ab88832688bd72ab6dfd98	link	true	`/test e2e-aws-ovn-image-registry`

Full PR test history. Your PR dashboard.


Nov 04 '22 21:11 openshift-ci[bot]

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Feb 03 '23 01:02 openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

Mar 05 '23 08:03 openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Apr 05 '23 00:04 openshift-bot

@openshift-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen. Mark the issue as fresh by commenting /remove-lifecycle rotten. Exclude this issue from closing again by commenting /lifecycle frozen.

/close


Apr 05 '23 00:04 openshift-ci[bot]