
Live migration: don't roll back within the same test as migration

Open martinkennelly opened this issue 1 year ago • 11 comments

After merging the code and applying the jobs for migration and rollback, we noticed that the monitors have a toleration for disruption. This toleration was configured for upgrades. Since our job performs both migration and rollback, it's expected that the disruption time will be larger.

Instead of configuring a custom disruption duration for the monitors, simply call the test twice, once for migration and once for rollback, so that each invocation gets new instances of the monitors.

This PR makes the test perform the CNI migration when it is called. Within the CI jobs, we can then simply call this test twice to perform migration and rollback.
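As a rough illustration of that shape, here is a minimal sketch of what an on-demand migration test could look like. The suite label, the TARGET_CNI environment variable, and the migrateClusterNetworkTo helper are all hypothetical placeholders for this sketch, not the actual code in this PR:

```go
package migration

import (
	"os"

	g "github.com/onsi/ginkgo/v2"
	o "github.com/onsi/gomega"
)

// Hypothetical sketch: the target CNI is passed in by the CI job, so the same
// test can be invoked once for migration (OVN-K -> SDN) and once more for
// rollback (SDN -> OVN-K), with each invocation getting fresh monitors.
var _ = g.Describe("[Suite:openshift/network/live-migration] CNI live migration", func() {
	g.It("migrates the cluster network to the requested CNI", func() {
		target := os.Getenv("TARGET_CNI") // e.g. "OVNKubernetes" or "OpenShiftSDN"
		if target == "" {
			g.Skip("TARGET_CNI not set; nothing to migrate")
		}
		err := migrateClusterNetworkTo(target)
		o.Expect(err).NotTo(o.HaveOccurred())
	})
})

// migrateClusterNetworkTo stands in for the PR's real migration logic; it is
// defined here only so the sketch is self-contained.
func migrateClusterNetworkTo(target string) error {
	// The real test would patch the cluster Network config and wait for the
	// migration to roll out before returning.
	return nil
}
```

The CI job would then run this test twice, first with one target and then with the other, so the rollback leg never shares monitor instances with the migration leg.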

A point to note is that this test is active within the 'all' suite.

martinkennelly avatar Jan 26 '24 15:01 martinkennelly

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: martinkennelly. Once this PR has been reviewed and has the lgtm label, please assign bparees for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] avatar Jan 26 '24 15:01 openshift-ci[bot]

@martinkennelly: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-openstack-ovn | 7c1388cc425233124f0c621d935a2cc25607d1b5 | link | false | /test e2e-openstack-ovn |
| ci/prow/e2e-agnostic-ovn-cmd | 7c1388cc425233124f0c621d935a2cc25607d1b5 | link | false | /test e2e-agnostic-ovn-cmd |
| ci/prow/e2e-aws-ovn-single-node-serial | 7c1388cc425233124f0c621d935a2cc25607d1b5 | link | false | /test e2e-aws-ovn-single-node-serial |
| ci/prow/e2e-gcp-ovn-rt-upgrade | 7c1388cc425233124f0c621d935a2cc25607d1b5 | link | false | /test e2e-gcp-ovn-rt-upgrade |
| ci/prow/e2e-aws-ovn-single-node | 7c1388cc425233124f0c621d935a2cc25607d1b5 | link | false | /test e2e-aws-ovn-single-node |
| ci/prow/e2e-aws-ovn-single-node-upgrade | 7c1388cc425233124f0c621d935a2cc25607d1b5 | link | false | /test e2e-aws-ovn-single-node-upgrade |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci[bot] avatar Jan 26 '24 18:01 openshift-ci[bot]

@dgoodwin PTAL - I need advice regarding this test now being included in the "all" suite. It will migrate from OVN-K to SDN, which is a bit weird. I'd rather disable it for the all suite and only allow this test to execute when the migration suite is called specifically.

I checked and it's not selected for any other suite.

martinkennelly avatar Jan 29 '24 10:01 martinkennelly

Can you paste links to job runs where you are noticing increased disruption? I would like to see how serious the delta is. Does the team feel it's acceptable to be taking this increased hit in disruption during this migration? Essentially, is it a bug that should actually be fixed?

This is quite complicated, and I'm trying to figure out what to do. There are no tolerations per se; we just collect data and then slice it on various dimensions, but there is no dimension for ovn->sdn migrations. Creating the monitors multiple times is unlikely to help, as the data is still aggregated for the job run and submitted to the BigQuery database, which later feeds into the limits the tests will fail on. That is likely the toleration you are referring to. We might have to explicitly exclude all your results based on a job naming convention.

dgoodwin avatar Jan 30 '24 13:01 dgoodwin

Jumping ahead, I think we need to (a) find a way to disable the disruption testing for your job, based on something we can identify in https://github.com/openshift/origin/blob/master/pkg/monitortestlibrary/disruptionlibrary/disruption_invariant_adapter.go#L224, so we can skip those tests.

This will still allow us to collect data for your job, so we can answer the question of how much disruption a customer would expect going through this process.
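As a hedged sketch of idea (a): the adapter could skip the disruption pass/fail checks whenever it can tell the run is a live-migration job, for example from the Prow job name. The JOB_NAME environment variable and the helper names below are assumptions for illustration, not the adapter's actual API:

```go
package disruptionlibrary

import (
	"os"
	"strings"
)

// isLiveMigrationJob is a hypothetical helper: it guesses from the job name
// (assumed here to be exposed to the test process as JOB_NAME) whether the
// run is an SDN<->OVN live-migration or rollback job.
func isLiveMigrationJob() bool {
	return strings.Contains(os.Getenv("JOB_NAME"), "live-migration")
}

// shouldSkipDisruptionChecks would be consulted before turning disruption
// results into failing tests, so the data is still collected for the job
// but never fails it.
func shouldSkipDisruptionChecks() bool {
	return isLiveMigrationJob()
}
```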

Then (b) we need to filter out your jobs via a consistent naming scheme in our scheduled queries that report disruption data daily. This will keep your results out of the dashboard and regression detection.
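Point (b) is the reporting side: if the live-migration jobs follow a consistent naming scheme, the daily scheduled queries can drop them before the results feed the dashboards and regression detection. A small sketch of that kind of filter, with the naming convention assumed for illustration:

```go
package reporting

import "strings"

// excludedFromDisruptionReport is a hypothetical filter mirroring the
// naming-convention idea: any job whose name marks it as a live-migration
// (or rollback) run is kept out of the disruption reports.
func excludedFromDisruptionReport(jobName string) bool {
	return strings.Contains(jobName, "live-migration")
}
```

For example, the rollback job linked later in this thread ("...e2e-aws-live-migration-sdn-ovn-rollback") would be excluded by such a filter.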

These tests can't be a part of the all suite afaict; they would need to be their own suite. Does this sound reasonable?

dgoodwin avatar Jan 30 '24 15:01 dgoodwin

> Can you paste links to job runs where you are noticing increased disruption?

https://prow.ci.openshift.org/job-history/gs/test-platform-results/pr-logs/directory/pull-ci-openshift-cluster-network-operator-master-e2e-aws-live-migration-sdn-ovn-rollback

> Does the team feel it's acceptable to be taking this increased hit in disruption during this migration?

This increased disruption sometimes happens (1 in 4 or 1 in 5 runs) when we migrate to a CNI and then roll back, so essentially it's two migrations. We do not see it during a single migration; all is good there.

> Creating the monitors multiple times is unlikely to help as the data is still aggregated for the job run and submitted to the BigQuery database, which later feeds into the limits the tests will fail on.

I see, thanks for this info. I thought that every time we exec openshift-tests run .. within the job, it would aggregate and calculate disruption separately for each invocation.

martinkennelly avatar Jan 30 '24 16:01 martinkennelly

> These tests can't be a part of the all suite afaict, they would need to be their own. Does this sound reasonable?

yes

martinkennelly avatar Jan 30 '24 16:01 martinkennelly

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot avatar Apr 30 '24 01:04 openshift-bot

/remove-lifecycle stale

martinkennelly avatar Apr 30 '24 08:04 martinkennelly

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot avatar Jul 29 '24 09:07 openshift-bot

/remove-lifecycle stale

martinkennelly avatar Jul 29 '24 09:07 martinkennelly