
Live migration: don't roll back within the same test as migration

Open martinkennelly opened this issue 1 year ago • 11 comments

After merging the code and applying the jobs for migration and rollback, we noticed that the monitors have a toleration for disruption. This toleration was configured for upgrades. Since our job performs both migration and rollback, it's expected that the disruption time will be larger.

Instead of configuring a custom disruption duration for the monitors, simply call the test twice, once for migration and once for rollback, so that each invocation gets new instances of the monitors.

This PR makes the test perform the CNI migration when it is called. Within the CI jobs, we can then simply call this test twice to perform migration and rollback.
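As a rough illustration of that shape, here is a minimal sketch of what an on-demand migration test could look like. The suite label, the TARGET_CNI environment variable, and the migrateClusterNetworkTo helper are all hypothetical placeholders for this sketch, not the actual code in this PR:

```go
package migration

import (
	"os"

	g "github.com/onsi/ginkgo/v2"
	o "github.com/onsi/gomega"
)

// Hypothetical sketch: the target CNI is passed in by the CI job, so the same
// test can be invoked once for migration (OVN-K -> SDN) and once more for
// rollback (SDN -> OVN-K), with each invocation getting fresh monitors.
var _ = g.Describe("[Suite:openshift/network/live-migration] CNI live migration", func() {
	g.It("migrates the cluster network to the requested CNI", func() {
		target := os.Getenv("TARGET_CNI") // e.g. "OVNKubernetes" or "OpenShiftSDN"
		if target == "" {
			g.Skip("TARGET_CNI not set; nothing to migrate")
		}
		err := migrateClusterNetworkTo(target)
		o.Expect(err).NotTo(o.HaveOccurred())
	})
})

// migrateClusterNetworkTo stands in for the PR's real migration logic; it is
// defined here only so the sketch is self-contained.
func migrateClusterNetworkTo(target string) error {
	// The real test would patch the cluster Network config and wait for the
	// migration to roll out before returning.
	return nil
}
```

The CI job would then run this test twice, first with one target and then with the other, so the rollback leg never shares monitor instances with the migration leg.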

A point to note is that this test is active within the 'all' suite.

martinkennelly avatar Jan 26 '24 15:01 martinkennelly

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: martinkennelly. Once this PR has been reviewed and has the lgtm label, please assign bparees for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

openshift-ci[bot] avatar Jan 26 '24 15:01 openshift-ci[bot]

@martinkennelly: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-openstack-ovn | 7c1388cc425233124f0c621d935a2cc25607d1b5 | link | false | /test e2e-openstack-ovn |
| ci/prow/e2e-agnostic-ovn-cmd | 7c1388cc425233124f0c621d935a2cc25607d1b5 | link | false | /test e2e-agnostic-ovn-cmd |
| ci/prow/e2e-aws-ovn-single-node-serial | 7c1388cc425233124f0c621d935a2cc25607d1b5 | link | false | /test e2e-aws-ovn-single-node-serial |
| ci/prow/e2e-gcp-ovn-rt-upgrade | 7c1388cc425233124f0c621d935a2cc25607d1b5 | link | false | /test e2e-gcp-ovn-rt-upgrade |
| ci/prow/e2e-aws-ovn-single-node | 7c1388cc425233124f0c621d935a2cc25607d1b5 | link | false | /test e2e-aws-ovn-single-node |
| ci/prow/e2e-aws-ovn-single-node-upgrade | 7c1388cc425233124f0c621d935a2cc25607d1b5 | link | false | /test e2e-aws-ovn-single-node-upgrade |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

openshift-ci[bot] avatar Jan 26 '24 18:01 openshift-ci[bot]

@dgoodwin PTAL - I need advice regarding this test now being included in the "all" suite. It will migrate from OVN-K to SDN, which is a bit weird. I'd rather disable it for the all suite and only allow this test to execute when the migration suite is called specifically.

I checked and it's not selected for any other suite.

martinkennelly avatar Jan 29 '24 10:01 martinkennelly

Can you paste links to job runs where you are noticing increased disruption? I would like to see how serious the delta is. Does the team feel it's acceptable to be taking this increased hit in disruption during this migration? Essentially, is it a bug that should actually be fixed?

This is quite complicated, and I'm trying to figure out what to do. There are no tolerations per se; we just collect data and then slice it on various dimensions, but there is no dimension for ovn->sdn migrations. Creating the monitors multiple times is unlikely to help, as the data is still aggregated for the job run and submitted to the BigQuery database, which later feeds into the limits the tests will fail on. That is likely the toleration you are referring to. We might have to explicitly exclude all your results based on a job naming convention.

dgoodwin avatar Jan 30 '24 13:01 dgoodwin

Jumping ahead, I think we need to (a) find a way to disable the disruption testing for your job, based on something we can identify in https://github.com/openshift/origin/blob/master/pkg/monitortestlibrary/disruptionlibrary/disruption_invariant_adapter.go#L224, so we can skip those tests.

This will still allow us to collect data for your job, so we can answer the question of how much disruption a customer would expect going through this process.
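As a hedged sketch of idea (a): the adapter could skip the disruption pass/fail checks whenever it can tell the run is a live-migration job, for example from the Prow job name. The JOB_NAME environment variable and the helper names below are assumptions for illustration, not the adapter's actual API:

```go
package disruptionlibrary

import (
	"os"
	"strings"
)

// isLiveMigrationJob is a hypothetical helper: it guesses from the job name
// (assumed here to be exposed to the test process as JOB_NAME) whether the
// run is an SDN<->OVN live-migration or rollback job.
func isLiveMigrationJob() bool {
	return strings.Contains(os.Getenv("JOB_NAME"), "live-migration")
}

// shouldSkipDisruptionChecks would be consulted before turning disruption
// results into failing tests, so the data is still collected for the job
// but never fails it.
func shouldSkipDisruptionChecks() bool {
	return isLiveMigrationJob()
}
```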

Then (b) we need to filter out your jobs via a consistent naming scheme in our scheduled queries that report disruption data daily. This will keep your results out of the dashboard and regression detection.
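Point (b) is the reporting side: if the live-migration jobs follow a consistent naming scheme, the daily scheduled queries can drop them before the results feed the dashboards and regression detection. A small sketch of that kind of filter, with the naming convention assumed for illustration:

```go
package reporting

import "strings"

// excludedFromDisruptionReport is a hypothetical filter mirroring the
// naming-convention idea: any job whose name marks it as a live-migration
// (or rollback) run is kept out of the disruption reports.
func excludedFromDisruptionReport(jobName string) bool {
	return strings.Contains(jobName, "live-migration")
}
```

For example, the rollback job linked later in this thread ("...e2e-aws-live-migration-sdn-ovn-rollback") would be excluded by such a filter.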

These tests can't be a part of the all suite afaict; they would need to be their own suite. Does this sound reasonable?

dgoodwin avatar Jan 30 '24 15:01 dgoodwin

> Can you paste links to job runs where you are noticing increased disruption?

https://prow.ci.openshift.org/job-history/gs/test-platform-results/pr-logs/directory/pull-ci-openshift-cluster-network-operator-master-e2e-aws-live-migration-sdn-ovn-rollback

> Does the team feel it's acceptable to be taking this increased hit in disruption during this migration?

This increased disruption sometimes happens (1 in 4 or 1 in 5 runs) when we migrate to a CNI and then roll back, so essentially it's two migrations. We do not see it during a single migration; all is good there.

> Creating the monitors multiple times is unlikely to help as the data is still aggregated for the job run and submitted to the BigQuery database, which later feeds into the limits the tests will fail on.

I see, thanks for this info. I thought that every time we exec openshift-tests run .. within the job, it would aggregate and calculate disruption separately for each invocation.

martinkennelly avatar Jan 30 '24 16:01 martinkennelly

> These tests can't be a part of the all suite afaict, they would need to be their own. Does this sound reasonable?

yes

martinkennelly avatar Jan 30 '24 16:01 martinkennelly

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot avatar Apr 30 '24 01:04 openshift-bot

/remove-lifecycle stale

martinkennelly avatar Apr 30 '24 08:04 martinkennelly

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot avatar Jul 29 '24 09:07 openshift-bot

/remove-lifecycle stale

martinkennelly avatar Jul 29 '24 09:07 martinkennelly