Live migration: don't roll back within the same test as migration
After merging the code and adding the CI jobs for migration and rollback, we noticed that the monitors have a toleration for disruption. This toleration was configured for upgrades. Our job performs both a migration and a rollback, so it's expected that the disruption time will be larger.
Instead of configuring the monitors with a custom disruption duration, simply call the test twice (once for migration, once for rollback) so that each invocation gets new instances of the monitors.
This PR makes the test perform the CNI migration when invoked. Within the CI jobs, we can then call the test twice to perform migration and rollback.
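For illustration, here is a minimal sketch of the shape such a test could take, assuming the migration target is passed in externally; the `TARGET_NETWORK_TYPE` variable, the test names, and the `migrateClusterNetwork` helper are hypothetical and not the PR's actual code:

```go
package livemigration

import (
	"context"
	"os"

	g "github.com/onsi/ginkgo/v2"
	o "github.com/onsi/gomega"
)

// Hypothetical sketch: the CI job invokes `openshift-tests run` twice, once
// with TARGET_NETWORK_TYPE=OVNKubernetes (migration) and once with
// TARGET_NETWORK_TYPE=OpenShiftSDN (rollback). Each invocation constructs its
// own monitor instances, so the measured disruption covers a single migration.
var _ = g.Describe("[sig-network][Feature:NetworkLiveMigration]", func() {
	g.It("should live-migrate the cluster to the requested CNI", func(ctx context.Context) {
		target := os.Getenv("TARGET_NETWORK_TYPE")
		if target == "" {
			g.Skip("TARGET_NETWORK_TYPE not set; nothing to migrate to")
		}
		o.Expect(migrateClusterNetwork(ctx, target)).To(o.Succeed())
	})
})

// migrateClusterNetwork stands in for the real migration logic (patching
// Network.operator.openshift.io and waiting for the rollout to settle).
func migrateClusterNetwork(ctx context.Context, targetType string) error {
	return nil
}
```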
A point to note is that this test is active within the 'all' suite.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: martinkennelly. Once this PR has been reviewed and has the lgtm label, please assign bparees for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
@martinkennelly: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:
| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| ci/prow/e2e-openstack-ovn | 7c1388cc425233124f0c621d935a2cc25607d1b5 | link | false | /test e2e-openstack-ovn |
| ci/prow/e2e-agnostic-ovn-cmd | 7c1388cc425233124f0c621d935a2cc25607d1b5 | link | false | /test e2e-agnostic-ovn-cmd |
| ci/prow/e2e-aws-ovn-single-node-serial | 7c1388cc425233124f0c621d935a2cc25607d1b5 | link | false | /test e2e-aws-ovn-single-node-serial |
| ci/prow/e2e-gcp-ovn-rt-upgrade | 7c1388cc425233124f0c621d935a2cc25607d1b5 | link | false | /test e2e-gcp-ovn-rt-upgrade |
| ci/prow/e2e-aws-ovn-single-node | 7c1388cc425233124f0c621d935a2cc25607d1b5 | link | false | /test e2e-aws-ovn-single-node |
| ci/prow/e2e-aws-ovn-single-node-upgrade | 7c1388cc425233124f0c621d935a2cc25607d1b5 | link | false | /test e2e-aws-ovn-single-node-upgrade |
Full PR test history. Your PR dashboard.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
@dgoodwin PTAL - I need advice regarding this test now being included in the "all" suite. It will migrate from OVN-K to SDN, which is... weird. I'd rather disable it for the 'all' suite and only allow execution of this test when the migration suite is called specifically.
I checked, and it's not selected for any other suite.
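For context, openshift-tests assembles suites largely by matching on test names, so giving this test its own suite and keeping it out of 'all' could boil down to a pair of name matchers along these lines; the tag and function names below are assumptions, not the actual suite definitions:

```go
package suites

import "strings"

// Hypothetical name-based matchers, mirroring how openshift-tests selects
// tests into suites. The live-migration test carries its own tag, a dedicated
// suite opts into it, and the "all" suite filters it out so the OVN-K -> SDN
// migration only runs when explicitly requested.
const liveMigrationTag = "[Feature:NetworkLiveMigration]"

func matchesLiveMigrationSuite(testName string) bool {
	return strings.Contains(testName, liveMigrationTag)
}

func matchesAllSuite(testName string) bool {
	return !matchesLiveMigrationSuite(testName)
}
```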
Can you paste links to job runs where you are noticing increased disruption? I would like to see how serious the delta is. Does the team feel it's acceptable to be taking this increased hit in disruption during this migration? Is this essentially a bug that should be fixed?
This is quite complicated, and I'm trying to figure out what to do. There are no tolerations per se; we just collect data and then slice it on various dimensions, but there is no dimension for OVN->SDN migrations. Creating the monitors multiple times is unlikely to help, as the data is still aggregated for the job run and submitted to the BigQuery database, which later feeds into the limits the tests fail on. That is likely the "toleration" you are referring to. We might have to explicitly exclude all of your results based on a job naming convention.
Jumping ahead, I think we need to (a) find a way to disable the disruption testing for your job, based on something we can identify in https://github.com/openshift/origin/blob/master/pkg/monitortestlibrary/disruptionlibrary/disruption_invariant_adapter.go#L224, so those tests are skipped.
This will allow us to still collect data for your job, so we can answer the question of how much disruption a customer would expect going through this process.
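One way (a) could look is a small predicate that the adapter consults before turning disruption results into failures; the function name and the `live-migration` substring are assumptions, though Prow does expose the job name via the `JOB_NAME` environment variable:

```go
package disruptionlibrary

import (
	"os"
	"strings"
)

// shouldSkipDisruptionEvaluation is a hypothetical guard for the junit
// evaluation in disruption_invariant_adapter.go: when the Prow job name
// follows the agreed naming convention, disruption data is still collected
// and uploaded, but the pass/fail check against historical limits is skipped.
func shouldSkipDisruptionEvaluation() bool {
	jobName := os.Getenv("JOB_NAME") // set by Prow in the test pod
	return strings.Contains(jobName, "live-migration")
}
```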
Then (b) we need to filter out your jobs via a consistent naming scheme in our scheduled queries that report disruption data daily. This will keep your results out of the dashboard and regression detection.
These tests can't be part of the 'all' suite, AFAICT; they would need their own suite. Does this sound reasonable?
> Can you paste links to job runs where you are noticing increased disruption?
https://prow.ci.openshift.org/job-history/gs/test-platform-results/pr-logs/directory/pull-ci-openshift-cluster-network-operator-master-e2e-aws-live-migration-sdn-ovn-rollback
> Does the team feel it's acceptable to be taking this increased hit in disruption during this migration?
This increased disruption sometimes happens (about 1 in 4 or 1 in 5 runs) when we migrate to a CNI and then roll back, so essentially it's two migrations. We do not see it during a single migration; all is good there.
> Creating the monitors multiple times is unlikely to help, as the data is still aggregated for the job run and submitted to the BigQuery database, which later feeds into the limits the tests fail on.
I see. Thanks for this info. I thought that every time we exec `openshift-tests run ..` within the job, it would aggregate and calculate disruption for each invocation separately.
> These tests can't be part of the 'all' suite, AFAICT; they would need their own suite. Does this sound reasonable?
yes
Issues go stale after 90d of inactivity.
Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.
If this issue is safe to close now please do so with /close.
/lifecycle stale
/remove-lifecycle stale