fix: experiment services deletion before reconciling traffic routing in rollouts
Problem Statement This PR addresses an intermittent issue observed in weighted experiments where Service objects associated with the experiments are prematurely deleted before their corresponding weight updates are fully propagated to the HTTPProxy resource. This race condition can result in temporary downtime or traffic misrouting.
Root Cause The issue stems from the parallel reconciliation of HTTPProxy and experiment Service resources by the Rollouts Controller and Experiments Controller, respectively. Specifically, it was observed that the Service objects are sometimes deleted before the HTTPProxy has been updated with the correct traffic weights. This lack of synchronization between the two controllers can cause a window where traffic is routed to non-existent backend services.
Proposed Solution To enforce proper ordering and avoid premature deletion, this PR leverages the existing spec.terminate flag in the Experiment spec. This flag is only set by the Rollouts Controller when it is safe to begin termination of an experiment. By adding a conditional check for spec.terminate == true before deleting experiment Service objects, we ensure that services are only removed after the rollout has finalized HTTPProxy updates and it is safe to do so. We have introduced a new flag to ensure that only experiments created by rollouts are affected.
This solution provides a lightweight and non-invasive mechanism to ensure the correct sequence of events, thereby preventing downtime during traffic shifting in weighted experiments.
Summary
- Ensures experiment services are deleted only when spec.terminate is set by the Rollouts Controller.
- Aligns service deletion with HTTPProxy updates to maintain traffic continuity.
- Fixes a race condition without introducing additional coordination overhead between controllers.
Checklist:
- [ ] Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
- [ ] The title of the PR is (a) conventional with a list of types and scopes found here, (b) states what changed, and (c) suffixes the related issues number. E.g.
"fix(controller): Updates such and such. Fixes #1234". - [ ] I've signed my commits with DCO
- [ ] I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
- [ ] My builds are green. Try syncing with master if they are not.
- [ ] My organization is added to USERS.md.
Published E2E Test Results
4 files 4 suites 3h 17m 52s ⏱️ 115 tests 105 ✅ 7 💤 3 ❌ 470 runs 432 ✅ 28 💤 10 ❌
For more details on these failures, see this check.
Results for commit 58cd30d2.
:recycle: This comment has been updated with latest results.
Published Unit Test Results
2 312 tests 2 312 ✅ 3m 1s ⏱️ 129 suites 0 💤 1 files 0 ❌
Results for commit 58cd30d2.
:recycle: This comment has been updated with latest results.
@Abhish2702 do you have logs of the controller when the problem occurs? I would like to see the evidence of the controller decisioning leading up to the outage.
@jessesuen could you please revisit the PR ? We agree with your solution that svc deletion should only check replicas availability.
The e2e tests (e.g. TestExperimentSuite/TestExperimentWithServiceNameAndScaleDownDelay) are failing because they expect the previous behavior where services are deleted immediately:
Experiment expectation 'experiment services count == 0' failed
The test will need to be adjusted so that it waits for available replicas reaching zero before checking.
@jessesuen I have fixed the approach and e2e
Quality Gate passed
Issues
0 New issues
0 Accepted issues
Measures
0 Security Hotspots
0.0% Coverage on New Code
12.7% Duplication on New Code