                        Bug: dependency waiting blocks too aggressively
Problems
The applier/destroyer pipelines add Apply/Prune tasks followed by Wait tasks, waiting for each apply/prune group to become Current (after apply) or NotFound (after prune). Combined with complex resource sets that have multiple dependency branches, this behavior has the following impact (see the sketch after this list):
1. apply/prune can be blocked for some objects even when their own dependencies are ready
2. each blocking wait lasts as long as the slowest reconciliation in the previous apply phase, even when it doesn't time out
3. because a wait timeout currently causes the pipeline to terminate, any object that causes a wait timeout blocks all objects in subsequent phases from being applied/pruned
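Here's a minimal sketch of the flattened queue this produces (the types here are hypothetical, for illustration only, not the actual cli-utils task runner API):

```go
// Hypothetical sketch of the flattened, synchronous pipeline described above.
package main

import "fmt"

type Task interface{ Name() string }

type ApplyTask struct{ Objects []string }

type WaitTask struct {
	Objects   []string
	Condition string // "Current" after apply, "NotFound" after prune
}

func (t ApplyTask) Name() string { return fmt.Sprintf("apply %v", t.Objects) }
func (t WaitTask) Name() string {
	return fmt.Sprintf("wait for %v to be %s", t.Objects, t.Condition)
}

func main() {
	// The resource set is flattened into one synchronous queue: every object
	// in phase N+1 waits for *all* of phase N, even objects in unrelated
	// dependency branches.
	queue := []Task{
		ApplyTask{Objects: []string{"a", "b"}},
		WaitTask{Objects: []string{"a", "b"}, Condition: "Current"},
		ApplyTask{Objects: []string{"c"}}, // c depends only on a, but still waits for b
		WaitTask{Objects: []string{"c"}, Condition: "Current"},
	}
	for _, t := range queue {
		// Executed strictly in order; a stuck or failed wait blocks everything after it.
		fmt.Println(t.Name())
	}
}
```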
Example 1
For example, here's an infra dependency chain with branches (just deploying two GKE clusters with a Config Connector (CC) cluster):
- namespace
- rbac
- GKE cluster1
  - node pool1 (depends-on cluster1)
- GKE cluster2
  - node pool2 (depends-on cluster2)

If any cluster fails to apply (ex: error in config), then both node pools are blocked.
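For reference, here's a rough sketch of how one such branch is declared with the depends-on annotation on the node pool. The Config Connector group/version/kind, the "infra" namespace, and the exact annotation value format shown here are assumptions for illustration; see the cli-utils depends-on docs for the authoritative syntax.

```go
// Illustrative only: a node pool object that declares a dependency on its
// cluster via cli-utils' depends-on annotation. The GVK, namespace, and
// annotation value format below are assumptions for this sketch.
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

func main() {
	nodePool := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "container.cnrm.cloud.google.com/v1beta1",
		"kind":       "ContainerNodePool",
		"metadata": map[string]interface{}{
			"name":      "node-pool1",
			"namespace": "infra",
			"annotations": map[string]interface{}{
				// "do not apply me until cluster1 is Current"
				"config.kubernetes.io/depends-on": "container.cnrm.cloud.google.com/namespaces/infra/ContainerCluster/cluster1",
			},
		},
	}}
	fmt.Println(nodePool.GetAnnotations())
}
```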
Example 2
Another example is a single apply that includes multiple namespaces, or CRDs plus namespaced resources:
- CRD
- namespace1
  - deployment1
- namespace2
  - deployment2

If any CRD or any namespace fails to apply (ex: blocked by a policy webhook or a config error), then all the deployments and everything else in those namespaces are blocked.
Possible solutions
Continue on wait timeout
- This helps with problems 1 and 3, but not problem 2, and it makes failure take longer, because all resources will be applied even when we know their dependency isn't ready.
Dependency filter
- This would help remediate the side effect of "continue on wait timeout" by skipping apply/prune for objects with unreconciled dependencies
- This would need to be used on both apply and prune
- This would need to share logic with the graph sorter, which currently handles identifying dependencies (depends-on, apply-time-mutation, CRDs, and namespaces)
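A minimal sketch of the idea, assuming placeholder types rather than the real cli-utils filter interface and status machinery:

```go
// Hypothetical dependency filter sketch: skip apply for objects whose declared
// dependencies have not reconciled. Types and statuses are placeholders, not
// the actual cli-utils filter interface.
package main

import "fmt"

type objID string

type dependencyFilter struct {
	// Dependencies per object, computed by the same logic the graph sorter
	// uses (depends-on, apply-time-mutation, CRDs, namespaces).
	deps map[objID][]objID
	// Last observed reconcile status per object ("Current", "InProgress", ...).
	status map[objID]string
}

// filter reports whether the object should be skipped, and why.
func (f dependencyFilter) filter(id objID) (bool, string) {
	for _, dep := range f.deps[id] {
		if f.status[dep] != "Current" {
			return true, fmt.Sprintf("dependency %s is %s, not Current", dep, f.status[dep])
		}
	}
	return false, ""
}

func main() {
	f := dependencyFilter{
		deps:   map[objID][]objID{"node-pool1": {"cluster1"}},
		status: map[objID]string{"cluster1": "InProgress"},
	}
	if skip, reason := f.filter("node-pool1"); skip {
		// Only node-pool1 is skipped; unrelated objects keep applying.
		fmt.Println("skipping node-pool1:", reason)
	}
}
```

For prune, the check would be inverted: skip deleting an object while objects that depend on it still exist.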
Parallel apply
- Applying objects in parallel during each Apply Task (and deleting in parallel for prune tasks) would speed up e2e apply time significantly, helping to mitigate dependency cost, but not actually solving either problem 1 or problem 2
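A rough sketch of bounded parallelism within a single apply task, assuming a hypothetical applyOne helper in place of the real server-side apply call:

```go
// Sketch: apply the objects of one task concurrently instead of one at a time.
// applyOne is a hypothetical stand-in for whatever performs a single apply;
// it is not a real cli-utils function.
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

func applyOne(ctx context.Context, obj string) error {
	fmt.Println("applying", obj) // placeholder for the real apply call
	return nil
}

func applyGroup(ctx context.Context, objs []string) error {
	g, ctx := errgroup.WithContext(ctx)
	g.SetLimit(4) // bound concurrency so the API server isn't flooded
	for _, obj := range objs {
		obj := obj // capture loop variable (pre-Go 1.22 semantics)
		g.Go(func() error { return applyOne(ctx, obj) })
	}
	// Ordering within one task doesn't matter: dependencies are only ever
	// expressed between phases, never inside a single apply group.
	return g.Wait()
}

func main() {
	_ = applyGroup(context.Background(), []string{"namespace", "rbac", "cluster1", "cluster2"})
}
```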
Async (graph) apply
- Building a branching pipeline, instead of flattening into a single synchronous pipeline, would isolate dependencies so that they only block the objects that depend on them.
- This is probably the best solution, but also the most complex and risky.
- This would probably require changing how events are tested/used. Consuming them in a single for loop might not be the best strategy any more; async listeners might be better.
- This would make both Dependency filter and Parallel apply obsolete.
- This might also make Continue on wait timeout obsolete, because there might be more than one task branch executing and one could terminate even if the others continue.
- This is the only proposal that solves problem 2, which is going to become a bigger problem as soon as people start using depends-on and apply-time-mutation on infra resources (like GKE clusters) that take a long time to reconcile.
- This solution might make it easier to add support for advanced features like blocking on Job completion or lifecycle directives (ex: delete before apply)
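A very rough sketch of the scheduling idea, with placeholder types and no real event handling (as noted above, a proper design would need its own doc):

```go
// Sketch of graph-based scheduling: each object is applied as soon as all of
// its *own* dependencies are Current, so one stuck branch does not block
// unrelated branches. All types here are illustrative, not a proposed API.
package main

import (
	"fmt"
	"sync"
)

type node struct {
	id   string
	deps []string
}

type result struct {
	done chan struct{} // closed when the node finishes (success or failure)
	ok   bool          // written before done is closed
}

// applyAndWait stands in for "apply the object, then watch status until it
// reconciles"; it returns whether the object became Current.
func applyAndWait(id string) bool {
	fmt.Println("apply + wait:", id)
	return true
}

func run(nodes []node) {
	results := make(map[string]*result)
	for _, n := range nodes {
		results[n.id] = &result{done: make(chan struct{})}
	}
	var wg sync.WaitGroup
	for _, n := range nodes {
		n := n
		res := results[n.id]
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer close(res.done)
			for _, dep := range n.deps {
				d := results[dep]
				<-d.done // block only on this node's own dependencies
				if !d.ok {
					fmt.Println("skip", n.id, "- dependency", dep, "did not reconcile")
					return // res.ok stays false, which skips dependents in turn
				}
			}
			res.ok = applyAndWait(n.id)
		}()
	}
	wg.Wait()
}

func main() {
	// The two GKE branches from Example 1: a failure in cluster1 would only
	// skip node-pool1; cluster2 and node-pool2 proceed independently.
	run([]node{
		{id: "cluster1"},
		{id: "node-pool1", deps: []string{"cluster1"}},
		{id: "cluster2"},
		{id: "node-pool2", deps: []string{"cluster2"}},
	})
}
```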
@mortent @haiyanmeng @seans3 I've tried to document the motivation behind these proposals we discussed, but haven't gone into a lot of detail about how they might work.
Should we tackle all of them in top down order? Or should we skip straight to the hardest one which makes the others obsolete? Got other ideas that would resolve the same issues?
So I don't think we should do "Continue on wait timeout" (where we essentially continue applying all resources after a timeout) as the default at all. It effectively breaks depends-on and apply-time mutations.
The dependency filter seems like a good temporary fix if we feel async apply is complex. It should have correct behavior, although there are some drawbacks.
The parallel apply is ok if we think it significantly improves performance. Otherwise I'm not convinced it is worth the effort.
Long term I think the async apply is the best solution. It does address several issues with both behavior and performance. It probably needs a design doc to get into some more detail about how it should work.
Morten is wrong, and we should "Continue on wait timeout". While it does not work for depends-on and apply-time mutations, these are ALREADY BROKEN. We already continue on error--just not for wait timeouts. This change would NOT make the current situation worse. But it will be strictly better than the current task execution stoppage for a wait timeout. And it will fix the long-standing bug of not running the final inventory task on wait timeout.
The correct error handling for depends-on and apply-time-mutation is something we should work toward. But in the short term, "continuing on wait timeout" is clearly better than the current state. I will be merging it shortly.
As far as the other priorities, I believe the async apply, since it will be the longest and hardest, should not be our first effort.
The PR in question doesn't make it the default, so I'm ok with that change. It has another issue around how we use the error event, which I have pointed out on the PR.
I don't disagree that some of the behavior is broken today; I just think we should do the more permanent fix with the dependency filter (since we all agree the async apply is a bigger change). The current behavior has been that way for a while, so I think we can take the time to fix it properly.
Per discussion elsewhere, we decided to change the behavior with https://github.com/kubernetes-sigs/cli-utils/pull/451 to always continue on wait timeouts (and other apply/destroy errors, as was the previous behavior).
This is a change in behavior, but the previous behavior was considered broken (or at least internally inconsistent).
Work on Dependency Filter and Parallel Apply should be unblocked. Work on them is mostly a question of priority. It sounds like there's disagreement on whether to do those incremental changes first or just skip to the Async Apply.
Status update...
Continue on wait timeout has already been implemented:
- https://github.com/kubernetes-sigs/cli-utils/pull/451
- https://github.com/kubernetes-sigs/cli-utils/pull/493
We also want to add an optional Continue on Validation Error feature as part of the WIP validation redesign: https://github.com/kubernetes-sigs/cli-utils/pull/488
Dependency Filter is planned and will become more urgent once we have Continue on Validation Error.
Parallel Apply is also planned (Q1 2022), but is primarily a speed concern, not a correctness concern.
Async (graph) apply will hopefully be implemented eventually, because it solves multiple problems, but it hasn't been scheduled yet and may take a long time to get right.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
Remaining fixes are still TODO
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
/remove-lifecycle rotten
/lifecycle frozen
FWIW, the dependency filter was added. Also, now that QPS has been disabled, there’s not much value in adding parallel apply unless you want to add even more server load. The ideal solution will be to eventually rewrite the apply scheduler to use an asynchronous graph.
Unfortunately, it’s not high on the priority list yet, because of other issues. But hopefully we’ll get around to it soon.