descheduler
Proposal: Long-lived deployment: Triggering Descheduler with Events (SharedIndexInformer rules, endpoints, etc)
Is your feature request related to a problem? Please describe.
When I think about a problem that requires taking action when the Descheduler does something, my first target is one of the events it triggers, such as:
- New Node Joined
- Service Removed
- Deployment Delete
- HPA Update
- Trigger manually by calling an endpoint
- etc
(Just randomly filled these as an example, not meant to represent the use-cases.)
Describe the solution you'd like
To stay informed about when these events get triggered, we can use a primitive exposed by Kubernetes client-go called NewSharedIndexInformer, in the cache package. For example:
```go
informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc:    func(obj interface{}) { trigger() },
	DeleteFunc: func(obj interface{}) { trigger() },
})
go informer.Run(...)

// onAdd is the function executed when the Kubernetes informer notices
// the presence of a new Kubernetes node in the cluster.
func onAdd(obj interface{}) {
	node := obj.(*corev1.Node)
	...
}
```
Describe alternatives you've considered
In the current architecture, node.go already handles this concern when we schedule the Descheduler to run as a cron job. But consider the motivation here: if we set the Descheduler to run every hour or once a week, and we delete a service, add a new node, etc., there will be a time window between when the event occurred and the last time the Descheduler ran. That means we have to wait hours or days until the next run.
What version of descheduler are you using?
descheduler version: v0.22.1
Additional context
-
Hi @Dentrax, this is something we have toyed with before, so we're definitely interested in it. See the other issue https://github.com/kubernetes-sigs/descheduler/issues/489 and the discussion in the matching PR, my first attempt to implement it: https://github.com/kubernetes-sigs/descheduler/pull/488
If there's any interest, feel free to read through some of the feedback there and pick this up as it's been on the back burner for a while. But, it would be great to get this rolling again.
So, if you don't mind, we can close this issue and keep the discussion for this feature in one spot (#489)
This is cool! I didn't notice that. Do you need any help on your PR? I can lend a hand to lift & shift this feature.
You mentioned runMode takes 2 options: Default and Informed. For this requirement, I thought we could pass something like a Mixed option: the Descheduler should start as a Deployment for periodic runs. In addition to this, it should also watch Kubernetes events using SharedIndexInformer.
The motivation here is to keep the Descheduler running on its schedule as-is, plus event watching with informers.
As a caveat, we cannot run as a cron job anymore, since event watching with informers requires a long-lived process.
Hi ~ @damemi and @denkensk. In my opinion, the strategies can be separated into two types: one related to pods, another related to nodes. So would it be better to only care about the pod informer and the node informer? For the HPA or Deployment informers, we actually only care about the pod and the node, like below:
```go
podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc:    runPodrelatedStrategy,
	UpdateFunc: runPodrelatedStrategy,
	DeleteFunc: runPodrelatedStrategy,
})
nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc:    runNoderelatedStrategy,
	DeleteFunc: runNoderelatedStrategy,
})
```
Then we can create one interface for all strategies, and they can have a run() method for default mode and an informed() method for informed mode.
runPodrelatedStrategy and runNoderelatedStrategy would run as informer methods.
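The interface idea could look roughly like this; the method names and the example strategy are assumptions for illustration, not the actual descheduler API:

```go
package main

// Strategy is a hypothetical interface: every strategy keeps a Run()
// for the default (periodic) mode, and additionally exposes informer
// hooks for informed mode. A strategy only reacts to the events it
// actually cares about.
type Strategy interface {
	Name() string
	Run()                       // default mode: invoked by the periodic loop
	OnPodEvent(obj interface{}) // informed mode: called by the shared pod informer
	OnNodeEvent(obj interface{}) // informed mode: called by the shared node informer
}

// nodeTaintsStrategy is a hypothetical node-related strategy: it reacts
// to node events and treats pod events as no-ops.
type nodeTaintsStrategy struct{ runs int }

func (s *nodeTaintsStrategy) Name() string            { return "RemovePodsViolatingNodeTaints" }
func (s *nodeTaintsStrategy) Run()                    { s.runs++ }
func (s *nodeTaintsStrategy) OnPodEvent(interface{})  {} // not pod-related: no-op
func (s *nodeTaintsStrategy) OnNodeEvent(interface{}) { s.runs++ }
```

The dispatcher could then fan node informer events out to every registered strategy, and pod-only strategies simply ignore them.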
Another alternative way to trigger the descheduler would be using custom endpoints, such as: [GET] /api/v1/trigger
Kindly ping @damemi 🤞 (https://github.com/kubernetes-sigs/descheduler/pull/488#issuecomment-1017461730)
So sorry for the delay... I had actually started working on a response to this and got completely distracted!
To answer your questions:
You mentioned runMode takes 2 options: Default and Informed. For this requirement, I thought we could pass something like a Mixed option: the Descheduler should start as a Deployment for periodic runs. In addition to this, it should also watch Kubernetes events using SharedIndexInformer.
@Dentrax this seems like a fine option for me. I'm not entirely sure why I originally made the run mode design mutually exclusive. I think this may bring up some challenges such as a timed strategy running at the same time as an informed strategy, but this could probably be handled with good multithreading practices.
In my opinion, the strategies can be separated into two types: one related to pods, another related to nodes. So would it be better to only care about the pod informer and the node informer? For the HPA or Deployment informers, we actually only care about the pod and the node, like below
@JaneLiuL this is the general idea, yeah. And I think that we should definitely initially start with a basic list of informers (like just pod and node, as you mentioned). However, the implementation details are where this gets tricky. Not all strategies will need all the informers, and in the future new strategies might need different informers.
Then we can create one interface for all strategies, and they can have a run() method for default mode and an informed() method for informed mode.
That's kind of how I was originally trying to address this need, by wrapping the existing default mode functions within a new interface. That interface can be defined with the generic informer functions that will be needed by all strategies, but the actual implementation of the interface can vary for each strategy by only calling upon the relevant informers.
Another alternative way to trigger the descheduler would be using custom endpoints, such as: [GET] /api/v1/trigger
@Dentrax this is an interesting idea, but not one that I think we should immediately pursue. I think it is safer to keep the descheduler reactive to objects and events that are known to the API server. Allowing direct requests like this could bypass the need to know the "state" of the cluster and end up with conflicting or unpredictable results.
Ultimately I'd like to revisit my implementation, or if there are alternatives/cherry-picks that others would like to work on, please feel free. I think we are converging on a similar basic idea to the design that was started there.
I’m concerned about how this would work out when adding multiple nodes. It might be nice to think about implementing a grace period. For example, if I were to add multiple nodes with a few seconds between each, it should wait for a while to ensure all nodes are in the equation.
An initial delay (e.g. 30s) after each join event is one way to resolve this issue. Queuing them will not cut it for our use case, considering it's common for us to add 40-50 nodes.
We could reset the countdown after each node join/update/delete event, and run the descheduler once the delay time of the last joined node is completed.
Hey @damemi, thanks for clarifying. One point that I might miss is that I want to pass Informed mode to entire DeschedulerPolicy. Additionally, configuring each DeschedulerStrategy to set StrategyRunMode individually would be great. Eventually, here is an example config that covers the idea:
- `runMode: both` represents that the policy will run on a periodic loop and also handle informers to react immediately.
- if we pass `runMode: default` to any strategy, it overrides the policy's, and will not get triggered in case of informer events. (would it be overengineering?)
- if we pass `runMode: informed` to any strategy, it overrides the policy's, and will not get triggered on periodic loops.
- if we didn't specify a `runMode` for a strategy, the strategy will use the policy's run mode.
- Since #489 implements the `runMode` just for the `RemovePodsViolatingNodeAffinity` and `RemovePodsViolatingNodeTaints` strategies, I showed only two strategies. Should we want to add this run mode field for each strategy?
- concern: if we plan to add new `runMode`s in the near future, would using the `both` type be shortsighted?
```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
runMode: both
strategies:
  "RemovePodsViolatingNodeAffinity":
    enabled: true
    runMode: default
  "RemovePodsViolatingNodeTaints":
    enabled: true
    runMode: informed
```
At the end of the day, as @necatican pointed out, we expect the descheduler to run each strategy when a new node is joined/removed/updated, just like a cronjob.
That's a good point, I think a configurable batch-delay would help the informed strategies run more efficiently and effectively for use cases regarding high frequency periods of updates.
Regarding each strategy, I think this will still need to consider each strategy on a case-by-case basis. For simplicity and safety, I would prefer to roll this out in just one or two strategies to start (with the intent of adding more/all later) to give the design soak time. If we do that, then we won't need a policy-level runMode field until a significant portion of the strategies have implemented an informed option (at that time it can be easily added).
This would set us up for a more backwards-friendly development pattern (adding new fields is always backward-compatible, but if at some point we decide to change or remove the policy-level runMode this could break existing configs). wdyt?
I would prefer to roll this out in just one or two strategies to start (with the intent of adding more/all later) to give the design soak time.
+1, sounds good to me!
I think a configurable batch-delay would help the informed strategies run more efficiently
So we need to create a new batchTime field in duration format, and a timer variable in the main state that listens for the informer events and resets after each new event occurs. If the timer reaches or exceeds batchTime, the descheduler should run the strategies?
```
batchTime: 30

|t2 - t1|: 15
|t3 - t2|: exceeds batch time

t      t1              t2              t3
|----------------------------------------------> timeline
       ^               ^               ^
       |               |               |
  node1 joined    node2 joined    no nodes joined
  (reset timer)   (reset timer)   (run strategies)
```
Some questions that might need some clarification:
- Should we create a new `informer{}` struct for informer strategies and put fields in it such as `enabled`, `batchTime`, etc.? There may be requirements for adding informer configs in the future.
- Which events should we listen to in order to reset the timer? (Join, delete, update, etc.)
- Should we allow passing `batchTime` for each strategy individually? If so, we might consider creating separate timers for each strategy instead of keeping one timer in the main state. In this case:
- Should we only run the strategy whose timer exceeded the threshold? vs. running all informer strategies, as follows:
```
Strategy1 --> batchTime: 10
Strategy2 --> batchTime: 30

|t2 - t1|: 15 --> Run **only** Strategy1
|t3 - t2|: 35 --> Run **only** Strategy2

vs.

|t2 - t1|: 15 --> Run **ALL** informed strategies

t      t1              t2              t3
|----------------------------------------------> timeline
(same as above)
```
- What about the field name: `batchTime` vs `batchDelay`?
PTAL @necatican
Which events should we listen to in order to reset the timer? (Join, delete, update, etc.)
Join and leave events are the obvious ones for sure, we also should consider adding support for status changes and labels/taints.
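If update events are included, one pitfall is that node status heartbeats also arrive as updates. The update handler could compare only the fields we care about before resetting the timer. A sketch with labels and taints modelled as plain maps/slices for illustration (a real handler would compare the corresponding fields on `*corev1.Node` objects; the function name is an assumption):

```go
package main

import "reflect"

// relevantNodeChange reports whether a node update should reset the
// batch timer. Instead of reacting to every status heartbeat, we only
// compare the fields relevant to descheduling decisions: labels and
// taints (modelled here as a map and a slice of taint keys).
func relevantNodeChange(oldLabels, newLabels map[string]string, oldTaints, newTaints []string) bool {
	return !reflect.DeepEqual(oldLabels, newLabels) ||
		!reflect.DeepEqual(oldTaints, newTaints)
}
```

Wired into an `UpdateFunc`, this would drop the frequent no-op updates that every node produces, so the debounce window only restarts on meaningful changes.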
Should we allow passing batchTime for each strategy individually? If so, we might consider creating separate timers for each strategy instead of keeping one timer in the main state.
This could be a nice feature but I can not think about an instance where this is necessary.
What about the field name: `batchTime` vs `batchDelay`?
How about batchWindow or batchGracePeriod? batchTime is not explanatory enough. Also, while we're using it to set a delay, the actual delay duration will not be equal to `batchDelay`, which could cause some confusion. @damemi might have a better idea.
Should we consider adding a check to determine the maximum allotted time? What happens if changes keep occurring in a cluster?
Kindly ping @damemi. Waiting for your final command, boss, so we can get to it!
Sorry for the delay @Dentrax I think that this could be a good aspect of a potential refactor for the descheduler to make it more like a "descheduler framework". I'm trying to gather some feedback on that idea and how we can correlate a couple big projects like this under one goal. If you're interested, it would be great if you could add some input in the doc or issue here: https://github.com/kubernetes-sigs/descheduler/issues/753
Thank you so much for informing us! I'm definitely interested and want to take a look at that proposal.
On the other hand, we'd like to send a PoC PR for this proposal if you want. It's obvious there will be needs we haven't covered/proposed here. At least we can continue to work on this idea and get feedback/review after submitting a PR.
/cc
@Dentrax sure thing, it never hurts to send a poc PR to get some more detail discussion rolling. Thanks!
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle frozen
Hey @damemi @ingvagabund,
Although we couldn't contribute to the Descheduler Framework migration, we followed closely what was done. Months have passed since the initial proposal, and the internal code base has changed. We're running Deschedulers in high-scale environments and are still looking forward to this feature, and we want to get into it. In the discussion above, I mostly clarified my concerns about what we're going to do and how. The remaining question is where we are going to put the logic in the current codebase.
The plugins directory is currently brand-new, and I'm not sure if this proposal counts as a plugin. I think it is something that should be placed at the execution level rather than the plugin level. I'm a bit confused and lost here.
Can you please enlighten us about where we should write the actual logic in the current codebase? Thanks!
cc'ing @eminaktas for awareness
@Dentrax thanks for bumping this. This is definitely one of the features we had in mind with the framework refactor.
You're right that this will be an execution-level change rather than a "plugin". But, it will still need to be extendable/modifiable in similar ways to the plugins (for example, writing a new strategy plugin that needs to be triggered on custom events). In the old codebase there wasn't a good mechanism for wiring this kind of information through to the strategies that need it.
For now, we should focus on migrating the existing code into the framework model so we don't overload ourselves. I think the big blocker for this will be migrating the internal code to a plugin-registry (right now the strategies are migrated to "plugins", but the internals are still basically just wrapping them in the old code). With that, we will be able to start working on an event registry that can be mapped to corresponding plugins.
If you want to put together a preliminary design for how this could work in the new codebase, please do! Having this discussion now, while not a primary focus, will still help inform design decisions for the rest of the framework migration.
Dropping a concern so I don't forget! One small thing worth mentioning is that there could be a slight time window between a Node delete event and the descheduling cycle. There is a possibility of a race condition between the Scheduler and the Descheduler:
An example case is:
- t0 = descheduler runs every node create/delete event
- t1 = two nodes are deleted
- t2 = Scheduler will try to schedule Pods to other nodes
- t2 = Descheduler will try to deschedule Pods to other nodes
My idea here is that we can listen for the scheduler to check whether there is any ongoing scheduling event. (I don't know how.) As soon as a Node gets deleted, we should NOT trigger the descheduler; instead we can wait a while until all scheduling work is done (all missing Pods up & running), plus a grace period.
Does it make sense? 🤔