support elastic cluster
There are two Kubernetes clusters in our company: a k8s cluster in our IDC, and a TKE cluster in a public cloud (Tencent Cloud). We use Cluster Autoscaler in TKE, so when there are pending pods, TKE scales a new node into the cluster; the TKE cluster is therefore elastic.
In this scenario, our scheduling requirements are:
- The pods of a deployment can only be deployed in one cluster.
- The IDC has higher priority than TKE: if the IDC has enough resources, we deploy the deployment in the IDC first.
- If there are no free resources in the IDC, and none in TKE either (say it has 10 nodes at the moment), we still deploy the deployment in TKE so that the pending pods can trigger a scale-up.
We may need to take elastic clusters into consideration.
It's a reasonable use case to me, and I think it requires two abilities:
- Set cluster priority so the scheduler prefers the cluster(s) in the IDC.
- Give the scalable cluster lower priority, but treat it as always having sufficient resources.
@Garrybest any idea about this case and ways to achieve it?
- Scheduling based on priority: We have already designed an API here: https://github.com/karmada-io/karmada/pull/842. When some clusters have high priority, Karmada prefers to divide replicas among those clusters (see the sketch after this list).
- Max clusters: Currently we can't set a maximum number of clusters when scheduling. In your scenario, you don't want workloads separated across multiple clusters. Actually, I think this is a really common case, since many offline-computing workloads, like AI or big data, are not allowed to be divided across multiple clusters. It would be nice to add a MaxCluster field to the API in the future.
- Cluster autoscaling: This is really interesting. Unfortunately, dynamic scheduling cares about how many replicas are available: if a cluster cannot accommodate that many replicas, the scheduler will fail to schedule them and publish an error event, a bit like gang-scheduling. So I think we could introduce a MinReadyReplicas API to tell the scheduler under which circumstances a workload can still be scheduled when resources are not enough.
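As a rough illustration of the priority idea (a sketch only; not necessarily the exact API proposed in #842, and the cluster names are placeholders), a PropagationPolicy can already express weighted replica division so that the IDC cluster receives the larger share:
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: weighted-policy
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
  placement:
    replicaScheduling:
      replicaSchedulingType: Divided
      replicaDivisionPreference: Weighted
      weightPreference:
        staticWeightList:
          # give the IDC cluster a larger static weight so it is preferred
          - targetCluster:
              clusterNames:
                - idc-cluster
            weight: 2
          - targetCluster:
              clusterNames:
                - tke-cluster
            weight: 1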
@Garrybest My department works on machine learning platforms, and my workloads are TFJob/PyTorchJob/vcjob (Volcano).
Is it useful to make sure that the workload cannot be separated across multiple clusters?
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: example-policy
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
    spreadConstraints: ## restrict the scheduling result to a single cluster
      - spreadByField: cluster
        maxGroups: 1
        minGroups: 1
In my opinion, it may be better to add a field to cluster.spec, such as elastic=true or backup=true, and change the scheduler logic: if no cluster has enough resources to run the workload, we can use the backup/elastic cluster to run it, regardless of whether it currently has free resources.
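A sketch of what that proposal might look like (the elastic field is hypothetical and not part of the current Cluster API; the cluster name is a placeholder):
apiVersion: cluster.karmada.io/v1alpha1
kind: Cluster
metadata:
  name: tke-cloud
spec:
  # hypothetical flag: mark this member as an elastic/backup cluster that the
  # scheduler only falls back to when no other cluster has enough resources
  elastic: true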
Is it useful to make sure that the workload cannot be separated across multiple clusters?
Yes, but notice that it does not take available replicas into consideration.
In my opinion, it may be better to add a field to cluster.spec, such as elastic=true or backup=true
I'm afraid this field is not abstract enough and focuses on one specific scenario. We prefer to generalize features so they benefit other scenarios as well.
Yes, but notice that it does not take available replicas into consideration.
After #842, people can have their own customized filter plugin to reject clusters with low available resources.
For instance, the customized filter plugin can look at the available resources; once the local cluster (in the IDC) has no resources left, it can exclude the local cluster from the scheduling result, and the scheduler will then fall back to the cluster in the public cloud.
What do you think? @Garrybest @qiankunli
I think it could work.
Hi @RainbowMango, the scenario is exactly the same in our lab. Our lab has certain GPU resources, but sometimes, when more computing resources are needed, such as when a project is launched, I prefer to use a public cloud like Alibaba's ASM cluster as an elastic cluster so that TFJob and TorchJob in Volcano can run on the public cloud.
I feel that this scenario is quite common and it is also my research topic. If there is an opportunity, I would also like to contribute to the Karmada project.
You can assign this issue to me, and I'll try to solve this.
Hi @chengleqi, @qiankunli. We just finished the multi-affinity-group feature in the latest v1.5.0. With this feature, we can now declare cluster preference; for example, to prefer the cluster in the local IDC:
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinities:
      - affinityName: local-idc
        clusterNames:
          - local-idc-01
      - affinityName: public-cloud
        clusterNames:
          - public-cloud-01
With this specification, Karmada falls back to the public-cloud group only when the local IDC cluster is unavailable.
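To also meet the original requirement that all replicas land on exactly one cluster, the affinity groups above could be combined with the spreadConstraints shown earlier in this thread; this is just a sketch reusing the same placeholder cluster names:
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinities:          # ordered groups: local IDC first, public cloud as fallback
      - affinityName: local-idc
        clusterNames:
          - local-idc-01
      - affinityName: public-cloud
        clusterNames:
          - public-cloud-01
    spreadConstraints:          # keep all replicas in a single cluster
      - spreadByField: cluster
        maxGroups: 1
        minGroups: 1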
I feel that this scenario is quite common and it is also my research topic. If there is an opportunity, I would also like to contribute to the Karmada project.
Thanks in advance. The next thing we need is a way to evaluate the resources on public clusters, especially those with auto-scaling capability.
We just finished the multi-affinity-group feature in the latest v1.5.0. With this feature, we can now declare cluster preference...
Sounds great, I'll go try it out first.
I'm trying to figure out which PRs/issues should be included in the coming v1.7 release, which is planned for the end of this month. I guess we don't have enough time for this feature, so I'm moving it to v1.8.
/subscribe
This feature would be very awesome to have!
@maaft Thanks for spotting this! Yeah, I like this feature too. Unfortunately, we don't have enough time for it in this release. Let's see if we can plan it for the next release. @maaft Do you want to get involved with this feature?
@RainbowMango has it already been decided how to implement this? If yes, and you can summarize what needs to be done, I can give it a shot.
What speaks against having boolean flags for all resource types?
kind: Cluster
spec:
  elasticResources:
    gpu:
      min: 0
      max: 20
What about rescheduling in case the first cluster has some free resources again?
has it already been decided how to implement this?
Not yet; we need someone to step up and lead the effort to work out a proposal.
What speaks against having boolean flags for all resource types?
It makes sense to me; specifying resource names can help the scheduler make the right decisions.
What about rescheduling in case the first cluster has some free resources again?
I guess that's another topic: how to rebalance resources between clusters. It might depend on whether we can declare user expectations in the PropagationPolicy. I feel this is a separate feature.