
support elastic cluster

Open qiankunli opened this issue 3 years ago • 15 comments

There are two Kubernetes clusters in our company: one in our IDC, and a TKE cluster at a cloud provider (Tencent Cloud). We use the cluster autoscaler in TKE: if there are pending pods, TKE scales a new node into the cluster, so the TKE cluster is elastic.

In this scenario, our scheduler requirements are:

  1. The pods of a deployment can only be deployed in one cluster.
  2. The IDC has higher priority than TKE: if the IDC has enough resources, we deploy the deployment to the IDC first.
  3. If there are no free resources in the IDC and none in TKE either (say, 10 nodes at the moment), we deploy the deployment to TKE so that the pending pods can trigger a scale-up.

We may need to take elastic clusters into consideration.

qiankunli avatar Sep 27 '22 06:09 qiankunli

It's a reasonable use case for me. And I think it requires two abilities:

  • Set cluster priority so the scheduler prefers the cluster(s) in the IDC.
  • Give the scalable cluster a lower priority, while treating it as always having sufficient resources.

@Garrybest any idea about the case and ways to achieve that?

RainbowMango avatar Sep 28 '22 03:09 RainbowMango

  1. Scheduling based on priority: we have already designed an API here https://github.com/karmada-io/karmada/pull/842. When some clusters have high priority, Karmada prefers to divide replicas among those clusters.

  2. Max clusters: right now we can't set a maximum number of clusters when scheduling. In your scenario, you don't want workloads spread across multiple clusters. Actually, I think this is a really common case, since many offline-computing workloads, like AI or big data, must not be divided across multiple clusters. It would be nice to add a MaxCluster field to the API in the future.

  3. Cluster autoscaling: this is really interesting. Unfortunately, dynamic scheduling only cares about how many replicas are currently available; if a cluster cannot accommodate all the replicas, the scheduler will fail to schedule them and publish an error event, which is a bit like gang scheduling. So I think we could introduce a MinReadyReplicas API to tell the scheduler under which circumstances a workload can still be scheduled when resources are not enough.
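
Purely as an illustration of point 3, a hypothetical minReadyReplicas field could look something like this (the field name and semantics are placeholders, not part of the current API):

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: example-policy
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: example
  placement:
    # hypothetical field: allow the scheduler to pick a cluster that can
    # only fit this many replicas right now, relying on cluster autoscaling
    # to make room for the rest
    minReadyReplicas: 2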

Garrybest avatar Sep 28 '22 06:09 Garrybest

@Garrybest my department runs machine learning platforms, and my workloads are TFJob/PyTorchJob/VCJob (Volcano).

Is the following useful to make sure that the workload cannot be split across multiple clusters?

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: example-policy
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
  placement:
    clusterAffinity:
      clusterNames:
        - member1
        - member2
    spreadConstraints:    # restrict the scheduling result to a single cluster
    - spreadByField: cluster
      maxGroups: 1
      minGroups: 1

In my opinion, it may be better to add a field to cluster.spec, such as elastic=true or backup=true, and change the scheduler logic: if no cluster has enough resources to run the workload, we can use the backup/elastic cluster to run it, regardless of whether it currently has the resources or not.
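
For example, something like this on the Cluster object (the elastic field is only a sketch of the idea, it does not exist in the API):

apiVersion: cluster.karmada.io/v1alpha1
kind: Cluster
metadata:
  name: tke-member
spec:
  # hypothetical field sketching the proposal: only schedule to this
  # cluster when no other cluster has enough free resources, and rely
  # on its autoscaler to add nodes for the pending pods
  elastic: true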

qiankunli avatar Sep 28 '22 09:09 qiankunli

Is the following useful to make sure that the workload cannot be split across multiple clusters?

Yes, but notice that it does not take available replicas into consideration.

In my opinion, it may be better to add a field to cluster.spec, such as elastic=true or backup=true

I'm afraid this field is not abstract enough and focuses on one specific scenario. We prefer to generalize features so they benefit other scenarios as well.

Garrybest avatar Sep 30 '22 06:09 Garrybest

Yes, but notice that it does not take available replicas as consideration.

After #842, people can implement a customized filter plugin to filter out clusters with low available resources.

For instance, the customized filter plugin could look at the available resources; once the local cluster (in the IDC) has no resources left, the plugin can exclude the local cluster from the scheduling results, and the scheduler will then fall back to the cluster in the public cloud.

What do you think? @Garrybest @qiankunli

RainbowMango avatar Oct 24 '22 07:10 RainbowMango

I think it could work.

Garrybest avatar Oct 25 '22 08:10 Garrybest

Hi @RainbowMango, the scenario is exactly the same in our lab. Our lab has a certain amount of GPU resources, but sometimes, when more computing resources are needed, such as when a project is launched, I prefer to use a public cloud like Alibaba's ASM cluster as an elastic cluster, so that TFJob and TorchJob in Volcano can run on the public cloud.

I feel that this scenario is quite common and it is also my research topic. If there is an opportunity, I would also like to contribute to the Karmada project.

You can assign this issue to me, and I'll try to solve this.

chengleqi avatar Mar 01 '23 11:03 chengleqi

Hi @chengleqi, @qiankunli, we just finished the multi-affinity-group feature in the latest v1.5.0. With this feature, we can now declare a cluster preference; for example, we prefer the cluster in the local IDC:

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinities:
      - affinityName: local-idc
        clusterNames:
          - local-idc-01
      - affinityName: public-cloud
        clusterNames:
          - public-cloud-01

With this specification, Karmada only falls back to the public-cloud group when the local IDC cluster is unavailable.
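
To also satisfy the requirement that the workload lands in exactly one cluster, the affinity groups could presumably be combined with the spreadConstraints shown earlier in this thread. A sketch, assuming both fields are set in the same placement:

apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinities:
      - affinityName: local-idc
        clusterNames:
          - local-idc-01
      - affinityName: public-cloud
        clusterNames:
          - public-cloud-01
    spreadConstraints:
      # keep the whole workload in a single cluster
      - spreadByField: cluster
        maxGroups: 1
        minGroups: 1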

RainbowMango avatar Mar 02 '23 14:03 RainbowMango

I feel that this scenario is quite common and it is also my research topic. If there is an opportunity, I would also like to contribute to the Karmada project.

Thanks in advance. The next thing we need is a way to evaluate the resources on public clusters, especially those with auto-scaling capability.

RainbowMango avatar Mar 02 '23 14:03 RainbowMango

We just finished the multi-affinity-group feature in the latest v1.5.0. With this feature, we can now declare a cluster preference [...]

Sounds great, I'll go try it out first.

chengleqi avatar Mar 02 '23 14:03 chengleqi

I'm trying to figure out which PRs/issues should be included in the coming v1.7 release, which is planned for the end of this month. I guess we don't have enough time for this feature, so I'm moving it to v1.8.

RainbowMango avatar Aug 25 '23 07:08 RainbowMango

/subscribe

This feature would be very awesome to have!

maaft avatar Nov 22 '23 10:11 maaft

@maaft Thanks for spotting this! Yeah, I like this feature too. Unfortunately, we don't have enough time for it in this release. Let's see if we can plan it for the next release. @maaft Do you want to get involved with this feature?

RainbowMango avatar Nov 23 '23 03:11 RainbowMango

@RainbowMango has it already been decided how to implement this? If so, and if you can summarize what needs to be done, I can give it a shot.

What speaks against having boolean flags for all resource types?

kind: Cluster
spec:
  elasticResources:
    gpu:
      min: 0
      max: 20

What about rescheduling in case the first cluster has some free resources again?

maaft avatar Nov 23 '23 05:11 maaft

Has it already been decided how to implement this?

Not yet, we need someone to step up and lead the effort to work out a proposal.

What speaks against having boolean flags for all resource types?

Specifying resource names makes sense to me; that can help the scheduler make the right decisions.

What about rescheduling in case the first cluster has some free resources again?

I guess that's another topic: how to rebalance resources between clusters. It might depend on whether we can declare user expectations in the PropagationPolicy. I feel this is a separate feature.

RainbowMango avatar Nov 23 '23 09:11 RainbowMango