
Support targeted pod deletion in the elastic training scenario

Open Jeffwan opened this issue 2 years ago • 15 comments

Currently, as I understand it, the operator doesn't support scaling down specific pods; a replica delta alone doesn't give us that granular control. If we plan to use spot instances, then when an instance goes down we will receive a notification from the node terminator. In that case, we want to scale in the workers on that node instead of sequentially or randomly removing replicas to reach the desired number.

One example: a job has 4 workers scheduled on two different nodes.

NAME       READY   STATUS    RESTARTS   AGE   IP           NODE    NOMINATED NODE   READINESS GATES
worker-a   1/1     Running   0          29m   10.1.8.181   node1   <none>           <none>
worker-b   1/1     Running   0          29m   10.1.8.187   node1   <none>           <none>
worker-c   1/1     Running   0          29m   10.1.8.192   node2   <none>           <none>
worker-d   1/1     Running   0          29m   10.1.8.191   node2   <none>           <none>

When node2 is about to be terminated, the system receives the signal, and a controller component updates the job replicas to 2 and marks worker-c and worker-d for termination. mpi-operator then reconciles the job and removes the targeted workers.

NAME       READY   STATUS    RESTARTS   AGE   IP           NODE    NOMINATED NODE   READINESS GATES
worker-a   1/1     Running   0          29m   10.1.8.181   node1   <none>           <none>
worker-b   1/1     Running   0          29m   10.1.8.187   node1   <none>           <none>

/cc @kubeflow/wg-training-leads @zw0610 @Deliangfan

Jeffwan avatar Aug 20 '21 05:08 Jeffwan

In the scenario you describe, I think the solution goes like this:

  • You cordon the node (this prevents new pods from landing there)
  • You drain the node (removes the pods and the node once all pods are gone)
  • The operator creates a replacement pod that lands somewhere else.

Is this what you are looking for? Or do you want to only have 2 workers after you drain the node?

alculquicondor avatar Aug 20 '21 14:08 alculquicondor

/assign @zw0610

I will add this feature into v1 controller.

zw0610 avatar Aug 20 '21 14:08 zw0610

@zw0610 can we discuss first what the use case is and what the solution could look like? One question I foresee is how we handle the indexes. So far all the indexes are continuous. Would we break this assumption? What is the API going to look like?

Also, the v2 controller is pretty much ready (#373, only missing documentation), so we probably need to implement this in v2, or in both.

alculquicondor avatar Aug 20 '21 14:08 alculquicondor

FYI, we recently introduced a similar feature in k8s ReplicaSet kubernetes/enhancements#2255

And there have been some interesting discussions around it in the past week.

alculquicondor avatar Aug 20 '21 15:08 alculquicondor

@alculquicondor

Is this what you are looking for? Or do you want to only have 2 workers after you drain the node?

That's not the same case. In some scenarios, we don't want to find a replacement at all. When spot capacity is reclaimed, the entire VM market usually lacks resources, so it's very hard to get new cheap instances. Instead, we want to scale in the job rather than keep two tasks pending and waiting for resources.

I looked at the PodDeletionCost feature; some other advanced StatefulSet implementations support scale-in strategies in different ways. Technically, both approaches work, and I think we can find an easy way here. To me, passing the pods to be deleted to the job sounds like the most straightforward way, and we don't need the kind of flexibility the upstream Deployment offers.

Jeffwan avatar Aug 20 '21 16:08 Jeffwan

So we are talking about reducing the parallelism here. Can you answer the questions above then?

So far all the indexes are continuous. Would we break this assumption? What is the API going to look like?

alculquicondor avatar Aug 20 '21 16:08 alculquicondor

To me, passing the pods to be deleted to the job sounds like the most straightforward way, and we don't need the kind of flexibility the upstream Deployment offers.

When we are building an API, it's fine to start small and cover just the use case we already know of. But it's useful to think about what other extensions there could be in the future and to leave the options open for them. That's why I shared the discussion in kubernetes/enhancements#2255, so that we have a view of the problems other people are facing and trying to solve.

alculquicondor avatar Aug 20 '21 17:08 alculquicondor

Also, the sequence of events that leads to the job getting downscaled and how that is communicated is not clear to me.

ahg-g avatar Aug 20 '21 20:08 ahg-g

@zw0610 can we discuss first what the use case is and what the solution could look like? One question I foresee is how we handle the indexes. So far all the indexes are continuous. Would we break this assumption? What is the API going to look like?

Also, the v2 controller is pretty much ready (#373, only missing documentation), so we probably need to implement this in v2, or in both.

There is one thing we have to clarify first: these use cases are limited to Horovod Elastic distributed training tasks. In these cases, the index is meaningless.

@Jeffwan @DeliangFan Regarding the API, I would suggest two options (option 1 is sketched right after this list):

  1. convert the desired deletion sequence into a string and update the annotation of the MPIJob with a specific key like "deletion-sequence": "mnist-horovod-elastic-4, mnist-horovod-elastic-2"
  2. use the controller.kubernetes.io/pod-deletion-cost annotation mentioned above to specify the order in which to delete pods
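
For concreteness, option 1 could look roughly like this on the controller side. This is only a sketch; the annotation key name and the helper function are illustrative, not an existing mpi-operator API:

```go
// Hypothetical sketch of option 1 only: the MPIJob carries an annotation
// listing the worker pods to remove first on scale-in. The key name and the
// comma-separated format are illustrative, not an agreed-upon API.
package elastic

import "strings"

const deletionSequenceKey = "deletion-sequence" // hypothetical annotation key

// parseDeletionSequence extracts the ordered pod names from the MPIJob annotations.
func parseDeletionSequence(annotations map[string]string) []string {
	raw, ok := annotations[deletionSequenceKey]
	if !ok {
		return nil
	}
	var names []string
	for _, n := range strings.Split(raw, ",") {
		if n = strings.TrimSpace(n); n != "" {
			names = append(names, n)
		}
	}
	return names
}
```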

zw0610 avatar Aug 20 '21 23:08 zw0610

Even if the index is meaningless, we need to clarify the expectations to users. It's also useful for us to have a clear algorithm to manage the indexes so that we don't end up removing Pods that we shouldn't. For example, the current controller removes any pod that has an index greater than the number of replicas. Follow up question: what happens when the user decides to increase the number of replicas again? Do we start from the biggest number or do we fill up the gaps?

As for the API, we might want to re-use the same annotation as kubernetes (so, option 2) so that in the future we can migrate to k8s workload APIs.

alculquicondor avatar Aug 23 '21 14:08 alculquicondor

Even if the index is meaningless, we need to clarify the expectations to users. It's also useful for us to have a clear algorithm to manage the indexes so that we don't end up removing Pods that we shouldn't. For example, the current controller removes any pod that has an index greater than the number of replicas. Follow up question: what happens when the user decides to increase the number of replicas again? Do we start from the biggest number or do we fill up the gaps?

As for the API, we might want to re-use the same annotation as kubernetes (so, option 2) so that in the future we can migrate to k8s workload APIs.

Good point. Here is a more detailed behavior expectation:

  • Assume an MPIJob is being reconciled.
  • When reconciling pods, if the (worker) replicas number is less than the number of existing pods, the reconciler deletes pods in the following order (sketched right after this list):
  1. first, pods whose annotations carry the controller.kubernetes.io/pod-deletion-cost key, ordered by its value
  2. then, the remaining pods from the max index down to 0
  • If the (worker) replicas number is larger than the number of existing pods, the reconciler fills the gaps until the total number of pods matches the replicas.
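
A minimal sketch of this ordering, assuming worker pod names end with their numeric index; the helpers (deletionCost, workerIndex, podsToDelete) are illustrative, not existing mpi-operator code:

```go
// Sketch of the proposed deletion ordering: annotated pods first (lower
// pod-deletion-cost first), then the remaining pods from max index down to 0.
package elastic

import (
	"sort"
	"strconv"
	"strings"

	corev1 "k8s.io/api/core/v1"
)

const podDeletionCost = "controller.kubernetes.io/pod-deletion-cost"

// deletionCost returns the annotation value and whether it was set and valid.
func deletionCost(p *corev1.Pod) (int, bool) {
	v, ok := p.Annotations[podDeletionCost]
	if !ok {
		return 0, false
	}
	c, err := strconv.Atoi(v)
	if err != nil {
		return 0, false
	}
	return c, true
}

// workerIndex assumes the pod name ends with "-<index>".
func workerIndex(p *corev1.Pod) int {
	parts := strings.Split(p.Name, "-")
	i, _ := strconv.Atoi(parts[len(parts)-1])
	return i
}

// podsToDelete picks `excess` victims following the ordering above.
func podsToDelete(pods []*corev1.Pod, excess int) []*corev1.Pod {
	sorted := append([]*corev1.Pod{}, pods...)
	sort.SliceStable(sorted, func(i, j int) bool {
		ci, oki := deletionCost(sorted[i])
		cj, okj := deletionCost(sorted[j])
		switch {
		case oki && okj:
			return ci < cj // cheaper pods are deleted first
		case oki != okj:
			return oki // annotated pods before unannotated ones
		default:
			return workerIndex(sorted[i]) > workerIndex(sorted[j]) // max index to 0
		}
	})
	if excess > len(sorted) {
		excess = len(sorted)
	}
	return sorted[:excess]
}
```

On scale-down, the reconciler would then delete podsToDelete(workers, len(workers)-replicas).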

zw0610 avatar Aug 24 '21 06:08 zw0610

When we are building an API, it's fine to start small and cover just the use case we already know of. But it's useful to think about what other extensions there could be in the future and to leave the options open for them. That's why I shared the discussion in kubernetes/enhancements#2255, so that we have a view of the problems other people are facing and trying to solve.

I was out of town and just came back. Agreed, we should avoid a one-way door. As I said, both ways work and there are tradeoffs. For example, the "controller.kubernetes.io/pod-deletion-cost" feature makes some assumptions:

  1. ReplicaSets. Only ReplicaSets, or workloads built on top of them, can leverage the feature.
  2. Clients have to accept updating the pods, and need permission to update pods rather than just operating on the job itself. Normally, we don't want users to touch the resources underneath the job.
  3. Scalability. Clients may need to update up to all of the pods in order to delete a few of them, which puts a lot of pressure on the apiserver and is not acceptable.
  4. Feature state. pod-deletion-cost is beta in 1.22. My company internally uses 1.18 and cannot upgrade in the short term; we will probably move to 1.22 around next June at the earliest. I think some other companies are in similar situations.

In my opinion, it's hard to fulfill the above requirements in most scenarios. That's why I said adding new fields to the API is the most straightforward way. If native Kubernetes workloads support this case in the near future, users can bump the job apiVersion just like we did for v2, and I think that's acceptable to most users.
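
Purely as an illustration of what a first-class field could look like (neither the type nor the field exists in the current API):

```go
// Purely illustrative: a hypothetical MPIJob-level field for targeted scale-in.
package elastic

// ElasticPolicy is a hypothetical addition to the MPIJob spec.
type ElasticPolicy struct {
	// PodsToDelete lists the worker pods the controller should remove first
	// the next time the job is scaled in.
	PodsToDelete []string `json:"podsToDelete,omitempty"`
}
```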

Indexing is a separate story from the interface. The things we need to take care of are status updates, host config updates, backoff, etc. We probably need a deeper dive to see which way is better: starting from new indexes or filling the gaps.

Jeffwan avatar Aug 25 '21 05:08 Jeffwan

Clarifying on (1) and (4): I wasn't suggesting we "use" the existing pod-deletion-cost feature, as it's only available for ReplicaSets in 1.22+. I was just suggesting we use the same API so that in the future we could migrate worker Pod management to a k8s workload API with less disruption to users.

In any case, I think you are presenting fair points in favor of an API at the MPIJob level. The next question is whether we want an annotation or a field.

I think it's worth having a proposal in the proposals/ folder (or perhaps update the existing one for elastic support). Can someone work on that?

alculquicondor avatar Aug 25 '21 13:08 alculquicondor

Got you.

I think it's worth having a proposal in the proposals/ folder (or perhaps update the existing one for elastic support). Can someone work on that?

Agreed. A lot of details need to be finalized. It seems @zw0610 is interested in this story; let's see if he wants to drive the effort. I can help if needed.

Jeffwan avatar Aug 25 '21 16:08 Jeffwan

Sure. I shall update the elastic proposal in mpi-operator.

zw0610 avatar Aug 26 '21 03:08 zw0610