pytorch-operator
[feature] Rethink distributed PyTorch backoff retry
Currently, the backoff retries in all replicas are controlled separately. A failed replica is either restarted directly by the pod controller (restartPolicy is propagated from the ReplicaSpec to the PodTemplateSpec):
https://github.com/kubeflow/pytorch-operator/blob/037cd1b18eb77f657f2a4bc8a8334f2a06324b57/pkg/controller.v1/pytorch/pod.go#L283-L289
or restarted by the PyTorchJob controller:
https://github.com/kubeflow/pytorch-operator/blob/037cd1b18eb77f657f2a4bc8a8334f2a06324b57/pkg/controller.v1/pytorch/pod.go#L91-L109
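For the first path, the propagation is essentially a copy of the replica-level policy onto the generated pod template. A minimal sketch, using local stand-in types rather than the real kubeflow API (`setRestartPolicy` here is illustrative, not the exact function in pod.go):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// RestartPolicy mirrors the ReplicaSpec.RestartPolicy field
// (illustrative local copy, not the real kubeflow type).
type RestartPolicy string

const (
	RestartPolicyAlways    RestartPolicy = "Always"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyNever     RestartPolicy = "Never"
	RestartPolicyExitCode  RestartPolicy = "ExitCode"
)

// setRestartPolicy copies the replica-level policy onto the generated pod
// template. ExitCode is special-cased: the job controller decides whether
// to restart, so the kubelet must not restart the pod on its own.
func setRestartPolicy(podTemplate *corev1.PodTemplateSpec, policy RestartPolicy) {
	if policy == RestartPolicyExitCode {
		podTemplate.Spec.RestartPolicy = corev1.RestartPolicyNever
	} else {
		podTemplate.Spec.RestartPolicy = corev1.RestartPolicy(policy)
	}
}
```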
The backOffLimit then controls the sum of failures across all replicas that are eligible for restart:
https://github.com/kubeflow/pytorch-operator/blob/037cd1b18eb77f657f2a4bc8a8334f2a06324b57/pkg/controller.v1/pytorch/controller.go#L518-L556
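Roughly, that check sums container restart counts over every pod of the job and compares the total against backOffLimit. A simplified sketch (the real code also filters by each replica's restartPolicy):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// totalRestarts sums container restart counts over all pods of the job,
// regardless of replica type. This total is what gets compared against
// backOffLimit, which is why one whole-job failure is charged (1 + n) times.
func totalRestarts(pods []*corev1.Pod) int32 {
	var total int32
	for _, pod := range pods {
		for _, status := range pod.Status.ContainerStatuses {
			total += status.RestartCount
		}
	}
	return total
}

// pastBackoffLimit reports whether the job has exhausted its retry budget.
func pastBackoffLimit(backoffLimit int32, pods []*corev1.Pod) bool {
	return totalRestarts(pods) >= backoffLimit
}
```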
I'm not sure how this per-replica backoff retry helps when retrying distributed PyTorch. Usually people just do AllReduce across all replicas. If either the master or any worker fails, the other pods fail together with it: the master process loses its connection to a worker and fails, and hence all workers lose their connection to the master. A backOffLimit that counts the sum of failures across replicas then becomes awkward: with 1 master and n workers, backOffLimit has to be set to (1+n) * k if we want the distributed PyTorch job to restart k times.
I believe we should restart all replicas together, and use backOffLimit to control the number of whole-job restarts.
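A rough sketch of what that could look like, assuming a hypothetical job-level restart counter in the PyTorchJob status (none of the names below are existing API):

```go
package sketch

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// reconcileGangRestart sketches the proposed behaviour: when any replica
// fails, delete every pod of the job so master and workers restart together,
// and charge a single increment against backOffLimit. jobRestarts is a
// hypothetical counter that would live in the PyTorchJob status.
func reconcileGangRestart(ctx context.Context, client kubernetes.Interface,
	namespace string, pods []*corev1.Pod, jobRestarts *int32, backoffLimit int32) error {

	anyFailed := false
	for _, pod := range pods {
		if pod.Status.Phase == corev1.PodFailed {
			anyFailed = true
			break
		}
	}
	if !anyFailed {
		return nil
	}

	if *jobRestarts >= backoffLimit {
		return fmt.Errorf("job reached backoffLimit=%d, marking it failed", backoffLimit)
	}

	// Delete all pods; the controller recreates the full replica set on the
	// next sync, so the whole job restarts as a gang.
	for _, pod := range pods {
		if err := client.CoreV1().Pods(namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	*jobRestarts++
	return nil
}
```

With something along these lines, backOffLimit counts whole-job restarts directly, so allowing k retries just means backOffLimit: k regardless of the number of workers.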
/cc @johnugeorge
@czheng94 I think elastic training is a hot topic this year. Ref https://github.com/pytorch/elastic
Thus I am not sure whether we should do this here.
@gaocegege Do you mean changing the distributed launch to use the new "elastic" way?
@gaocegege elastic training is really cool!
I'm not very familiar with torch elastic (but interested to learn more). Are we thinking about supporting a separate operator for pytorch elastic like this one https://github.com/pytorch/elastic/blob/master/kubernetes/api/v1alpha1/elasticjob_types.go? Or are we going to integrate this into pytorch-operator?
In addition, I'm not sure whether torch elastic will eventually replace the current distributed PyTorch setup. Torch elastic seems to carry more dependencies and requirements (etcd, users needing to change their code to always preserve training state, users needing to tolerate a varying batch size, etc.).
I wrote the PyTorch elastic controller in the pytorch/elastic repo. Overall, it's very similar to the existing operators. However, we cannot reuse it seamlessly here because it requires a strongly consistent store and the way it sets up the clusterEnv is a little bit different.
I suggest using it separately at the moment. Meanwhile, we should consider elasticity for all frameworks and find a general way to handle it that hides the framework details.