
[feature] Rethink distributed Pytorch backoff retry

Open czheng94 opened this issue 4 years ago • 6 comments

Currently, the backoff retries for all replicas are controlled separately. A failed replica is either restarted directly by the pod controller (the restartPolicy is propagated from the ReplicaSpec to the PodTemplateSpec) https://github.com/kubeflow/pytorch-operator/blob/037cd1b18eb77f657f2a4bc8a8334f2a06324b57/pkg/controller.v1/pytorch/pod.go#L283-L289 or restarted by the PyTorchJob controller https://github.com/kubeflow/pytorch-operator/blob/037cd1b18eb77f657f2a4bc8a8334f2a06324b57/pkg/controller.v1/pytorch/pod.go#L91-L109
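For reference, here is a rough, hypothetical Go sketch of the first restart path (the types and the `setRestartPolicy` helper are simplified stand-ins, not the actual operator code):

```go
// Hypothetical, simplified sketch of propagating a replica's restartPolicy
// into the pod template. Types are local stand-ins, not the operator's own.
package main

import "fmt"

type RestartPolicy string

const (
	RestartPolicyNever     RestartPolicy = "Never"
	RestartPolicyOnFailure RestartPolicy = "OnFailure"
	RestartPolicyExitCode  RestartPolicy = "ExitCode"
)

type ReplicaSpec struct {
	RestartPolicy RestartPolicy
}

type PodTemplateSpec struct {
	RestartPolicy RestartPolicy
}

// setRestartPolicy mirrors the idea described above: an "ExitCode" policy is
// handled by the job controller itself (so the pod-level policy becomes
// "Never"), while anything else is passed straight through to the pod
// template and the pod controller restarts the container in place.
func setRestartPolicy(tpl *PodTemplateSpec, spec *ReplicaSpec) {
	if spec.RestartPolicy == RestartPolicyExitCode {
		tpl.RestartPolicy = RestartPolicyNever
	} else {
		tpl.RestartPolicy = spec.RestartPolicy
	}
}

func main() {
	tpl := &PodTemplateSpec{}
	setRestartPolicy(tpl, &ReplicaSpec{RestartPolicy: RestartPolicyOnFailure})
	fmt.Println(tpl.RestartPolicy) // OnFailure: the pod controller restarts it
}
```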

The backOffLimit then caps the sum of failures across all replicas that are subject to restart. https://github.com/kubeflow/pytorch-operator/blob/037cd1b18eb77f657f2a4bc8a8334f2a06324b57/pkg/controller.v1/pytorch/controller.go#L518-L556
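To make the accounting concrete, here is a minimal, hypothetical sketch of that check (the `ReplicaStatus` and `pastBackoffLimit` names are illustrative, not the controller's actual types):

```go
// Hypothetical sketch of the backoff accounting described above: failures are
// summed across every replica whose policy allows restarts, and the job is
// considered past its limit once that sum reaches backOffLimit.
package main

import "fmt"

type ReplicaStatus struct {
	Restarting bool  // restart policy allows this replica to be restarted
	Failures   int32 // observed failures / restarts for this replica type
}

func pastBackoffLimit(backoffLimit int32, replicas map[string]ReplicaStatus) bool {
	var total int32
	for _, st := range replicas {
		if !st.Restarting {
			continue
		}
		total += st.Failures
	}
	return total >= backoffLimit
}

func main() {
	replicas := map[string]ReplicaStatus{
		"Master": {Restarting: true, Failures: 1},
		"Worker": {Restarting: true, Failures: 3},
	}
	// One master and three workers failing together consume four units of the
	// budget for what is logically a single retry.
	fmt.Println(pastBackoffLimit(4, replicas)) // true
}
```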

I'm not sure how this per-replica backoff retry helps with retrying distributed PyTorch. Usually people just run AllReduce across all replicas. If the master or any worker fails, the other pods fail along with it: the master loses its connection to the failed worker and exits, and the remaining workers then lose their connection to the master. In that situation, a backOffLimit that counts the sum of failures across replicas becomes awkward. With 1 master and n workers, backOffLimit has to be set to (1+n) * k if we want the distributed PyTorch job to restart k times; for example, with 1 master and 3 workers, allowing 2 retries already requires backOffLimit = 8.

I believe we should restart all replicas together, and use backOffLimit to control the number of times the whole replica set is restarted. A rough sketch of what I mean is below.
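This is only an illustration of the proposed semantics, not an implementation; `JobState` and `onReplicaFailure` are made-up names:

```go
// Hypothetical sketch of the proposal: any single replica failure triggers one
// gang restart of all replicas, and backOffLimit counts these group restarts
// rather than per-pod failures.
package main

import "fmt"

type JobState struct {
	GroupRestarts int32 // times the whole replica set has been restarted
	BackoffLimit  int32 // maximum allowed group restarts
}

// onReplicaFailure restarts every replica together and consumes exactly one
// unit of the backoff budget; it reports true once the budget is exhausted.
func onReplicaFailure(job *JobState, restartAll func()) (failed bool) {
	if job.GroupRestarts >= job.BackoffLimit {
		return true // exceeded the limit: mark the PyTorchJob as failed
	}
	restartAll()
	job.GroupRestarts++
	return false
}

func main() {
	job := &JobState{BackoffLimit: 2}
	restartAll := func() { fmt.Println("restarting master and all workers together") }

	for i := 0; i < 3; i++ {
		if onReplicaFailure(job, restartAll) {
			fmt.Println("backOffLimit exceeded; job failed")
		}
	}
}
```

With backOffLimit = 2, the job is restarted as a group twice regardless of how many replicas it has, which matches the k-retries intent above.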

czheng94 avatar Apr 21 '20 14:04 czheng94


/cc @johnugeorge

gaocegege avatar Apr 27 '20 02:04 gaocegege

@czheng94 I think elastic training is a hot topic this year. Ref https://github.com/pytorch/elastic

Thus I am not sure if we should do it.

gaocegege avatar Apr 27 '20 02:04 gaocegege

@gaocegege Do you mean changing the distributed launch to use the new "elastic" approach?

johnugeorge avatar Apr 27 '20 10:04 johnugeorge

@gaocegege elastic training is really cool!

I'm not very familiar with torch elastic (but interested to learn more). Are we thinking about supporting a separate operator for pytorch elastic like this one https://github.com/pytorch/elastic/blob/master/kubernetes/api/v1alpha1/elasticjob_types.go? Or are we going to integrate this into pytorch-operator?

In addition, I'm not sure whether torch elastic will eventually replace the current distributed PyTorch setup. Elastic seems to carry more dependencies (etcd, users having to change their code to always preserve training state, users having to tolerate a varying batch size, etc.).

czheng94 avatar Apr 27 '20 14:04 czheng94

I wrote the pytorch elastic controller in the pytorch/elastic repo. Overall, it's very similar to the existing operators. However, we cannot use it seamlessly here because it requires a strongly consistent store, and the way clusterEnv is set up is a little different.

I suggest using it separately for now. Meanwhile, we should consider elasticity for all frameworks and find a general way to handle it that hides framework details.

Jeffwan avatar May 18 '20 21:05 Jeffwan