[SDK] Use Elastic Policy and torchrun as a Default Entrypoint for PyTorchJob
Currently, we use python3 as the entrypoint when creating a Training Job from a function: https://github.com/kubeflow/training-operator/blob/0b6a30cd348e101506b53a1a176e4a7aec6e9f09/sdk/python/kubeflow/training/utils/utils.py#L230
Since torchrun is the recommended entrypoint for running distributed PyTorch, we should discuss whether we need to change the entrypoint for a PyTorchJob created from a function.
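For illustration, here is a minimal sketch (not the actual SDK code; `build_container_command`, its arguments, and the script path are hypothetical) of how the generated container command could switch from python3 to torchrun:

```python
# Hypothetical sketch of the SDK-side change: the helper name and signature
# are assumptions, not existing training-operator SDK APIs.

def build_container_command(script_path: str, use_torchrun: bool,
                            nproc_per_node: int = 1) -> list:
    """Return the container command for the generated training script."""
    if use_torchrun:
        # torchrun can pick up node count / rendezvous settings from the
        # PET_* environment variables injected by the training operator,
        # so only the per-node process count is passed explicitly here.
        return ["torchrun", f"--nproc_per_node={nproc_per_node}", script_path]
    # Current behaviour: plain python3 entrypoint.
    return ["python3", "-u", script_path]


print(build_container_command("/tmp/train.py", use_torchrun=True, nproc_per_node=2))
# ['torchrun', '--nproc_per_node=2', '/tmp/train.py']
```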
Also, we need to set the ElasticPolicy c10d backend.
We also need to make sure that torchrun works with PyTorch code that does not use any distributed capabilities, for example:
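As a rough check of that point, a plain training script with no torch.distributed calls should run the same way under either entrypoint (the script name train.py, the model, and the launch flags are illustrative only):

```python
# A script with no distributed code at all: under
# "torchrun --standalone --nproc_per_node=1 train.py" torchrun simply
# launches a single worker process, so the script never has to touch
# torch.distributed.
import torch
import torch.nn as nn


def train():
    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(5):
        loss = model(torch.randn(8, 10)).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


if __name__ == "__main__":
    train()

# Both of these should work for the script above:
#   python3 train.py
#   torchrun --standalone --nproc_per_node=1 train.py
```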
cc @johnugeorge @tenzen-y @deepanker13 @kuizhiqing
torchrun was introduced in PyTorch v1.10: https://github.com/pytorch/pytorch/releases/tag/v1.10.0
So if we switch to torchrun, we need to announce to users that the new Python SDK doesn't support PyTorch < 1.10.
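If we go that route, the generated launcher (or the SDK) could guard on the installed version before defaulting to torchrun; a hedged sketch, with a hypothetical helper name:

```python
# Sketch of a version guard against the PyTorch installed in the training
# image; the function name is hypothetical.
from packaging import version
import torch


def supports_torchrun() -> bool:
    """torchrun shipped with PyTorch v1.10.0; older versions must keep python3."""
    installed = torch.__version__.split("+")[0]  # strip build suffixes like "+cu118"
    return version.parse(installed) >= version.parse("1.10.0")
```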
Also, we need to set the ElasticPolicy c10d backend.
I'm not sure why we need to use ElasticPolicy as a default once we switch to torchrun.
Could you clarify?
@tenzen-y I think environment variables like PET_RDZV_ENDPOINT, PET_RDZV_BACKEND, etc. are set on the containers only when we pass the ElasticPolicy spec (https://github.com/kubeflow/training-operator/blob/0b6a30cd348e101506b53a1a176e4a7aec6e9f09/pkg/controller.v1/pytorch/envvar.go#L109). And those environment variables are necessary to start multi-node training, as mentioned in https://github.com/pytorch/pytorch/blob/5bb2298da769121421711504da47955d3129b54f/torch/distributed/run.py#L164.
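For reference, a small sketch of how a worker could inspect these variables; the exact set the controller injects is defined in envvar.go, and the names beyond PET_RDZV_ENDPOINT / PET_RDZV_BACKEND, as well as the example values, are assumptions here:

```python
# Sketch: inspect the rendezvous-related variables that are only injected
# when ElasticPolicy is set. Names other than PET_RDZV_BACKEND and
# PET_RDZV_ENDPOINT, and all example values, are illustrative assumptions.
import os

PET_VARS = [
    "PET_RDZV_BACKEND",    # e.g. "c10d"
    "PET_RDZV_ENDPOINT",   # e.g. "<job-name>-master-0:29400"
    "PET_NNODES",          # e.g. "2", or an elastic range such as "1:3"
    "PET_NPROC_PER_NODE",  # processes launched per pod
]

for name in PET_VARS:
    # torchrun reads PET_*-prefixed variables as defaults for its CLI flags;
    # without ElasticPolicy they are absent, so multi-node rendezvous
    # cannot be configured this way.
    print(f"{name}={os.environ.get(name, '<unset>')}")
```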
It makes sense. Thank you for investigating this :)
Hi @andreyvelich @tenzen-y, has this issue been discussed clearly? If so, can I take it and do some implementation?
Hi @ckcd, we haven't had a chance to discuss this issue in detail yet.
We need to identify the pros and cons of using torchrun for all PyTorch-based tasks (e.g. single-node single-GPU, single-node multi-GPU, and multi-node multi-GPU runs); see the sketch below.
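To make that comparison concrete, here is an illustrative (not SDK-generated) mapping of those run modes to entrypoint commands; the script name and worker counts are assumptions:

```python
# Illustrative comparison of entrypoints per run mode; train.py and the
# process counts are placeholders, not SDK output.
SCENARIOS = {
    # single-node, single-GPU: torchrun adds little, but still works
    "single-node single-gpu": ["torchrun", "--standalone", "--nproc_per_node=1", "train.py"],
    # single-node, multi-GPU: torchrun replaces torch.multiprocessing launch boilerplate
    "single-node multi-gpu": ["torchrun", "--standalone", "--nproc_per_node=4", "train.py"],
    # multi-node: rendezvous settings would come from the PET_* variables set via ElasticPolicy
    "multi-node multi-gpu": ["torchrun", "--nproc_per_node=4", "train.py"],
    # today's default for all three cases
    "current default": ["python3", "-u", "train.py"],
}

for mode, cmd in SCENARIOS.items():
    print(f"{mode:>24}: {' '.join(cmd)}")
```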
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
/help
@andreyvelich: This request has been marked as needing help from a contributor.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.
In response to this:
/remove-lifecycle stale
/help
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
/good-first-issue
@andreyvelich: This request has been marked as suitable for new contributors.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.
In response to this:
/good-first-issue
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @andreyvelich