[SDK] Use Elastic Policy and torchrun as a Default Entrypoint for PyTorchJob
Currently, we use python3 as the entrypoint when creating a Training Job from a function: https://github.com/kubeflow/training-operator/blob/0b6a30cd348e101506b53a1a176e4a7aec6e9f09/sdk/python/kubeflow/training/utils/utils.py#L230
Since torchrun is the recommended entrypoint for running distributed PyTorch, we should discuss whether we need to change the entrypoint for a PyTorchJob created from a function.
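For illustration, here is a minimal sketch (not the actual SDK code; `build_container_command`, its arguments, and the script path are hypothetical) of how the generated container command could switch from python3 to torchrun:

```python
# Hypothetical sketch of the SDK-side change: the helper name and signature
# are assumptions, not existing training-operator SDK APIs.

def build_container_command(script_path: str, use_torchrun: bool,
                            nproc_per_node: int = 1) -> list:
    """Return the container command for the generated training script."""
    if use_torchrun:
        # torchrun can pick up node count / rendezvous settings from the
        # PET_* environment variables injected by the training operator,
        # so only the per-node process count is passed explicitly here.
        return ["torchrun", f"--nproc_per_node={nproc_per_node}", script_path]
    # Current behaviour: plain python3 entrypoint.
    return ["python3", "-u", script_path]


print(build_container_command("/tmp/train.py", use_torchrun=True, nproc_per_node=2))
# ['torchrun', '--nproc_per_node=2', '/tmp/train.py']
```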
Also, we need to set the ElasticPolicy c10d backend.
We also need to make sure that torchrun works with PyTorch code that does not use any distributed capabilities, for example:
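As a rough check of that point, a plain training script with no torch.distributed calls should run the same way under either entrypoint (the script name train.py, the model, and the launch flags are illustrative only):

```python
# A script with no distributed code at all: under
# "torchrun --standalone --nproc_per_node=1 train.py" torchrun simply
# launches a single worker process, so the script never has to touch
# torch.distributed.
import torch
import torch.nn as nn


def train():
    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(5):
        loss = model(torch.randn(8, 10)).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


if __name__ == "__main__":
    train()

# Both of these should work for the script above:
#   python3 train.py
#   torchrun --standalone --nproc_per_node=1 train.py
```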
cc @johnugeorge @tenzen-y @deepanker13 @kuizhiqing
torchrun was introduced in PyTorch v1.10: https://github.com/pytorch/pytorch/releases/tag/v1.10.0
So if we switch to torchrun, we need to announce to users that the new Python SDK doesn't support PyTorch < 1.10.
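If we go that route, the generated launcher (or the SDK) could guard on the installed version before defaulting to torchrun; a hedged sketch, with a hypothetical helper name:

```python
# Sketch of a version guard against the PyTorch installed in the training
# image; the function name is hypothetical.
from packaging import version
import torch


def supports_torchrun() -> bool:
    """torchrun shipped with PyTorch v1.10.0; older versions must keep python3."""
    installed = torch.__version__.split("+")[0]  # strip build suffixes like "+cu118"
    return version.parse(installed) >= version.parse("1.10.0")
```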
Also, we need to set the ElasticPolicy c10d backend.
I'm not sure why we need to use ElasticPolicy as a default once we switch to torchrun.
Could you clarify?
@tenzen-y I think environment variables like PET_RDZV_ENDPOINT, PET_RDZV_BACKEND, etc. are set on the containers only when we pass the ElasticPolicy spec (https://github.com/kubeflow/training-operator/blob/0b6a30cd348e101506b53a1a176e4a7aec6e9f09/pkg/controller.v1/pytorch/envvar.go#L109). And those environment variables are necessary to start multi-node training, as mentioned in https://github.com/pytorch/pytorch/blob/5bb2298da769121421711504da47955d3129b54f/torch/distributed/run.py#L164.
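For reference, a small sketch of how a worker could inspect these variables; the exact set the controller injects is defined in envvar.go, and the names beyond PET_RDZV_ENDPOINT / PET_RDZV_BACKEND, as well as the example values, are assumptions here:

```python
# Sketch: inspect the rendezvous-related variables that are only injected
# when ElasticPolicy is set. Names other than PET_RDZV_BACKEND and
# PET_RDZV_ENDPOINT, and all example values, are illustrative assumptions.
import os

PET_VARS = [
    "PET_RDZV_BACKEND",    # e.g. "c10d"
    "PET_RDZV_ENDPOINT",   # e.g. "<job-name>-master-0:29400"
    "PET_NNODES",          # e.g. "2", or an elastic range such as "1:3"
    "PET_NPROC_PER_NODE",  # processes launched per pod
]

for name in PET_VARS:
    # torchrun reads PET_*-prefixed variables as defaults for its CLI flags;
    # without ElasticPolicy they are absent, so multi-node rendezvous
    # cannot be configured this way.
    print(f"{name}={os.environ.get(name, '<unset>')}")
```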
It makes sense. Thank you for investigating this :)
Hi @andreyvelich @tenzen-y, has this issue been discussed clearly? If so, can I take it and do some implementation?
Hi @ckcd, we haven't had a chance to discuss this issue in detail yet.
We need to identify the pros and cons of using torchrun for all PyTorch-based tasks (e.g. single-node single-GPU, single-node multi-GPU, and multi-node multi-GPU runs); see the sketch below.
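To make that comparison concrete, here is an illustrative (not SDK-generated) mapping of those run modes to entrypoint commands; the script name and worker counts are assumptions:

```python
# Illustrative comparison of entrypoints per run mode; train.py and the
# process counts are placeholders, not SDK output.
SCENARIOS = {
    # single-node, single-GPU: torchrun adds little, but still works
    "single-node single-gpu": ["torchrun", "--standalone", "--nproc_per_node=1", "train.py"],
    # single-node, multi-GPU: torchrun replaces torch.multiprocessing launch boilerplate
    "single-node multi-gpu": ["torchrun", "--standalone", "--nproc_per_node=4", "train.py"],
    # multi-node: rendezvous settings would come from the PET_* variables set via ElasticPolicy
    "multi-node multi-gpu": ["torchrun", "--nproc_per_node=4", "train.py"],
    # today's default for all three cases
    "current default": ["python3", "-u", "train.py"],
}

for mode, cmd in SCENARIOS.items():
    print(f"{mode:>24}: {' '.join(cmd)}")
```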
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/remove-lifecycle stale
/help
@andreyvelich: This request has been marked as needing help from a contributor.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.
In response to this:
/remove-lifecycle stale
/help
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
/good-first-issue
@andreyvelich: This request has been marked as suitable for new contributors.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.
In response to this:
/good-first-issue
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign @andreyvelich