feat(pytorch): Add elastic proposal
Signed-off-by: cegao [email protected]
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: gaocegege
To complete the pull request process, please assign james-jwu after the PR has been reviewed.
You can assign the PR to them by writing /assign @james-jwu in a comment when ready.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.
/cc @kubeflow/wg-training-leads @alculquicondor @zw0610
Nice proposal. I will leave some comments tomorrow.
We may also have this problem: https://github.com/pytorch/pytorch/issues/65992
We should consider it in this proposal.
Could we have an "environment variable list for PyTorch distributed training"? Then we can further explore whether some fields could be merged and how to modify the replica spec.
Sure, of course. Working on it.
// Rendezvous related arguments
// EnvRDZVBackend is the environment variable name for the rdzv backend.
EnvRDZVBackend = "PET_RDZV_BACKEND"
// EnvRDZVID is the environment variable name for the rdzv id.
EnvRDZVID = "PET_RDZV_ID"
// EnvRDZVConf is the environment variable name for the rdzv conf.
EnvRDZVConf = "PET_RDZV_CONF"
// EnvRDZVEndpoint is the environment variable name for the rdzv endpoint.
EnvRDZVEndpoint = "PET_RDZV_ENDPOINT"
// EnvStandalone is the environment variable name for the standalone mode.
EnvStandalone = "PET_STANDALONE"
// User-code launch related arguments.
// EnvMaxRestarts is the environment variable name for the maximum number of worker group restarts before failing.
EnvMaxRestarts = "PET_MAX_RESTARTS"
// EnvMonitorInterval is the environment variable name for the interval, in seconds, to monitor the state of workers.
EnvMonitorInterval = "PET_MONITOR_INTERVAL"
// EnvStartMethod is the environment variable name for the multiprocessing start method used when creating workers; valid values are fork, spawn, and forkserver.
EnvStartMethod = "PET_START_METHOD"
// Worker/node size related arguments.
// EnvNProcPerNode is the environment variable name for the number of processes per node.
EnvNProcPerNode = "PET_N_PROC_PER_NODE"
These environment variables are related to elastic training.
Ref https://github.com/pytorch/pytorch/issues/67616
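For illustration, here is a minimal sketch of how the controller could inject these variables into a worker container. The ElasticPolicy struct and setElasticEnv helper are hypothetical names introduced only for this sketch, and only a subset of the constants above is shown:

package pytorch

import (
	"strconv"

	corev1 "k8s.io/api/core/v1"
)

// ElasticPolicy is a hypothetical subset of the proposed elastic spec.
type ElasticPolicy struct {
	RDZVBackend  string
	RDZVEndpoint string
	MaxRestarts  int32
	NProcPerNode int32
}

// setElasticEnv appends the torchelastic (PET_*) variables to a worker
// container so that torch.distributed.run reads them instead of CLI flags.
func setElasticEnv(c *corev1.Container, p ElasticPolicy) {
	c.Env = append(c.Env,
		corev1.EnvVar{Name: EnvRDZVBackend, Value: p.RDZVBackend},
		corev1.EnvVar{Name: EnvRDZVEndpoint, Value: p.RDZVEndpoint},
		corev1.EnvVar{Name: EnvMaxRestarts, Value: strconv.Itoa(int(p.MaxRestarts))},
		corev1.EnvVar{Name: EnvNProcPerNode, Value: strconv.Itoa(int(p.NProcPerNode))},
	)
}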
Elastic training still cannot tolerate agent failures with PyTorch 1.10. I will dive deep into it.
https://github.com/pytorch/pytorch/issues/67742
There is an upstream bug in PyTorch: it does not handle SIGTERM when scaling down the workers. After the patch, the MVP works well. I will refine the proposal in the coming days.
I refined the proposal, PTAL
There is a demo https://asciinema.org/a/446932 to help us understand the design.
The PoC of the operator is here https://github.com/kubeflow/training-operator/pull/1453/commits/cdfd1be1b6a0d04cbe578195da345fe433dcdd88
/cc @qiankunli
@gaocegege: GitHub didn't allow me to request PR reviews from the following users: qiankunli.
Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.
In response to this:
/cc @qiankunli
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I updated the details about HPA. BTW, I think we should upgrade the PyTorchJob APIVersion to v2beta1 since it is a breaking change.
What is breaking about it? If users don't set elastic configurations, wouldn't it work as before?
I double-checked the CRD definition. We can keep compatibility with v1: common.JobStatus changes, but that does not block v1 as long as we upgrade the kubeflow/common version. Thus we can do it in v1.
type ReplicaStatus struct {
+ // LabelSelector is the selector for the replica.
+ LabelSelector *metav1.LabelSelector `json:"labelSelector,omitempty"`
// The number of actively running pods.
Active int32 `json:"active,omitempty"`
// The number of pods which reached phase Succeeded.
Succeeded int32 `json:"succeeded,omitempty"`
// The number of pods which reached phase Failed.
Failed int32 `json:"failed,omitempty"`
}
Right, API changes are breaking only if you remove or rename a field. I think the proposed changes look backwards compatible.
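To make the HPA linkage concrete: the scale subresource needs a selector in the status so the HPA controller can count worker pods. Below is a minimal sketch of how such a selector could be built and serialized; the workerSelector and selectorString helpers and the label keys are assumptions for illustration, not the operator's actual labels:

package pytorch

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// workerSelector builds the selector stored in ReplicaStatus.LabelSelector.
// The label keys below are illustrative assumptions.
func workerSelector(jobName string) *metav1.LabelSelector {
	return &metav1.LabelSelector{
		MatchLabels: map[string]string{
			"job-name":     jobName,
			"replica-type": "worker",
		},
	}
}

// selectorString serializes the selector into the string form that the
// CRD scale subresource's selectorPath is expected to expose.
func selectorString(ls *metav1.LabelSelector) (string, error) {
	sel, err := metav1.LabelSelectorAsSelector(ls)
	if err != nil {
		return "", err
	}
	return sel.String(), nil
}

With a selector like this exposed through the scale subresource, a standard HorizontalPodAutoscaler can target the PyTorchJob's Worker replicas directly.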
/cc @kubeflow/wg-training-leads
Could you please have another look?
https://github.com/kubeflow/training-operator/pull/1453 The PR is ready, and the Coveralls coverage increased (+7.1%) to 15.252%; PyTorch-related test coverage increased from 0% to 80%.
The proposal overall looks good to me. One thing I am not clear on is the metrics/HPA part. I feel it's better to decouple it from the job controller, but I understand users may need a simple solution for easy onboarding. A global optimizer is something I am looking for; we can discuss it later. Looks like the PR is merged, so let's merge this one as well.
/lgtm
@kubeflow/wg-training-leads could we get an "/approved" for this?
/approve
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: gaocegege, terrytangyuan
To complete the pull request process, please assign james-jwu after the PR has been reviewed.
You can assign the PR to them by writing /assign @james-jwu in a comment when ready.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.
/assign @james-jwu
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/assign @theadactyl
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.