
feat(pytorch): Add elastic proposal

Open gaocegege opened this issue 2 years ago • 25 comments

Signed-off-by: cegao [email protected]

gaocegege avatar Aug 16 '21 12:08 gaocegege

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gaocegege To complete the pull request process, please assign james-jwu after the PR has been reviewed. You can assign the PR to them by writing /assign @james-jwu in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

google-oss-robot avatar Aug 16 '21 12:08 google-oss-robot

/cc @kubeflow/wg-training-leads @alculquicondor @zw0610

gaocegege avatar Aug 18 '21 07:08 gaocegege

Nice proposal. I will leave some comments tomorrow.

Jeffwan avatar Aug 19 '21 06:08 Jeffwan

We may also have this problem: https://github.com/pytorch/pytorch/issues/65992

We should consider it in this proposal.

gaocegege avatar Oct 27 '21 02:10 gaocegege

We may also have this problem: pytorch/pytorch#65992

We should consider it in this proposal.

Could we have an "environment variable list for PyTorch distributed training"? Then we can further explore whether some fields could be merged and how to modify the replica spec.

zw0610 avatar Oct 27 '21 03:10 zw0610

Could we have an "environment variable list for PyTorch distributed training"? Then we can further explore whether some fields could be merged and how to modify the replica spec.

Sure, of course. Working on it.

gaocegege avatar Oct 27 '21 03:10 gaocegege

```go
const (
	// Rendezvous-related arguments.

	// EnvRDZVBackend is the environment variable name for the rdzv backend.
	EnvRDZVBackend = "PET_RDZV_BACKEND"
	// EnvRDZVID is the environment variable name for the rdzv id.
	EnvRDZVID = "PET_RDZV_ID"
	// EnvRDZVConf is the environment variable name for the rdzv conf.
	EnvRDZVConf = "PET_RDZV_CONF"
	// EnvRDZVEndpoint is the environment variable name for the rdzv endpoint.
	EnvRDZVEndpoint = "PET_RDZV_ENDPOINT"
	// EnvStandalone is the environment variable name for the standalone mode.
	EnvStandalone = "PET_STANDALONE"

	// User-code launch-related arguments.

	// EnvMaxRestarts is the environment variable name for the maximum number of worker group restarts before failing.
	EnvMaxRestarts = "PET_MAX_RESTARTS"
	// EnvMonitorInterval is the environment variable name for the interval, in seconds, to monitor the state of workers.
	EnvMonitorInterval = "PET_MONITOR_INTERVAL"
	// EnvStartMethod is the environment variable name for the multiprocessing start method to use when creating workers; one of fork, spawn, or forkserver.
	EnvStartMethod = "PET_START_METHOD"

	// Worker/node size-related arguments.

	// EnvNProcPerNode is the environment variable name for the number of processes per node.
	EnvNProcPerNode = "PET_N_PROC_PER_NODE"
)
```

These environment variables are related to elastic training.
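To illustrate how an operator could inject these variables into worker pods, here is a minimal, self-contained sketch. The `ElasticPolicy` struct, `EnvVar` type, and `buildElasticEnv` helper are hypothetical simplifications, not the actual training-operator API:

```go
package main

import "fmt"

// ElasticPolicy is a hypothetical, simplified stand-in for the elastic
// configuration a PyTorchJob spec might carry.
type ElasticPolicy struct {
	RDZVBackend  string
	RDZVID       string
	RDZVEndpoint string
	MaxRestarts  int
	NProcPerNode int
}

// EnvVar mirrors the name/value pair an operator would place on a pod spec.
type EnvVar struct {
	Name  string
	Value string
}

// buildElasticEnv maps the policy onto the PET_* environment variables
// consumed by the torch elastic agent.
func buildElasticEnv(p ElasticPolicy) []EnvVar {
	return []EnvVar{
		{Name: "PET_RDZV_BACKEND", Value: p.RDZVBackend},
		{Name: "PET_RDZV_ID", Value: p.RDZVID},
		{Name: "PET_RDZV_ENDPOINT", Value: p.RDZVEndpoint},
		{Name: "PET_MAX_RESTARTS", Value: fmt.Sprintf("%d", p.MaxRestarts)},
		{Name: "PET_N_PROC_PER_NODE", Value: fmt.Sprintf("%d", p.NProcPerNode)},
	}
}

func main() {
	env := buildElasticEnv(ElasticPolicy{
		RDZVBackend:  "c10d",
		RDZVID:       "job-1",
		RDZVEndpoint: "pytorch-job-master-0:29400",
		MaxRestarts:  3,
		NProcPerNode: 2,
	})
	for _, e := range env {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```

The point of routing configuration through environment variables rather than CLI flags is that the operator can mutate the replica spec without rewriting the user's container command.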

gaocegege avatar Oct 27 '21 06:10 gaocegege

Ref https://github.com/pytorch/pytorch/issues/67616

PyTorch 1.10 still cannot tolerate agent failures. I will dive deeper into it.

gaocegege avatar Nov 02 '21 03:11 gaocegege

https://github.com/pytorch/pytorch/issues/67742

There is an upstream bug in PyTorch: it does not handle SIGTERM when scaling down the workers. After the patch, the MVP works well. I will refine the proposal in the coming days.

gaocegege avatar Nov 03 '21 06:11 gaocegege

I refined the proposal, PTAL

There is a demo https://asciinema.org/a/446932 to help us understand the design.

The PoC of the operator is here https://github.com/kubeflow/training-operator/pull/1453/commits/cdfd1be1b6a0d04cbe578195da345fe433dcdd88

gaocegege avatar Nov 04 '21 08:11 gaocegege

/cc @qiankunli

gaocegege avatar Nov 04 '21 08:11 gaocegege

@gaocegege: GitHub didn't allow me to request PR reviews from the following users: qiankunli.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @qiankunli

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Nov 04 '21 08:11 google-oss-prow[bot]

I updated the details about HPA. BTW, I think we should upgrade the PyTorchJob APIVersion to v2beta1 since it is a breaking change.

gaocegege avatar Nov 10 '21 09:11 gaocegege

What is breaking about it? If users don't set elastic configurations, wouldn't it work as before?

alculquicondor avatar Nov 10 '21 14:11 alculquicondor

I double-checked the CRD definition. We can keep compatibility with v1. common.JobStatus is changed, but the change does not block v1 if we upgrade the kubeflow/common version. Thus we can do it in v1.

```diff
 type ReplicaStatus struct {
+	// LabelSelector is the selector for the replica.
+	LabelSelector *metav1.LabelSelector `json:"labelSelector,omitempty"`
 	// The number of actively running pods.
 	Active int32 `json:"active,omitempty"`
 	// The number of pods which reached phase Succeeded.
 	Succeeded int32 `json:"succeeded,omitempty"`
 	// The number of pods which reached phase Failed.
 	Failed int32 `json:"failed,omitempty"`
 }
```

gaocegege avatar Nov 11 '21 02:11 gaocegege

Right, API changes are breaking only if you remove or rename a field. I think the proposed changes look backwards compatible.

alculquicondor avatar Nov 11 '21 14:11 alculquicondor

/cc @kubeflow/wg-training-leads

Could you please have another look?

https://github.com/kubeflow/training-operator/pull/1453 The PR is ready, and the Coveralls coverage increased (+7.1%) to 15.252%; PyTorch-related test coverage increased from 0% to 80%.

gaocegege avatar Nov 19 '21 03:11 gaocegege

The proposal overall looks good to me. One thing I am not clear on is the metrics/HPA part. I feel it's better to decouple it from the job controller, but I understand users may need a simple solution for easy onboarding. A global optimizer is something I am looking for; we can discuss it later. Looks like the PR is merged, so let's merge this one as well.

/lgtm

Jeffwan avatar Nov 29 '21 15:11 Jeffwan

@kubeflow/wg-training-leads could we get an "/approved" for this?

theadactyl avatar Dec 15 '21 19:12 theadactyl

/approve

terrytangyuan avatar Dec 15 '21 19:12 terrytangyuan

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gaocegege, terrytangyuan To complete the pull request process, please assign james-jwu after the PR has been reviewed. You can assign the PR to them by writing /assign @james-jwu in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow[bot] avatar Dec 15 '21 19:12 google-oss-prow[bot]

/assign @james-jwu

terrytangyuan avatar Jan 26 '22 21:01 terrytangyuan

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 27 '22 20:04 stale[bot]

/assign @theadactyl

terrytangyuan avatar Apr 27 '22 22:04 terrytangyuan

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 21 '22 05:09 stale[bot]