
feat(pytorch): Add elastic proposal

Open gaocegege opened this issue 2 years ago • 25 comments

Signed-off-by: cegao [email protected]

gaocegege avatar Aug 16 '21 12:08 gaocegege

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gaocegege To complete the pull request process, please assign james-jwu after the PR has been reviewed. You can assign the PR to them by writing /assign @james-jwu in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

google-oss-robot avatar Aug 16 '21 12:08 google-oss-robot

/cc @kubeflow/wg-training-leads @alculquicondor @zw0610

gaocegege avatar Aug 18 '21 07:08 gaocegege

Nice proposal. I will leave some comments tomorrow.

Jeffwan avatar Aug 19 '21 06:08 Jeffwan

We may also have this problem: https://github.com/pytorch/pytorch/issues/65992

We should consider it in this proposal.

gaocegege avatar Oct 27 '21 02:10 gaocegege

We may also have this problem: pytorch/pytorch#65992

We should consider it in this proposal.

Could we have an "environment variable list for PyTorch distributed training"? Then we can further explore whether some fields could be merged and how to modify the replica spec.

zw0610 avatar Oct 27 '21 03:10 zw0610

Could we have an "environment variable list for PyTorch distributed training"? Then we can further explore whether some fields could be merged and how to modify the replica spec.

Sure, of course. Working on it.

gaocegege avatar Oct 27 '21 03:10 gaocegege

```go
const (
	// Rendezvous-related arguments.

	// EnvRDZVBackend is the environment variable name for the rdzv backend.
	EnvRDZVBackend = "PET_RDZV_BACKEND"
	// EnvRDZVID is the environment variable name for the rdzv id.
	EnvRDZVID = "PET_RDZV_ID"
	// EnvRDZVConf is the environment variable name for the rdzv conf.
	EnvRDZVConf = "PET_RDZV_CONF"
	// EnvRDZVEndpoint is the environment variable name for the rdzv endpoint.
	EnvRDZVEndpoint = "PET_RDZV_ENDPOINT"
	// EnvStandalone is the environment variable name for the standalone mode.
	EnvStandalone = "PET_STANDALONE"

	// User-code launch-related arguments.

	// EnvMaxRestarts is the environment variable name for the maximum number of worker group restarts before failing.
	EnvMaxRestarts = "PET_MAX_RESTARTS"
	// EnvMonitorInterval is the environment variable name for the interval, in seconds, to monitor the state of workers.
	EnvMonitorInterval = "PET_MONITOR_INTERVAL"
	// EnvStartMethod is the environment variable name for the multiprocessing start method to use when creating workers; one of fork, spawn, or forkserver.
	EnvStartMethod = "PET_START_METHOD"

	// Worker/node size-related arguments.

	// EnvNProcPerNode is the environment variable name for the number of processes per node.
	EnvNProcPerNode = "PET_N_PROC_PER_NODE"
)
```

These environment variables are related to elastic training.
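To illustrate how an operator could inject these variables into worker pods, here is a minimal, self-contained sketch. The `ElasticPolicy` struct, `EnvVar` type, and `buildElasticEnv` helper are hypothetical simplifications, not the actual training-operator API:

```go
package main

import "fmt"

// ElasticPolicy is a hypothetical, simplified stand-in for the elastic
// configuration a PyTorchJob spec might carry.
type ElasticPolicy struct {
	RDZVBackend  string
	RDZVID       string
	RDZVEndpoint string
	MaxRestarts  int
	NProcPerNode int
}

// EnvVar mirrors the name/value pair an operator would place on a pod spec.
type EnvVar struct {
	Name  string
	Value string
}

// buildElasticEnv maps the policy onto the PET_* environment variables
// consumed by the torch elastic agent.
func buildElasticEnv(p ElasticPolicy) []EnvVar {
	return []EnvVar{
		{Name: "PET_RDZV_BACKEND", Value: p.RDZVBackend},
		{Name: "PET_RDZV_ID", Value: p.RDZVID},
		{Name: "PET_RDZV_ENDPOINT", Value: p.RDZVEndpoint},
		{Name: "PET_MAX_RESTARTS", Value: fmt.Sprintf("%d", p.MaxRestarts)},
		{Name: "PET_N_PROC_PER_NODE", Value: fmt.Sprintf("%d", p.NProcPerNode)},
	}
}

func main() {
	env := buildElasticEnv(ElasticPolicy{
		RDZVBackend:  "c10d",
		RDZVID:       "job-1",
		RDZVEndpoint: "pytorch-job-master-0:29400",
		MaxRestarts:  3,
		NProcPerNode: 2,
	})
	for _, e := range env {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```

The point of routing configuration through environment variables rather than CLI flags is that the operator can mutate the replica spec without rewriting the user's container command.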

gaocegege avatar Oct 27 '21 06:10 gaocegege

Ref https://github.com/pytorch/pytorch/issues/67616

PyTorch 1.10 still cannot tolerate agent failures. I will dive deeper into it.

gaocegege avatar Nov 02 '21 03:11 gaocegege

https://github.com/pytorch/pytorch/issues/67742

There is an upstream bug in PyTorch: it does not handle SIGTERM when scaling down the workers. After the patch, the MVP works well. I will refine the proposal in the coming days.

gaocegege avatar Nov 03 '21 06:11 gaocegege

I refined the proposal, PTAL

There is a demo https://asciinema.org/a/446932 to help us understand the design.

The PoC of the operator is here https://github.com/kubeflow/training-operator/pull/1453/commits/cdfd1be1b6a0d04cbe578195da345fe433dcdd88

gaocegege avatar Nov 04 '21 08:11 gaocegege

/cc @qiankunli

gaocegege avatar Nov 04 '21 08:11 gaocegege

@gaocegege: GitHub didn't allow me to request PR reviews from the following users: qiankunli.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @qiankunli

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

google-oss-prow[bot] avatar Nov 04 '21 08:11 google-oss-prow[bot]

I updated the details about HPA. BTW, I think we should upgrade the PyTorchJob APIVersion to v2beta1 since it is a breaking change.

gaocegege avatar Nov 10 '21 09:11 gaocegege

What is breaking about it? If users don't set elastic configurations, wouldn't it work as before?

alculquicondor avatar Nov 10 '21 14:11 alculquicondor

I double-checked the CRD definition. We can keep compatibility with v1. common.JobStatus is changed, but the change does not block v1 if we upgrade the kubeflow/common version. Thus we can do it in v1.

```diff
 type ReplicaStatus struct {
+	// LabelSelector is the selector for the replica.
+	LabelSelector *metav1.LabelSelector `json:"labelSelector,omitempty"`
 	// The number of actively running pods.
 	Active int32 `json:"active,omitempty"`
 	// The number of pods which reached phase Succeeded.
 	Succeeded int32 `json:"succeeded,omitempty"`
 	// The number of pods which reached phase Failed.
 	Failed int32 `json:"failed,omitempty"`
 }
```

gaocegege avatar Nov 11 '21 02:11 gaocegege

Right, API changes are breaking only if you remove or rename a field. I think the proposed changes look backwards compatible.

alculquicondor avatar Nov 11 '21 14:11 alculquicondor

/cc @kubeflow/wg-training-leads

Could you please have another look?

https://github.com/kubeflow/training-operator/pull/1453 The PR is ready, and the Coveralls coverage increased (+7.1%) to 15.252%; PyTorch-related test coverage increased from 0% to 80%.

gaocegege avatar Nov 19 '21 03:11 gaocegege

The proposal overall looks good to me. One thing I am not clear on is the metrics/HPA part. I feel it's better to decouple it from the job controller, but I understand users may need a simple solution for easy onboarding. A global optimizer is something I am looking for; we can discuss it later. Looks like the PR is merged, so let's merge this one as well.

/lgtm

Jeffwan avatar Nov 29 '21 15:11 Jeffwan

@kubeflow/wg-training-leads could we get an "/approved" for this?

theadactyl avatar Dec 15 '21 19:12 theadactyl

/approve

terrytangyuan avatar Dec 15 '21 19:12 terrytangyuan

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: gaocegege, terrytangyuan To complete the pull request process, please assign james-jwu after the PR has been reviewed. You can assign the PR to them by writing /assign @james-jwu in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

google-oss-prow[bot] avatar Dec 15 '21 19:12 google-oss-prow[bot]

/assign @james-jwu

terrytangyuan avatar Jan 26 '22 21:01 terrytangyuan

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Apr 27 '22 20:04 stale[bot]

/assign @theadactyl

terrytangyuan avatar Apr 27 '22 22:04 terrytangyuan

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 21 '22 05:09 stale[bot]