training-operator: When will large model frameworks be supported? DeepSpeed, for example

When will large model frameworks be supported? DeepSpeed, for example.

Open PeterChg opened this issue 2 years ago • 7 comments

Apr 21 '23 03:04 PeterChg

Can you add more info and update the description? We would love to add support for frameworks like DeepSpeed and LLM examples. What are your thoughts?

Apr 21 '23 04:04 johnugeorge

Can you add more info and update the description? We would love to add support for frameworks like DeepSpeed and LLM examples. What are your thoughts?

Since DeepSpeed was open-sourced, more and more companies have been using it to train LLMs. However, the DeepSpeed framework differs from PyTorch in some ways.

Apr 23 '23 02:04 PeterChg

@PeterChg You might be interested in this: https://github.com/kubeflow/mpi-operator/pull/549.

Apr 24 '23 06:04 tenzen-y

DeepSpeed supports several parallel launchers, such as pdsh (the default, which requires machines to be reachable via passwordless SSH), OpenMPI, Slurm, and so on.

The MPI job support in the training operator launches processes through kubectl exec, and it is unclear whether DeepSpeed can work with that. For now, mpi-operator v2 (which uses passwordless SSH) would be a better fit.
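
To make the launcher point concrete, here is a rough Python sketch of what a multi-node DeepSpeed launch needs: a hostfile with one `<hostname> slots=<n>` line per worker, plus the `--hostfile` and `--launcher` options of the `deepspeed` CLI. The worker hostnames, the hostfile path, and `train.py` are hypothetical placeholders, and the two helper functions are mine, not part of DeepSpeed or any operator. The sketch only builds the text and the command line; it does not start a job.

```python
from typing import List, Sequence


def build_hostfile(hosts: Sequence[str], slots_per_host: int) -> str:
    """Render a DeepSpeed-style hostfile: one '<hostname> slots=<n>' line per worker."""
    if slots_per_host < 1:
        raise ValueError("slots_per_host must be at least 1")
    return "".join(f"{host} slots={slots_per_host}\n" for host in hosts)


def build_launch_command(
    hostfile_path: str,
    launcher: str,
    script: str,
    script_args: Sequence[str] = (),
) -> List[str]:
    """Assemble the argv for a multi-node `deepspeed` launch (nothing is executed)."""
    return [
        "deepspeed",
        f"--hostfile={hostfile_path}",
        f"--launcher={launcher}",  # e.g. "pdsh" (DeepSpeed's default) or "openmpi"
        script,
        *script_args,
    ]


if __name__ == "__main__":
    # Hypothetical worker hostnames; in a cluster these would be the worker pod names.
    workers = ["ds-job-worker-0", "ds-job-worker-1"]
    print(build_hostfile(workers, slots_per_host=8), end="")
    # Hypothetical hostfile path and training script.
    print(" ".join(build_launch_command("/etc/deepspeed/hostfile", "openmpi", "train.py")))
```

With pdsh or OpenMPI over SSH, the launcher node has to reach every host in that file without a password, which is why the SSH-based mpi-operator v2 looks like the closer match.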

Apr 24 '23 08:04 Syulin7

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Aug 23 '23 20:08 github-actions[bot]

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

Sep 12 '23 20:09 github-actions[bot]

/lifecycle frozen

Oct 08 '23 18:10 johnugeorge
