training-operator

When will large model frameworks, such as DeepSpeed, be supported?

Open PeterChg opened this issue 2 years ago • 7 comments

PeterChg avatar Apr 21 '23 03:04 PeterChg

Can you add more info and update the description? We would love to add support for frameworks like DeepSpeed and LLM examples. What are your thoughts?

johnugeorge avatar Apr 21 '23 04:04 johnugeorge

Can you add more info and update the description? We would love to add support for frameworks like DeepSpeed and LLM examples. What are your thoughts?

Since DeepSpeed was open-sourced, more and more companies have been using it to train LLMs, but the DeepSpeed framework differs from PyTorch in some ways.

PeterChg avatar Apr 23 '23 02:04 PeterChg

@PeterChg You might be interested in this: https://github.com/kubeflow/mpi-operator/pull/549.

tenzen-y avatar Apr 24 '23 06:04 tenzen-y

DeepSpeed supports various parallel launchers, such as pdsh (the default, which requires machines reachable via passwordless SSH), OpenMPI, Slurm, and so on.

The mpi-operator in the training operator launches workers through kubectl exec, and it is uncertain whether DeepSpeed can work with that. For now, using MPI Operator v2 (which launches via passwordless SSH) would be more appropriate.
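
As a rough illustration of the launcher choices described in this comment, here is a minimal Python sketch that renders a DeepSpeed-style hostfile and assembles a `deepspeed` command line for a chosen launcher. The helper names (`render_hostfile`, `build_launch_command`), the host names, the slot counts, and the `train.py` script are all hypothetical. The sketch only builds the command and does not run it, and it restricts itself to the three launchers named above.

```python
import shlex

# Launchers named in the comment above; the DeepSpeed CLI may accept others.
KNOWN_LAUNCHERS = ("pdsh", "openmpi", "slurm")


def render_hostfile(slots_by_host):
    """Return hostfile text with one '<hostname> slots=<count>' line per host."""
    return "".join(
        f"{host} slots={slots}\n" for host, slots in slots_by_host.items()
    )


def build_launch_command(script, hostfile_path, launcher="pdsh"):
    """Assemble (but do not execute) a multi-node deepspeed command line."""
    if launcher not in KNOWN_LAUNCHERS:
        raise ValueError(f"unsupported launcher for this sketch: {launcher!r}")
    return [
        "deepspeed",
        f"--hostfile={hostfile_path}",
        f"--launcher={launcher}",
        script,
    ]


if __name__ == "__main__":
    # Hypothetical two-node job with 8 GPUs per node.
    print(render_hostfile({"worker-0": 8, "worker-1": 8}), end="")
    print(shlex.join(build_launch_command("train.py", "/job/hostfile", "openmpi")))
```

With the default pdsh launcher, every host in the hostfile has to be reachable over passwordless SSH, which is why an SSH-based setup fits more naturally than one driven through kubectl exec.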

Syulin7 avatar Apr 24 '23 08:04 Syulin7

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 23 '23 20:08 github-actions[bot]

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions[bot] avatar Sep 12 '23 20:09 github-actions[bot]

/lifecycle frozen

johnugeorge avatar Oct 08 '23 18:10 johnugeorge