training-operator
When will large-model frameworks be supported, DeepSpeed for example?
Can you add more info and update the description? We would love to add support for frameworks like DeepSpeed and LLM examples. eBay, what are your thoughts?
Since DeepSpeed was open-sourced, more and more companies have been using it to train LLMs, but the DeepSpeed framework has some differences from PyTorch.
@PeterChg You might be interested in this: https://github.com/kubeflow/mpi-operator/pull/549.
DeepSpeed supports various parallel launchers, such as pdsh (the default, which requires the machines to be reachable via passwordless SSH), OpenMPI, Slurm, and so on.
The mpi-operator bundled in the training operator launches workers through kubectl exec, and it is uncertain whether DeepSpeed can work with that. Currently, using MPI Operator v2 (which relies on passwordless SSH) would be more appropriate.
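To make this concrete, here is a minimal sketch of how a DeepSpeed run could be submitted as an MPIJob through the MPI Operator v2 API, using the Kubernetes Python client. The image name, training command, replica counts, and resource sizes are hypothetical placeholders, not an official example; the operator itself takes care of the passwordless-SSH wiring between launcher and workers.

```python
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside a pod

# Hypothetical MPIJob (kubeflow.org/v2beta1) that runs a DeepSpeed training script.
mpijob = {
    "apiVersion": "kubeflow.org/v2beta1",
    "kind": "MPIJob",
    "metadata": {"name": "deepspeed-demo", "namespace": "default"},
    "spec": {
        "slotsPerWorker": 1,
        "runPolicy": {"cleanPodPolicy": "Running"},
        "mpiReplicaSpecs": {
            "Launcher": {
                "replicas": 1,
                "template": {
                    "spec": {
                        "containers": [{
                            "name": "launcher",
                            "image": "example.com/deepspeed-train:latest",  # placeholder image
                            "command": [
                                "mpirun", "--allow-run-as-root",
                                "python", "train.py",
                                "--deepspeed", "--deepspeed_config", "ds_config.json",
                            ],
                        }]
                    }
                },
            },
            "Worker": {
                "replicas": 2,
                "template": {
                    "spec": {
                        "containers": [{
                            "name": "worker",
                            "image": "example.com/deepspeed-train:latest",  # placeholder image
                            "resources": {"limits": {"nvidia.com/gpu": 1}},
                        }]
                    }
                },
            },
        },
    },
}

# MPIJob is a custom resource, so it is created through the CustomObjectsApi.
client.CustomObjectsApi().create_namespaced_custom_object(
    group="kubeflow.org",
    version="v2beta1",
    namespace="default",
    plural="mpijobs",
    body=mpijob,
)
```

The key point is that DeepSpeed is driven through the OpenMPI launcher here rather than its default pdsh launcher, which fits the SSH-based MPI Operator v2 model described above.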
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
/lifecycle frozen