Is there any plan to support deepspeed job plugin for distributed training?
DeepSpeed is now very popular for distributed training in AI scenarios. I hope Volcano can support it to enhance its capabilities. Thanks.
Can you describe the specific scenario?
Training on multiple nodes using DeepSpeed. In this case, two conditions need to be met:
1. The pods need passwordless SSH between them (the ssh plugin may cover this).
2. DeepSpeed needs a specific hostfile, passed via --hostfile (the svc plugin could be used to create a headless Service).

So the question is: I need to know the worker pods' names, generate a hostfile from them, and mount it into the pod. I want to use the DeepSpeed framework to accelerate the training of my PyTorch job, but it seems Volcano doesn't support DeepSpeed directly, since DeepSpeed needs a hostfile listing the nodes that take part in the job. Is there any solution that can use MPI directly without a dedicated plugin? You can refer to https://www.deepspeed.ai/getting-started/, chapter "DeepSpeed Resource Configuration (multi-node)".
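For readers unfamiliar with the DeepSpeed side, the two requirements above boil down to a hostfile plus one launcher flag. A minimal sketch, assuming two placeholder hosts `worker-0` and `worker-1` with 8 GPUs each and a placeholder training script `train.py`; only the `<hostname> slots=<n>` line format and the `--hostfile` flag come from the DeepSpeed guide linked above:

```shell
# Illustrative location; any path the launcher can read works.
hostfile="${HOSTFILE:-$(mktemp)}"

# One line per node: <hostname> slots=<number of GPUs on that node>.
# worker-0 and worker-1 are placeholder hostnames.
cat > "$hostfile" <<'EOF'
worker-0 slots=8
worker-1 slots=8
EOF

# The launcher reaches every listed host over passwordless SSH.
# train.py is a placeholder, so the command is only printed here, not run.
launch_cmd="deepspeed --hostfile=$hostfile train.py"
echo "$launch_cmd"
```

On Kubernetes the hostnames are not known ahead of time, which is why the hostfile has to be generated inside the pod.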
Volcano has plugins that cover your scenario:
https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_env_plugin.md
https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_ssh_plugin.md
https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_svc_plugin.md
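Put together, enabling the three plugins on a job could look roughly like the manifest below. This is a sketch under assumptions, not a tested setup: the job name `deepspeed-job`, the task name `worker`, the replica count, and the image `example.com/deepspeed-train:latest` are all placeholders, and the image would need an SSH server and DeepSpeed installed.

```shell
# Write a hypothetical Volcano Job manifest to a file (path is illustrative).
manifest="${MANIFEST:-${TMPDIR:-/tmp}/deepspeed-job.yaml}"

cat > "$manifest" <<'EOF'
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: deepspeed-job            # placeholder name
spec:
  schedulerName: volcano
  minAvailable: 2
  plugins:
    ssh: []   # passwordless SSH between the pods of the job
    svc: []   # headless Service and /etc/volcano/*.host files
    env: []   # task index exposed to each pod as an environment variable
  tasks:
    - name: worker               # placeholder task name
      replicas: 2
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: worker
              image: example.com/deepspeed-train:latest   # placeholder image
EOF

echo "wrote $manifest"
# To submit it to a cluster that has Volcano installed:
# kubectl apply -f "$manifest"
```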
OK, thanks. I will try it and see if it works.
@rockburning
May I ask if your attempt was successful?
Yes. I just used the svc plugin and relied on the headless Service DNS records.
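This is the sample shell code to collect all the hosts into a DeepSpeed hostfile:

    # Slots (GPUs) per host: first script argument, default 8.
    slot_value="${1:-8}"

    # The svc plugin writes one /etc/volcano/<task>.host file per task,
    # with one hostname per line. DeepSpeed expects "<hostname> slots=<n>".
    : > /etc/deepspeed-hostfile
    for file in /etc/volcano/*.host; do
      while IFS= read -r host || [ -n "$host" ]; do
        [ -n "$host" ] || continue
        echo "$host slots=$slot_value" >> /etc/deepspeed-hostfile
      done < "$file"
    done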
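Before launching, it may be worth checking that the generated file has the shape DeepSpeed expects, one `<hostname> slots=<n>` entry per line. A small sketch of such a check; `validate_hostfile` is a hypothetical helper name, not part of Volcano or DeepSpeed:

```shell
# Return 0 if the file is non-empty and every non-blank line
# looks like "<hostname> slots=<n>"; return 1 otherwise.
validate_hostfile() {
  file="$1"
  if [ ! -s "$file" ]; then
    echo "hostfile missing or empty: $file" >&2
    return 1
  fi
  # Drop blank lines, then look for any line that does NOT match the pattern.
  if grep -v '^[[:space:]]*$' "$file" \
      | grep -Evq '^[^[:space:]]+[[:space:]]+slots=[0-9]+$'; then
    echo "malformed line in hostfile: $file" >&2
    return 1
  fi
  return 0
}

# Example usage (train.py is a placeholder script):
# validate_hostfile /etc/deepspeed-hostfile \
#   && deepspeed --hostfile=/etc/deepspeed-hostfile train.py
```

This only checks the format. It does not verify that the hostnames resolve or that SSH between the pods works.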