Is there any plan to support a DeepSpeed job plugin for distributed training?

Open rockburning opened this issue 1 year ago • 4 comments

DeepSpeed is now very popular for distributed training in AI scenarios. I hope Volcano can support it to extend its capabilities. Thanks.

rockburning avatar Apr 25 '24 08:04 rockburning

Can you describe the specific scenario?

hwdef avatar Apr 25 '24 09:04 hwdef

The scenario is multi-node training with DeepSpeed. Two conditions need to be met:

1. Passwordless SSH between the pods (the ssh plugin may cover this).
2. A hostfile, because DeepSpeed has to be given one via --hostfile (the headless Service created by the svc plugin may help here).

So the question is how to find out the worker pods' names, generate the hostfile, and mount it into the pod. I want to use the DeepSpeed framework to accelerate my PyTorch training job, but Volcano does not seem to support DeepSpeed directly, because DeepSpeed needs a hostfile that lists the nodes of the job. Is there a solution that uses MPI directly, without a dedicated plugin? For reference, see the DeepSpeed chapter "Resource Configuration (multi-node)" at https://www.deepspeed.ai/getting-started/

rockburning avatar Apr 25 '24 09:04 rockburning

Volcano has plugins that cover your scenario:

https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_env_plugin.md

https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_ssh_plugin.md

https://github.com/volcano-sh/volcano/blob/master/docs/user-guide/how_to_use_svc_plugin.md

hwdef avatar Apr 25 '24 14:04 hwdef
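
For readers following along, here is a minimal sketch of how the three plugins linked above might be enabled together on a Volcano Job. It is not taken from the thread: the job name, task name, replica count, image, and container command are hypothetical placeholders, and the image is assumed to ship both sshd and DeepSpeed.

```shell
#!/usr/bin/env bash
# write_manifest FILE: write a sketch of a Volcano Job manifest to FILE.
# All names, the image, and the replica count are placeholders (assumptions).
write_manifest() {
  local manifest="$1"
  cat > "$manifest" <<'EOF'
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: deepspeed-demo            # hypothetical job name
spec:
  minAvailable: 2
  schedulerName: volcano
  plugins:
    ssh: []   # passwordless SSH between the job's pods
    svc: []   # headless Service and /etc/volcano/*.host files
    env: []   # task index exposed as an environment variable
  tasks:
    - name: worker                # hypothetical task name
      replicas: 2
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: worker
              image: example.com/deepspeed-train:latest   # hypothetical image
              # Assumption: workers only need to run sshd so a launcher can reach them.
              command: ["/bin/sh", "-c", "mkdir -p /var/run/sshd && /usr/sbin/sshd -D"]
EOF
  echo "wrote $manifest"
}

# Usage (not executed here):
#   write_manifest deepspeed-job.yaml
#   kubectl apply -f deepspeed-job.yaml
```

The function only writes a file, so the manifest can be reviewed and adjusted before anything is applied to a cluster.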

OK, thanks. I will try it and see if it works.

rockburning avatar Apr 26 '24 04:04 rockburning

@rockburning

May I ask if your attempt was successful?

GitEasonXu avatar May 07 '24 09:05 GitEasonXu

Yes. I just used the svc plugin and relied on the DNS records of the headless Service.

rockburning avatar May 08 '24 06:05 rockburning

This is the sample shell code that collects all the hosts into a DeepSpeed hostfile:

    # Slots (GPUs) per host: first argument, defaults to 8.
    slot_value="${1:-8}"

    # Each /etc/volcano/*.host file lists one hostname per line.
    # DeepSpeed expects one "<hostname> slots=<n>" entry per line.
    : > /etc/deepspeed-hostfile
    for file in /etc/volcano/*.host; do
      while IFS= read -r host || [ -n "$host" ]; do
        [ -n "$host" ] && echo "$host slots=$slot_value" >> /etc/deepspeed-hostfile
      done < "$file"
    done

rockburning avatar May 08 '24 06:05 rockburning
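
Before launching, it may be worth checking that the generated hostfile matches the format DeepSpeed documents for multi-node runs, one "<hostname> slots=<n>" entry per line. The sketch below is an illustration rather than part of the original answer: check_hostfile is a hypothetical helper name, and train.py in the usage comment is a placeholder for your own training script.

```shell
#!/usr/bin/env bash
# check_hostfile FILE: succeed only if every non-empty line of FILE looks like
# "<hostname> slots=<n>", then print the host and slot totals.
check_hostfile() {
  local hostfile="$1"
  awk '
    NF == 0 { next }                        # skip blank lines
    NF != 2 || $2 !~ /^slots=[0-9]+$/ {     # anything else is malformed
      printf "line %d is malformed: %s\n", NR, $0 > "/dev/stderr"
      bad = 1
      next
    }
    { split($2, kv, "="); total += kv[2]; hosts++ }
    END {
      if (bad || hosts == 0) exit 1
      printf "%d hosts, %d slots\n", hosts, total
    }
  ' "$hostfile"
}

# Usage inside the pod (not executed here; train.py is a placeholder):
#   check_hostfile /etc/deepspeed-hostfile && \
#     deepspeed --hostfile=/etc/deepspeed-hostfile train.py
```

A check like this catches small slips such as writing slot= instead of slots=, or several hostnames ending up on one line.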