
Configuration question

Open zfstr opened this issue 3 years ago • 3 comments

What do these parameters mean? [image]

zfstr, Oct 02 '22 02:10

These are arguments of the DeepSpeed launcher:

- NUM_WORKERS sets --num_nodes, the number of servers used for pretraining.
- NUM_GPUS_PER_WORKER sets --num_gpus, the number of GPUs on each server.
- HOST_FILE_PATH is the path to an OpenMPI-style hostfile. You can find more details at https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node

MP_SIZE is the model parallel size.

duzx16, Oct 02 '22 09:10
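
To make the mapping in the answer above concrete, here is a minimal Python sketch of how the script variables could translate into launcher arguments. The helper name `build_launcher_args` is hypothetical and not part of GLM or DeepSpeed; only the `--hostfile`, `--num_nodes`, and `--num_gpus` flags come from the DeepSpeed launcher. The divisibility check rests on an assumption that the total GPU count should be a multiple of MP_SIZE.

```python
def build_launcher_args(num_workers, num_gpus_per_worker, host_file_path, mp_size):
    """Assemble DeepSpeed launcher arguments from the script variables (hypothetical helper)."""
    total_gpus = num_workers * num_gpus_per_worker
    # Assumption: each model replica spans mp_size GPUs,
    # so the total GPU count should be a multiple of mp_size.
    if total_gpus % mp_size != 0:
        raise ValueError(f"{total_gpus} GPUs cannot be split into groups of {mp_size}")
    return [
        "deepspeed",
        "--hostfile", host_file_path,            # HOST_FILE_PATH
        "--num_nodes", str(num_workers),         # NUM_WORKERS
        "--num_gpus", str(num_gpus_per_worker),  # NUM_GPUS_PER_WORKER
    ]


args = build_launcher_args(
    num_workers=2, num_gpus_per_worker=4, host_file_path="hostfile", mp_size=4
)
print(" ".join(args))
```

The training script and its own arguments would follow these launcher arguments on the command line; they are left out here because they depend on the specific GLM script being run.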

Thank you for your answer. I want to run a model-parallel experiment, but my checkpoint contains only one file, rank_00.pt. With MP_SIZE=4, it reports that rank_01.pt, rank_02.pt, and rank_03.pt cannot be found. This problem has been bothering me for a long time. If you know the cause, I hope you can help me.

zfstr, Oct 03 '22 06:10

You need to split the downloaded checkpoint with change_mp.py, following the instructions at https://github.com/THUDM/GLM#model-parallelism

duzx16, Oct 03 '22 16:10
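
Before launching with a given MP_SIZE, it can help to confirm that the checkpoint directory holds one file per model-parallel rank. The sketch below is a hedged illustration, not part of GLM: the helper `missing_rank_files` is hypothetical, and the `rank_NN.pt` naming is assumed from the file names quoted in this thread, so adjust the pattern if your checkpoint uses different names.

```python
import tempfile
from pathlib import Path


def missing_rank_files(checkpoint_dir, mp_size):
    """Return the per-rank checkpoint files that are absent (hypothetical helper)."""
    # Assumption: one file per model-parallel rank, named rank_00.pt, rank_01.pt, ...
    expected = [f"rank_{rank:02d}.pt" for rank in range(mp_size)]
    return [name for name in expected if not (Path(checkpoint_dir) / name).is_file()]


# Demo: a directory holding only rank_00.pt, checked against MP_SIZE=4.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "rank_00.pt").touch()
    missing = missing_rank_files(tmp, mp_size=4)

print(missing)
```

A non-empty result suggests the checkpoint was saved for a smaller model parallel size and still needs to be split as described above.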