Configuration question
What do these parameters mean?
These are arguments of the DeepSpeed launcher.
NUM_WORKERS sets --num_nodes, the number of servers (nodes) used for pretraining.
NUM_GPUS_PER_WORKER sets --num_gpus, the number of GPUs on each server.
HOST_FILE_PATH is the path to an OpenMPI-style hostfile.
You can find more details at https://www.deepspeed.ai/getting-started/#resource-configuration-multi-node
MP_SIZE is the model parallel size.
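For reference, an OpenMPI-style hostfile lists one node per line with the number of GPU slots it exposes. The hostnames and slot counts below are illustrative:

```
worker-1 slots=8
worker-2 slots=8
```

With this file, NUM_WORKERS=2 and NUM_GPUS_PER_WORKER=8 would use all 16 GPUs across the two nodes.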
Thank you for your answer. I want to run a model parallel experiment, but my checkpoint contains only one file, rank_00.pt. With MP_SIZE=4, it reports that the files rank_01.pt, rank_02.pt, and rank_03.pt cannot be found. This problem has been bothering me for a long time. If you know the cause, I hope you can help me.
You need to split the downloaded checkpoint with change_mp.py, following the instructions at https://github.com/THUDM/GLM#model-parallelism
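As a rough sketch, the split is a one-off command run before launching training; the checkpoint path below is illustrative, and the exact argument order should be taken from the linked README:

```
# Split a single-rank checkpoint into 4 model-parallel partitions
# (path is a placeholder; see the GLM README for the exact usage)
python change_mp.py /path/to/checkpoint 4
```

After this step, the checkpoint directory should contain one partition per model-parallel rank, matching the MP_SIZE you pass to the launcher.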