CoLLiE
Collaborative Training of Large Language Models in an Efficient Way
If the port is already in use, find an unoccupied port and set `os.environ["MASTER_PORT"]` to it.
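A minimal sketch of that workaround: ask the OS for a free port by binding to port 0, then export it as `MASTER_PORT` before launching. The helper name `find_free_port` is just illustrative, not part of collie.

```python
import os
import socket

def find_free_port() -> int:
    # Binding to port 0 lets the OS pick an unused port; we read it back
    # and release the socket immediately.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

# Set this before the distributed process group is initialized.
os.environ["MASTER_PORT"] = str(find_free_port())
```

Note the small race: another process could grab the port between release and use, but in practice this reliably avoids a known-conflicting port.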
Configuration:

```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:29402 --nnodes=1 --nproc_per_node=8 train.py
```

```python
config.tp_size = 1
config.dp_size = 1  # or 8, doesn't matter
config.pp_size = 1
config.train_epochs = 1
config.eval_per_n_steps = 0
config.eval_per_n_epochs = 1
config.train_micro_batch_size...
```
I am using ZeRO stage 2 in training. The process gets stuck when initializing optimizer states. I'm able to run it using TP. Here is the package info:

```
python==3.10.13
deepspeed==0.12.6...
```
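When debugging a hang like this, it helps to confirm what the DeepSpeed config actually requests. A minimal fragment enabling ZeRO stage 2 (optimizer-state and gradient partitioning), using DeepSpeed's documented JSON keys; how it is wired into collie's training config is an assumption left to the collie docs:

```python
# Minimal DeepSpeed config requesting ZeRO stage 2. Only "stage" selects the
# ZeRO level; the other keys shown are common companions, not requirements.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 2,            # partition optimizer states and gradients
        "overlap_comm": True,  # overlap gradient reduction with backward pass
    },
}
```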
Referring to [trl](https://github.com/huggingface/trl) and [DeepSpeed Chat](https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-chat/chinese/README.md), we hope collie will support the three-stage RLHF training pipeline.