
Collaborative Training of Large Language Models in an Efficient Way

Results: 26 CoLLiE issues

If the port conflicts, find an unused port and update `os.environ["MASTER_PORT"]` accordingly.
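The workaround above can be sketched as follows; `find_free_port` is a hypothetical helper written for illustration, not part of CoLLiE:

```python
import os
import socket

def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

# If the default rendezvous port conflicts, point MASTER_PORT at a free one
# before the distributed backend is initialized.
os.environ["MASTER_PORT"] = str(find_free_port())
```

This must run before `torch.distributed` (or DeepSpeed) initializes the process group, since the rendezvous reads `MASTER_PORT` at init time.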

Configuration:

```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:29402 --nnodes=1 --nproc_per_node=8 train.py

config.tp_size = 1
config.dp_size = 1  # or 8, doesn't matter
config.pp_size = 1
config.train_epochs = 1
config.eval_per_n_steps = 0
config.eval_per_n_epochs = 1
config.train_micro_batch_size...
```
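As a sanity check on the sizes in that snippet, the usual 3D-parallelism convention (assumed here; not verified against CoLLiE's internals) is that `tp_size * dp_size * pp_size` equals the world size, so with tensor and pipeline parallelism disabled the data-parallel degree is simply the GPU count:

```python
import os

# Assumed convention: tp_size * dp_size * pp_size == world size.
tp_size, pp_size = 1, 1
# 8 GPUs, per the torchrun --nproc_per_node=8 command above.
world_size = int(os.environ.get("WORLD_SIZE", "8"))
dp_size = world_size // (tp_size * pp_size)
print(dp_size)
```

With `tp_size = pp_size = 1` and 8 processes, this gives `dp_size = 8`, which is why the comment in the issue says the value "doesn't matter": frameworks that follow this convention can derive it.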

For example, training a 1B llama2-architecture model from scratch.

help wanted

I am using ZeRO-2 in training. The process gets stuck when initializing optimizer states. I'm able to run it using TP. Here is the package info:

```
python==3.10.13
deepspeed==0.12.6...
```

Following [trl](https://github.com/huggingface/trl) and [DeepSpeed Chat](https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-chat/chinese/README.md), we hope collie can support the three-stage RLHF training pipeline.