CoLLiE
Collaborative Training of Large Language Models in an Efficient Way
If the port is already in use, find an unoccupied port and set `os.environ["MASTER_PORT"]` to it.
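A minimal sketch of that workaround: ask the OS for a free port by binding to port 0, then export it as `MASTER_PORT` before launching. The helper name `find_free_port` is just illustrative, not part of collie.

```python
import os
import socket

def find_free_port() -> int:
    # Binding to port 0 lets the OS pick an unused port; we read it back
    # and release the socket immediately.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

# Set this before the distributed process group is initialized.
os.environ["MASTER_PORT"] = str(find_free_port())
```

Note the small race: another process could grab the port between release and use, but in practice this reliably avoids a known-conflicting port.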
Configuration:

```
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --rdzv_backend=c10d --rdzv_endpoint=localhost:29402 --nnodes=1 --nproc_per_node=8 train.py
```

```python
config.tp_size = 1
config.dp_size = 1  # or 8, doesn't matter
config.pp_size = 1
config.train_epochs = 1
config.eval_per_n_steps = 0
config.eval_per_n_epochs = 1
config.train_micro_batch_size...
```
I am using ZeRO stage 2 in training. The process gets stuck when initializing optimizer states. I'm able to run it using TP. Here is the package info:

```
python==3.10.13
deepspeed==0.12.6...
```
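When debugging a hang like this, it helps to confirm what the DeepSpeed config actually requests. A minimal fragment enabling ZeRO stage 2 (optimizer-state and gradient partitioning), using DeepSpeed's documented JSON keys; how it is wired into collie's training config is an assumption left to the collie docs:

```python
# Minimal DeepSpeed config requesting ZeRO stage 2. Only "stage" selects the
# ZeRO level; the other keys shown are common companions, not requirements.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 2,            # partition optimizer states and gradients
        "overlap_comm": True,  # overlap gradient reduction with backward pass
    },
}
```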
Referring to [trl](https://github.com/huggingface/trl) and [DeepSpeed Chat](https://github.com/microsoft/DeepSpeed/blob/master/blogs/deepspeed-chat/chinese/README.md), we hope collie will support the three-stage RLHF training pipeline.