MedicalGPT

Error when pretraining ChatGLM on a single machine with multiple GPUs

Open zzzhaoguziji opened this issue 1 year ago • 3 comments

Describe the Question

Single-GPU training works, but training on a single machine with multiple GPUs does not. The training command is:
CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes 1 --nproc_per_node 1 pretraining.py
--model_type chatglm
--model_name_or_path ../chatglm
--train_file_dir ../data/pretrain
--validation_file_dir ../data/pretrain
--per_device_train_batch_size 4
--per_device_eval_batch_size 4
--do_train
--do_eval
--use_peft True
--seed 42
--fp16
--num_train_epochs 0.5
--learning_rate 2e-4
--warmup_ratio 0.05
--weight_decay 0.01
--logging_strategy steps
--logging_steps 10
--eval_steps 50
--evaluation_strategy steps
--save_steps 500
--save_strategy steps
--save_total_limit 3
--gradient_accumulation_steps 1
--preprocessing_num_workers 1
--block_size 1024
--output_dir outputs-pt-v1
--overwrite_output_dir
--ddp_timeout 30000
--logging_first_step True
--target_modules all
--lora_rank 8
--lora_alpha 16
--lora_dropout 0.05
--torch_dtype float16
--device_map auto
--report_to tensorboard
--ddp_find_unused_parameters False
--gradient_checkpointing True
--deepspeed deepspeed_config.json

Describe your attempts

[ ] I walked through the tutorials
[ ] I checked the documentation
[ ] I checked to make sure that this is not a duplicate question

Jun 09 '23 02:06 zzzhaoguziji

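The arguments need to be:
CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes 1 --nproc_per_node 2 pretraining.py
In torchrun mode, every GPU loads the full set of model parameters and training is data-parallel. If GPU memory is insufficient, you can enable cpu_offload.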
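
To make the difference between the two --nproc_per_node values concrete, here is a minimal Python sketch of the data-parallel arithmetic. The helper data_parallel_plan and its field names are hypothetical and not part of MedicalGPT or torchrun. It only assumes that torchrun starts nnodes * nproc_per_node worker processes and that each worker consumes per_device_train_batch_size samples per step.

```python
def data_parallel_plan(visible_devices: str, nproc_per_node: int,
                       per_device_batch: int, grad_accum: int = 1,
                       nnodes: int = 1) -> dict:
    """Summarize a torchrun data-parallel launch (illustrative helper only)."""
    # GPUs exposed through CUDA_VISIBLE_DEVICES, e.g. "0,1" -> ["0", "1"]
    gpus = [d for d in visible_devices.split(",") if d.strip()]
    # torchrun starts one worker process per (node, local rank) pair
    world_size = nnodes * nproc_per_node
    return {
        "visible_gpus": len(gpus),
        "world_size": world_size,
        # visible GPUs that do not get a worker process of their own
        "gpus_without_worker": max(len(gpus) - nproc_per_node, 0),
        # samples consumed per optimizer step across all workers
        "effective_batch_size": per_device_batch * grad_accum * world_size,
    }


# Values taken from the command in this issue (batch size 4, accumulation 1).
original = data_parallel_plan("0,1", nproc_per_node=1, per_device_batch=4)
suggested = data_parallel_plan("0,1", nproc_per_node=2, per_device_batch=4)
print(original)
print(suggested)
```

With the values from this issue, the original command gives a world size of 1 and an effective batch size of 4, while the suggested command gives a world size of 2 and an effective batch size of 8.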

Jun 12 '23 12:06 shibing624

You can also do this: CUDA_VISIBLE_DEVICES=0,1 python pretraining.py. Using device_map="auto" automatically distributes the model across multiple GPUs.
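
A minimal sketch of what loading with device_map="auto" might look like in Python. It assumes the transformers and accelerate packages are installed and that the checkpoint at ../chatglm ships its own modeling code (hence trust_remote_code=True). The function name load_sharded is hypothetical, and this is not necessarily how pretraining.py loads the model.

```python
def load_sharded(model_path: str):
    """Load ONE copy of the model, spread across all visible GPUs."""
    # Imported lazily so this file can be imported without the heavy dependencies.
    import torch
    from transformers import AutoModel

    model = AutoModel.from_pretrained(
        model_path,
        trust_remote_code=True,      # assumption: checkpoint uses custom model code
        torch_dtype=torch.float16,   # matches --torch_dtype float16 in the issue
        device_map="auto",           # let accelerate place layers on the visible GPUs
    )
    return model


# Example usage, launched as a single process with plain python:
#   CUDA_VISIBLE_DEVICES=0,1 python this_script.py
# model = load_sharded("../chatglm")
# print(model.hf_device_map)  # shows which device each module was placed on
```

This runs as a single process that shards one model copy across the GPUs, rather than replicating the model on every GPU as a torchrun launch does. It may be worth double-checking whether --device_map auto should be combined with a multi-process torchrun launch at all.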

Jun 13 '23 04:06 shibing624

Thanks a lot, I'll try again.

Jun 14 '23 09:06 zzzhaoguziji

Quoting shibing624: You can also do this: CUDA_VISIBLE_DEVICES=0,1 python pretraining.py. Using device_map="auto" automatically distributes the model across multiple GPUs.

I tried this with glm6b2 and it still fails with the same error. With glm6b it works, though.

Jun 29 '23 11:06 boxter007

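The model parameters of glm and glm2 are different, so they have to be adjusted during conversion. I'm working on this as well and will comment again once I have results. Leaving a marker here to follow the thread.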

Jul 05 '23 06:07 Alfer-Feng

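Quoting shibing624: The arguments need to be: CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes 1 --nproc_per_node 2 pretraining.py. In torchrun mode, every GPU loads the full set of model parameters and training is data-parallel. If GPU memory is insufficient, you can enable cpu_offload.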

How do I enable cpu_offload?
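
One possible way, as a hedged sketch: assuming cpu_offload here refers to DeepSpeed ZeRO offloading and that the file passed via --deepspeed is a standard DeepSpeed JSON config, the offload can be switched on in the zero_optimization section. The contents of the deepspeed_config.json used in this issue are not shown in the thread, so the config below is an assumption and not the repository's file. The helper name zero_offload_config is hypothetical.

```python
import json


def zero_offload_config(stage: int = 2) -> dict:
    """Build a DeepSpeed config dict with ZeRO CPU offload (illustrative only)."""
    zero = {
        "stage": stage,
        # keep optimizer states in CPU memory instead of GPU memory
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    }
    if stage == 3:
        # parameter offload is only available with ZeRO stage 3
        zero["offload_param"] = {"device": "cpu", "pin_memory": True}
    return {
        # "auto" lets the Hugging Face Trainer fill these in from its own arguments
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "fp16": {"enabled": "auto"},
        "zero_optimization": zero,
    }


# Print the JSON; it could be saved to a file and passed via --deepspeed.
print(json.dumps(zero_offload_config(), indent=2))
```

Offloading trades GPU memory for extra CPU memory and slower steps, so it is usually worth trying only when the model does not fit otherwise.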

Jul 21 '23 16:07 archerbj

Was this resolved in the end? Could you share how you solved it? @zzzhaoguziji @boxter007

Jul 27 '23 11:07 chloefresh

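Quoting Alfer-Feng: The model parameters of glm and glm2 are different, so they have to be adjusted during conversion. I'm working on this as well and will comment again once I have results. Leaving a marker here to follow the thread.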

Any results yet?

Jul 27 '23 11:07 chloefresh