MedicalGPT

Error when pretraining ChatGLM on a single machine with multiple GPUs

Open zzzhaoguziji opened this issue 1 year ago • 3 comments

Describe the Question

Please provide a clear and concise description of what the question is.

Single-GPU training works, but single-machine multi-GPU training fails. The training command is:

CUDA_VISIBLE_DEVICES=0,1 torchrun --nnodes 1 --nproc_per_node 1 pretraining.py \
--model_type chatglm \
--model_name_or_path ../chatglm \
--train_file_dir ../data/pretrain \
--validation_file_dir ../data/pretrain \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--do_train \
--do_eval \
--use_peft True \
--seed 42 \
--fp16 \
--num_train_epochs 0.5 \
--learning_rate 2e-4 \
--warmup_ratio 0.05 \
--weight_decay 0.01 \
--logging_strategy steps \
--logging_steps 10 \
--eval_steps 50 \
--evaluation_strategy steps \
--save_steps 500 \
--save_strategy steps \
--save_total_limit 3 \
--gradient_accumulation_steps 1 \
--preprocessing_num_workers 1 \
--block_size 1024 \
--output_dir outputs-pt-v1 \
--overwrite_output_dir \
--ddp_timeout 30000 \
--logging_first_step True \
--target_modules all \
--lora_rank 8 \
--lora_alpha 16 \
--lora_dropout 0.05 \
--torch_dtype float16 \
--device_map auto \
--report_to tensorboard \
--ddp_find_unused_parameters False \
--gradient_checkpointing True \
--deepspeed deepspeed_config.json
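
One thing that may be worth checking in the command above: it exposes two GPUs through CUDA_VISIBLE_DEVICES=0,1 but passes --nproc_per_node 1, and torchrun starts that many worker processes per node. The sketch below is only a sanity check under that assumption; the helper names `visible_gpu_count` and `check_launch` are hypothetical and are not part of MedicalGPT or torchrun, and the check does not by itself explain the reported error.

```python
import os


def visible_gpu_count(env=None):
    """Count the device entries listed in CUDA_VISIBLE_DEVICES (0 if unset or empty)."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES", "")
    return len([d for d in value.split(",") if d.strip()])


def check_launch(nproc_per_node, env=None):
    """Compare the requested worker count with the number of visible GPUs."""
    gpus = visible_gpu_count(env)
    if nproc_per_node < gpus:
        return (f"{nproc_per_node} worker(s) for {gpus} visible GPU(s): "
                f"{gpus - nproc_per_node} GPU(s) have no worker")
    if nproc_per_node > gpus:
        return (f"{nproc_per_node} worker(s) for {gpus} visible GPU(s): "
                f"more workers than GPUs")
    return "one worker per visible GPU"


if __name__ == "__main__":
    # Values taken from the command in this issue.
    print(check_launch(1, {"CUDA_VISIBLE_DEVICES": "0,1"}))
```

Passing the environment as a dictionary keeps the check independent of the shell it runs in; calling `check_launch(2)` with no second argument reads the real environment instead.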

Describe your attempts

  • [ ] I walked through the tutorials
  • [ ] I checked the documentation
  • [ ] I checked to make sure that this is not a duplicate question

Attached screenshot: 微信截图_20230609104145 (WeChat screenshot)

zzzhaoguziji · Jun 09 '23 02:06