
Could you provide a reference accelerate config_file? I can't get accelerate to start at all.

Open Chenzongchao opened this issue 1 year ago • 4 comments

command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: null
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false

src/train_sft.py \
    --model_name_or_path /models/Ziya-LLaMA-13B-Pretrain-v1/ \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --finetuning_type lora \
    --output_dir sft_save_model_checkpoint_V2 \
    --overwrite_cache \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --resume_lora_training False \
    --plot_loss \
    --max_source_length 1200 \
    --max_target_length 768 \
    --fp16
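To tie the pieces together: the usual workflow is to save an accelerate config as a YAML file and point the launcher at it with `--config_file`. A minimal sketch, assuming a file name `ds_zero3.yaml` (the path and the abridged key set here are illustrative, not from this thread):

```shell
# Write a trimmed-down version of the DeepSpeed ZeRO-3 config above to a file.
cat > ds_zero3.yaml <<'EOF'
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  zero_stage: 3
  gradient_accumulation_steps: 1
mixed_precision: fp16
num_machines: 1
num_processes: 8
EOF

# Then launch training with it (script arguments abridged):
#   accelerate launch --config_file ds_zero3.yaml src/train_sft.py --do_train ...
echo "wrote ds_zero3.yaml"
```

Without `--config_file`, `accelerate launch` falls back to the config saved by `accelerate config` (typically under `~/.cache/huggingface/accelerate/`), which is a common source of "it ignores my settings" confusion.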

Chenzongchao avatar Jun 05 '23 09:06 Chenzongchao

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: <your GPU count>
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

hiyouga avatar Jun 05 '23 09:06 hiyouga

Is there some detail I'm missing? Even with this config it won't start, and it keeps reporting: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 20863) of binary. What causes this? Isn't the launch command just accelerate launch src/train_XX.py plus the required arguments?
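A general note on that error (not stated in this thread): when torch.distributed.elastic reports a negative exitcode, the worker process was killed by a signal, and the signal number is the absolute value of the exitcode. A small sketch to decode it:

```python
import signal


def describe_exitcode(exitcode: int) -> str:
    """Map a torch.distributed.elastic worker exitcode to a description.

    Negative exitcodes mean the worker was killed by a signal whose
    number is -exitcode; non-negative exitcodes are ordinary statuses.
    """
    if exitcode < 0:
        return signal.Signals(-exitcode).name
    return f"exited with status {exitcode}"


# exitcode -7 corresponds to signal 7, which is SIGBUS on Linux.
print(describe_exitcode(-7))
```

On Linux, signal 7 is SIGBUS, and a commonly reported trigger for SIGBUS in multi-GPU training is a too-small shared-memory segment (`/dev/shm`), e.g. in a Docker container started without `--shm-size`; checking `df -h /dev/shm` is a reasonable first step, though other causes are possible.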

hepj987 avatar Jun 09 '23 12:06 hepj987

@hepj987 Try running `accelerate test` to check your setup.
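For context, `accelerate test` runs a minimal end-to-end distributed script using a given config, which isolates environment problems from LLaMA-Factory itself. A sketch of the invocation (the config path is illustrative):

```shell
# Build the validation command against the config file you intend to launch
# with; running it exercises the same process-group setup as a real job.
CONFIG=ds_zero3.yaml
CMD="accelerate test --config_file $CONFIG"
echo "$CMD"
# If this test fails, fix the config/environment before retrying train_sft.py.
```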

hiyouga avatar Jun 16 '23 23:06 hiyouga

> Is there some detail I'm missing? Even with this config it won't start, and it keeps reporting: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 20863) of binary. What causes this? Isn't the launch command just accelerate launch src/train_XX.py plus the required arguments?

Did you solve it?

gebilaoman avatar Jun 19 '23 09:06 gebilaoman