LLaMA-Factory
Could you provide a reference accelerate config_file? ...accelerate just won't launch
```yaml
command_file: null
commands: null
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
dynamo_backend: 'NO'
fsdp_config: {}
gpu_ids: null
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
megatron_lm_config: {}
mixed_precision: fp16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_name: null
tpu_zone: null
use_cpu: false
```
```bash
accelerate launch src/train_sft.py \
    --model_name_or_path /models/Ziya-LLaMA-13B-Pretrain-v1/ \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --finetuning_type lora \
    --output_dir sft_save_model_checkpoint_V2 \
    --overwrite_cache \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 1.0 \
    --resume_lora_training False \
    --plot_loss \
    --max_source_length 1200 \
    --max_target_length 768 \
    --fp16
```
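For what it's worth, `accelerate launch` also accepts the config explicitly via `--config_file`, which removes any ambiguity about which config actually gets picked up. A minimal sketch; `ds_zero3.yaml` is a placeholder filename, not something from this thread:

```bash
# ds_zero3.yaml is a placeholder name for the DeepSpeed config above.
accelerate launch --config_file ds_zero3.yaml src/train_sft.py \
    --model_name_or_path /models/Ziya-LLaMA-13B-Pretrain-v1/ \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --finetuning_type lora \
    --output_dir sft_save_model_checkpoint_V2
```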
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: <your GPU count>
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
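A quick sketch of wiring this config in, assuming you save the YAML by hand rather than generating it with `accelerate config` (the cache path is the standard Hugging Face default, not something stated in this thread; `multi_gpu.yaml` is a placeholder name):

```bash
# Save the YAML above as the default config so a bare
# `accelerate launch` picks it up automatically.
mkdir -p ~/.cache/huggingface/accelerate
cp multi_gpu.yaml ~/.cache/huggingface/accelerate/default_config.yaml

# Print the active environment and config to verify what accelerate will use.
accelerate env
```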
Is there some detail I've overlooked? Even with this config it won't launch; it keeps failing with: `ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 0 (pid: 20863) of binary`. What could cause this? Isn't the launch command simply `accelerate launch src/train_XX.py` plus the required arguments?
@hepj987 Try running `accelerate test` to check your setup.
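`accelerate test` launches a small built-in sanity-check script across the configured processes, so it can surface environment problems before a real training run. It also accepts `--config_file`, letting you test the exact config you intend to train with; the filename below is a placeholder:

```bash
# Sanity-check the distributed setup with the same config you plan to train with;
# ds_zero3.yaml is a placeholder for whichever config file you saved.
accelerate test --config_file ds_zero3.yaml
```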
Did you ever solve this?