FastChat
How to use a single GPU for training?
Hi! I am training on a single A100 GPU (40 GB) with the following command:
export NCCL_IB_DISABLE=1;
export NCCL_P2P_DISABLE=1;
export NCCL_DEBUG=INFO;
export NCCL_SOCKET_IFNAME=en,eth,em,bond;
export CXX=g++;
deepspeed --num_gpus 1 --num_nodes 1 \
fastchat/train/train_mem.py \
--model_name_or_path ../hf-llama-7B \
--data_path ../merged.json \
--bf16 True \
--output_dir finetune_output \
--num_train_epochs 3 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 16 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 1200 \
--save_total_limit 10 \
--learning_rate 2e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 512 \
--gradient_checkpointing True \
--lazy_preprocess True \
--deepspeed deepspeed.json
deepspeed.json:
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "train_micro_batch_size_per_gpu": "auto"
}
But the run fails with error code -9.
Can you help me?
Hi @merrymercy, can you help me, please?
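An exit code of -9 means the process was killed with signal 9 (SIGKILL), which usually means the system ran out of host memory (RAM, not GPU memory) and the kernel's OOM killer stopped the training process. Your config offloads the optimizer to the CPU, so the optimizer state for the 7B model is held in system RAM. You can confirm this in the kernel log (for example /var/log/kern.log on Debian/Ubuntu, or the output of dmesg). So you need to upgrade the machine's memory.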
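To check whether the -9 exit came from the OOM killer, a minimal shell sketch like the one below may help. The function names find_oom_events and show_host_memory are made up for illustration and are not part of FastChat or DeepSpeed. The search patterns are an assumption about how the Linux kernel typically words OOM messages, so adjust them if your kernel logs look different.

```shell
#!/usr/bin/env bash

# Hypothetical helper: keep only the lines that look like OOM-killer activity.
# It reads log text from stdin, so it works with dmesg, journalctl -k, or a log file.
# The patterns are assumptions about typical kernel wording.
find_oom_events() {
    # "|| true" keeps the exit status at 0 when nothing matches.
    grep -i -E 'out of memory|oom-killer|killed process' || true
}

# Hypothetical helper: print total and available host RAM in GiB.
# Linux only; /proc/meminfo reports these values in kB.
show_host_memory() {
    awk '/^MemTotal:|^MemAvailable:/ { printf "%s %.1f GiB\n", $1, $2 / 1048576 }' /proc/meminfo
}
```

After sourcing the file, run `dmesg -T | find_oom_events` (this may need sudo on some systems) right after the crash, and `show_host_memory` before launching training. If a "Killed process" line names your python process, the host RAM was exhausted rather than the GPU memory.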