How to use a single GPU for training?

Open aresa7796 opened this issue 2 years ago • 1 comments

Hi! I am using a single A100 GPU (40 GB) and launch training like this:

export NCCL_IB_DISABLE=1;
export NCCL_P2P_DISABLE=1;
export NCCL_DEBUG=INFO;
export NCCL_SOCKET_IFNAME=en,eth,em,bond;
export CXX=g++;
deepspeed --num_gpus 1 --num_nodes 1 \
fastchat/train/train_mem.py \
    --model_name_or_path ../hf-llama-7B  \
    --data_path ../merged.json \
    --bf16 True \
    --output_dir finetune_output \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 512 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --deepspeed deepspeed.json

deepspeed.json:

{
    "zero_optimization":{
        "stage":3,
        "offload_optimizer":{
            "device":"cpu",
            "pin_memory":true
        },
        "overlap_comm":true,
        "contiguous_gradients":true
    },
    "optimizer":{
        "type":"AdamW",
        "params":{
            "lr":"auto",
            "betas":"auto",
            "eps":"auto",
            "weight_decay":"auto"
        }
    },
    "train_micro_batch_size_per_gpu":"auto"
}

But the training process exits with error code -9.

Can you help me?

aresa7796 • Jun 09 '23 06:06

Hi @merrymercy, can you help me, please?

aresa7796 • Jun 09 '23 06:06

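An exit code of -9 means the process was killed by signal 9 (SIGKILL), which typically means the system ran out of memory (host RAM, not GPU memory) and the kernel's OOM killer terminated the training process. You can confirm this in the kernel log, for example /var/log/kern.log on Debian/Ubuntu or the output of dmesg. To fix it, you need to upgrade the machine's memory.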
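
To make the reading of the code concrete, here is a minimal sketch of how a negative return code maps to a signal name. It assumes the launcher reports the code the way Python's subprocess module does (a negative value -N means "terminated by signal N") and that it runs on a POSIX system. The helper name describe_returncode is hypothetical, not part of FastChat or DeepSpeed.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Explain a subprocess-style return code.

    Assumption: a negative value -N means the child was terminated
    by signal N (the convention used by Python's subprocess module).
    """
    if returncode < 0:
        sig = signal.Signals(-returncode)  # e.g. 9 -> SIGKILL on POSIX
        return f"killed by signal {sig.value} ({sig.name})"
    return f"exited normally with status {returncode}"


print(describe_returncode(-9))
```

SIGKILL cannot be caught by the process, which is why the training script leaves no Python traceback and the evidence has to come from the kernel log instead.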

MyQiongbao • Jun 17 '23 03:06
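
As a rough check on why host RAM can run out with the configuration in the question, the sketch below estimates the size of the Adam optimizer state that offload_optimizer with device "cpu" places in host memory. It assumes the usual mixed-precision accounting of 12 bytes per parameter (fp32 master weights, momentum, and variance) and a nominal 7e9 parameters for a "7B" model. Both numbers are assumptions for illustration, and real usage will be higher because of gradient copies and extra buffers. The function name adam_state_gib is hypothetical.

```python
def adam_state_gib(num_params: float, bytes_per_param: int = 12) -> float:
    """Lower-bound estimate of fp32 Adam state size in GiB.

    Assumption: 12 bytes per parameter, i.e. 4 bytes each for the
    fp32 master weights, the momentum, and the variance.
    """
    return num_params * bytes_per_param / 2**30


# Nominal parameter count for a "7B" model (assumed, not measured).
estimate = adam_state_gib(7e9)
print(f"Optimizer state on the host: at least about {estimate:.0f} GiB")
```

If the machine has less free RAM than this estimate, the OOM killer explanation is consistent with the -9 exit code, and either more host memory or a parameter-efficient method with a smaller optimizer state would be needed.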