ChatGLM-6B [BUG/Help] ds_train_finetune.sh: how many resources does multi-GPU training need?

Is there an existing issue for this?

[X] I have searched the existing issues

Current Behavior

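When training on multiple GPUs with ds_train_finetune.sh, I used four 24 GB 3090 cards and still got torch.cuda.OutOfMemoryError: CUDA out of memory.

That doesn't seem right: with 96 GB of GPU memory in total and model parallelism, it still isn't enough. So roughly how many resources does fine-tuning need?
Changing max_source_length or batch_size makes no difference either, which also seems wrong.

My parameter settings are below.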

CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port $MASTER_PORT main.py \
    --deepspeed deepspeed.json \
    --do_train \
    --train_file ./train.json \
    --test_file ./dev.json \
    --prompt_column text \
    --response_column answer \
    --overwrite_cache \
    --model_name_or_path ./chatglm-6b \
    --output_dir ./output_finetune/mining-chatglm-6b-ft-$LR \
    --overwrite_output_dir \
    --max_source_length 512 \
    --max_target_length 100 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 4 \
    --predict_with_generate \
    --max_steps 20000 \
    --logging_steps 100 \
    --save_steps 1000 \
    --learning_rate $LR \
    --fp16

Expected Behavior

No response

Steps To Reproduce

sh ds_train_finetune.sh

Environment

- OS: CentOS
- Python:python3.9
- Transformers:4.28.0.dev0
- PyTorch:2.0.0
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`) :

Anything else?

No response

Apr 12 '23 11:04 RileyShe

+1, I'm hitting the same problem.

Apr 12 '23 13:04 Evilran

wsl2?

Apr 12 '23 13:04 Cherrysaber

I got it working by throwing 8 GPUs at it.

Apr 13 '23 00:04 superbigsea

> I got it working by throwing 8 GPUs at it.

I don't have that many GPUs...

Apr 13 '23 01:04 RileyShe

I don't have enough GPUs, so I wanted to use ZeRO Stage 3. After switching to the config from the official site, I got this error:

File "/root/anaconda3/lib/python3.9/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1128, in _partition_param
    param.ds_tensor.copy_(src_tensor)
NotImplementedError: Cannot copy out of meta tensor; no data!

@duzx16 Is it possible to use ZeRO Stage 3?

Apr 13 '23 02:04 RileyShe

Failed with four 3090s.

Apr 13 '23 03:04 natureLanguageQing

> I got it working by throwing 8 GPUs at it.

How much GPU memory does it need in total? Have you checked with nvidia-smi? ==

Apr 13 '23 05:04 OceannTwT

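Haven't any of you noticed that when DeepSpeed starts, memory usage is a multiple of the number of GPUs? There is no model parallelism: every card loads the same amount of memory. At least that is what I see when testing inference.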

Apr 13 '23 05:04 kevinuserdd

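> > I got it working by throwing 8 GPUs at it.
>
> How much GPU memory does it need in total? Have you checked with nvidia-smi? ==

8 × 40 GB of GPU memory.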

Apr 13 '23 05:04 superbigsea

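> > > I got it working by throwing 8 GPUs at it.
> >
> > How much GPU memory does it need in total? Have you checked with nvidia-smi? ==
>
> 8 × 40 GB of GPU memory.

With V100s, 8 × 32 GB, it still ran out of memory. I really can't afford A100s.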

Apr 13 '23 06:04 OceannTwT

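> Haven't any of you noticed that when DeepSpeed starts, memory usage is a multiple of the number of GPUs? There is no model parallelism: every card loads the same amount of memory. At least that is what I see when testing inference.

My guess is that this is because of ZeRO Stage 2, where the model parameters are not partitioned. But when I try ZeRO Stage 3, I get an error.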

Apr 13 '23 07:04 RileyShe
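
The ZeRO Stage 2 guess above can be sanity-checked with a rough estimate. The following is a minimal sketch, not a measurement: it assumes ChatGLM-6B has about 6.2 billion parameters and uses the usual mixed-precision Adam accounting from the ZeRO paper (2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state per parameter). The function name `zero_model_state_gb` is illustrative, and activations, temporary buffers, and fragmentation are not counted, so real usage will be higher.

```python
def zero_model_state_gb(num_params, num_gpus, stage):
    """Approximate per-GPU memory for model states only, in GB (1e9 bytes).

    Assumes mixed-precision Adam: fp16 weights (2 B/param), fp16 gradients
    (2 B/param), and fp32 optimizer state (12 B/param: master weights,
    momentum, variance). Activations and buffers are NOT included.
    """
    weights = 2 * num_params
    grads = 2 * num_params
    optim = 12 * num_params
    if stage >= 1:  # ZeRO-1 partitions the optimizer state
        optim /= num_gpus
    if stage >= 2:  # ZeRO-2 also partitions the gradients
        grads /= num_gpus
    if stage >= 3:  # ZeRO-3 also partitions the fp16 weights
        weights /= num_gpus
    return (weights + grads + optim) / 1e9


PARAMS = 6.2e9  # assumed parameter count for ChatGLM-6B

for gpus in (4, 8):
    for stage in (0, 1, 2, 3):
        gb = zero_model_state_gb(PARAMS, gpus, stage)
        print(f"{gpus} GPUs, ZeRO-{stage}: ~{gb:.1f} GB per GPU")
```

Under these assumptions, ZeRO-2 on four GPUs needs roughly 34 GB per card for model states alone, which is more than a 3090's 24 GB, and ZeRO-3 on four GPUs needs roughly 25 GB. That is consistent with the later report in this thread that four 3090s only worked with ZeRO-3 plus CPU offload.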

A question: after training with DeepSpeed, how do I run inference?

Apr 13 '23 07:04 superbigsea

Ran out of memory with 7 A100s.

Apr 14 '23 02:04 LLLLLLoki

@duzx16 Following #530, fine-tuning with ZeRO-3 offload still doesn't work. Could you give some guidance?

Apr 14 '23 04:04 Evilran

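> > Haven't any of you noticed that when DeepSpeed starts, memory usage is a multiple of the number of GPUs? There is no model parallelism: every card loads the same amount of memory. At least that is what I see when testing inference.
>
> My guess is that this is because of ZeRO Stage 2, where the model parameters are not partitioned. But when I try ZeRO Stage 3, I get an error.

How did you switch to ZeRO Stage 3? By editing ds_config.json? I don't think that is the cause; more likely the model itself doesn't support layer-wise parallelism. Could you take a look at the DeepSpeed source? The official documentation says: "DeepSpeed provides a seamless inference mode for compatible transformer based models trained using DeepSpeed, Megatron, and HuggingFace, meaning that we don’t require any change on the modeling side such as exporting the model or creating a different checkpoint from your trained checkpoints. To run inference on multi-GPU for compatible models, provide the model parallelism degree and the checkpoint information or the model which is already loaded from a checkpoint, and DeepSpeed will do the rest. It will automatically partition the model as necessary, inject compatible high performance kernels into your model and manage the inter-gpu communication. For list of compatible models please see here." In my tests so far, BLOOM is the only model that can be run model-parallel, and I feel this can't be solved. Do you have any ideas?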

Apr 14 '23 07:04 kevinuserdd

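> > > Haven't any of you noticed that when DeepSpeed starts, memory usage is a multiple of the number of GPUs? There is no model parallelism: every card loads the same amount of memory. At least that is what I see when testing inference.
> >
> > My guess is that this is because of ZeRO Stage 2, where the model parameters are not partitioned. But when I try ZeRO Stage 3, I get an error.
>
> How did you switch to ZeRO Stage 3? By editing ds_config.json? I don't think that is the cause; more likely the model itself doesn't support layer-wise parallelism. Could you take a look at the DeepSpeed source? The official documentation says: "DeepSpeed provides a seamless inference mode for compatible transformer based models trained using DeepSpeed, Megatron, and HuggingFace, meaning that we don’t require any change on the modeling side such as exporting the model or creating a different checkpoint from your trained checkpoints. To run inference on multi-GPU for compatible models, provide the model parallelism degree and the checkpoint information or the model which is already loaded from a checkpoint, and DeepSpeed will do the rest. It will automatically partition the model as necessary, inject compatible high performance kernels into your model and manage the inter-gpu communication. For list of compatible models please see here." In my tests so far, BLOOM is the only model that can be run model-parallel, and I feel this can't be solved. Do you have any ideas?

You got bloomz-mt-7b1 running? What configuration did you use? Please share.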

Apr 14 '23 07:04 Fcc-Roy

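> @duzx16 Following #530, fine-tuning with ZeRO-3 offload still doesn't work. Could you give some guidance?

The earlier failure was a bug, and it has since been fixed. Use the latest code, add empty_init=False where the model is loaded, and run in ZeRO-3 mode. I tried it: four 3090s can run it, with about 16 GB of GPU memory used per card and roughly 100 GB of CPU offload.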

Apr 14 '23 07:04 Fcc-Roy

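> > @duzx16 Following #530, fine-tuning with ZeRO-3 offload still doesn't work. Could you give some guidance?
>
> The earlier failure was a bug, and it has since been fixed. Use the latest code, add empty_init=False where the model is loaded, and run in ZeRO-3 mode. I tried it: four 3090s can run it, with about 16 GB of GPU memory used per card and roughly 100 GB of CPU offload.

How exactly? Please post the core code. Also, I don't think my problem is the same as the one you describe. My point is that when running ChatGLM inference with DeepSpeed, the model is not parallelized: 15 GB with 1 card, 30 GB with 2 cards, 45 GB with 3 cards. I'm talking about the deepspeed.init_inference() inference stage. I don't quite follow your explanation, so it would be best to post some code; inference usually takes only about 10 lines.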

Apr 14 '23 09:04 kevinuserdd

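> > @duzx16 Following #530, fine-tuning with ZeRO-3 offload still doesn't work. Could you give some guidance?
>
> The earlier failure was a bug, and it has since been fixed. Use the latest code, add empty_init=False where the model is loaded, and run in ZeRO-3 mode. I tried it: four 3090s can run it, with about 16 GB of GPU memory used per card and roughly 100 GB of CPU offload.

I updated to the latest model and the latest code (master).
When loading the model, I added the corresponding argument (model = AutoModel.from_pretrained(model_args.model_name_or_path, config=config, trust_remote_code=True, empty_init=False)).

ZeRO-3 still fails.

@Fcc-Roy Could you share your deepspeed.json config?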

Apr 14 '23 10:04 RileyShe

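> > > @duzx16 Following #530, fine-tuning with ZeRO-3 offload still doesn't work. Could you give some guidance?
> >
> > The earlier failure was a bug, and it has since been fixed. Use the latest code, add empty_init=False where the model is loaded, and run in ZeRO-3 mode. I tried it: four 3090s can run it, with about 16 GB of GPU memory used per card and roughly 100 GB of CPU offload.
>
> How exactly? Please post the core code. Also, I don't think my problem is the same as the one you describe. My point is that when running ChatGLM inference with DeepSpeed, the model is not parallelized: 15 GB with 1 card, 30 GB with 2 cards, 45 GB with 3 cards. I'm talking about the deepspeed.init_inference() inference stage. I don't quite follow your explanation, so it would be best to post some code; inference usually takes only about 10 lines.

I was talking about the training stage. Inference should be fine: I just load the model with model = AutoModel.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.float16).half().cuda(). I only run inference on a single card, though, which uses 14 GB of GPU memory. I haven't tried multiple cards; device_map="auto" is usually enough for that. DeepSpeed isn't really model parallelism to begin with; it is mainly meant for training.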

Apr 14 '23 10:04 Fcc-Roy

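> > > @duzx16 Following #530, fine-tuning with ZeRO-3 offload still doesn't work. Could you give some guidance?
> >
> > The earlier failure was a bug, and it has since been fixed. Use the latest code, add empty_init=False where the model is loaded, and run in ZeRO-3 mode. I tried it: four 3090s can run it, with about 16 GB of GPU memory used per card and roughly 100 GB of CPU offload.
>
> I updated to the latest model and the latest code (master).
>
> When loading the model, I added the corresponding argument (model = AutoModel.from_pretrained(model_args.model_name_or_path, config=config, trust_remote_code=True, empty_init=False)).
>
> ZeRO-3 still fails.
>
> @Fcc-Roy Could you share your deepspeed.json config?

@RileyShe I based mine on https://github.com/OptimalScale/LMFlow/blob/main/configs/ds_config_zero3.json and basically copied the whole zero_optimization and optimizer sections.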

Apr 14 '23 10:04 Fcc-Roy
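
For readers who want a starting point, here is a minimal sketch of what a ZeRO-3 configuration with CPU offload can look like. It is not the contents of the LMFlow file linked above and not the exact config used in this thread; it only uses keys documented for DeepSpeed and the Hugging Face Trainer integration (`offload_optimizer`, `offload_param`, `stage3_gather_16bit_weights_on_model_save`, and the `"auto"` placeholders). The output file name `ds_config_zero3_offload.json` is a made-up example.

```python
import json

# Illustrative ZeRO-3 + CPU offload config (an assumption, not the thread's file).
# "auto" values are filled in by the Hugging Face Trainer from its own arguments.
zero3_offload_config = {
    "fp16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,
        # Keep optimizer state in pinned CPU memory instead of on the GPU.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        # Keep the partitioned parameters in CPU memory as well.
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
        # Gather full fp16 weights when saving so the checkpoint is loadable.
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

config_text = json.dumps(zero3_offload_config, indent=2)
print(config_text)
# To use it, save config_text to a file (for example the hypothetical
# ds_config_zero3_offload.json) and pass that path to --deepspeed.
```

Offloading both optimizer state and parameters trades GPU memory for host RAM and speed, which matches the roughly 100 GB of CPU memory reported above.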

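> > > @duzx16 Following #530, fine-tuning with ZeRO-3 offload still doesn't work. Could you give some guidance?
> >
> > The earlier failure was a bug, and it has since been fixed. Use the latest code, add empty_init=False where the model is loaded, and run in ZeRO-3 mode. I tried it: four 3090s can run it, with about 16 GB of GPU memory used per card and roughly 100 GB of CPU offload.
>
> I updated to the latest model and the latest code (master).
>
> When loading the model, I added the corresponding argument (model = AutoModel.from_pretrained(model_args.model_name_or_path, config=config, trust_remote_code=True, empty_init=False)).
>
> ZeRO-3 still fails.
>
> @Fcc-Roy Could you share your deepspeed.json config?

+1, I tried all of this too and it still fails. I have 8 A100s with 40 GB of memory each.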

Apr 15 '23 19:04 danyang-rainbow

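> > > > @duzx16 Following #530, fine-tuning with ZeRO-3 offload still doesn't work. Could you give some guidance?
> > >
> > > The earlier failure was a bug, and it has since been fixed. Use the latest code, add empty_init=False where the model is loaded, and run in ZeRO-3 mode. I tried it: four 3090s can run it, with about 16 GB of GPU memory used per card and roughly 100 GB of CPU offload.
> >
> > How exactly? Please post the core code. Also, I don't think my problem is the same as the one you describe. My point is that when running ChatGLM inference with DeepSpeed, the model is not parallelized: 15 GB with 1 card, 30 GB with 2 cards, 45 GB with 3 cards. I'm talking about the deepspeed.init_inference() inference stage. I don't quite follow your explanation, so it would be best to post some code; inference usually takes only about 10 lines.
>
> I was talking about the training stage. Inference should be fine: I just load the model with model = AutoModel.from_pretrained(model_path, trust_remote_code=True, torch_dtype=torch.float16).half().cuda(). I only run inference on a single card, though, which uses 14 GB of GPU memory. I haven't tried multiple cards; device_map="auto" is usually enough for that. DeepSpeed isn't really model parallelism to begin with; it is mainly meant for training.

No, this is a bug. I know 14 GB is enough for single-GPU inference. But with ChatGLM, multi-GPU inference multiplies the memory usage, whereas with BLOOM it doesn't: the memory is spread across the cards.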

Apr 17 '23 02:04 kevinuserdd

Running ds_train_finetune.sh on two 3090s also reports out of memory. Has anyone solved this?

Apr 20 '23 03:04 younger-diao

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 11.50 GiB (GPU 0; 79.35 GiB total capacity; 34.50 GiB already allocated; 6.89 GiB free; 34.51 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Apr 21 '23 14:04 peterzhang2029

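> > > > @duzx16 Following #530, fine-tuning with ZeRO-3 offload still doesn't work. Could you give some guidance?
> > >
> > > The earlier failure was a bug, and it has since been fixed. Use the latest code, add empty_init=False where the model is loaded, and run in ZeRO-3 mode. I tried it: four 3090s can run it, with about 16 GB of GPU memory used per card and roughly 100 GB of CPU offload.
> >
> > I updated to the latest model and the latest code (master).
> >
> > When loading the model, I added the corresponding argument (model = AutoModel.from_pretrained(model_args.model_name_or_path, config=config, trust_remote_code=True, empty_init=False)).
> >
> > ZeRO-3 still fails. @Fcc-Roy Could you share your deepspeed.json config?
>
> @RileyShe I based mine on https://github.com/OptimalScale/LMFlow/blob/main/configs/ds_config_zero3.json and basically copied the whole zero_optimization and optimizer sections.

Which versions of deepspeed and transformers are you using?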

Apr 23 '23 06:04 yang1997yi

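On my side, a single 3090 with ZeRO-2 and CPU offload runs fine (8 cards make training faster but do not raise the model's upper limit; the exact reason is unknown). Script for reference: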

LR=1e-4
  
MASTER_PORT=$(shuf -n 1 -i 10000-65535)

deepspeed --num_gpus=1 --master_port $MASTER_PORT main.py \
    --deepspeed zero2_off_cpu.json \
    --do_train \
    --train_file AdvertiseGen/train.json \
    --test_file AdvertiseGen/dev.json \
    --prompt_column content \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path /xdl/public/6515/glm6b/chatglm-6b \
    --output_dir ./output/adgen-chatglm-6b-ft-$LR \
    --overwrite_output_dir \
    --max_source_length 256 \
    --max_target_length 256 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --predict_with_generate \
    --max_steps 1000 \
    --logging_steps 10 \
    --save_steps 250 \
    --learning_rate $LR \
    --fp16

DeepSpeed config for reference:

{
  "fp16": {
      "enabled": "auto",
      "loss_scale": 0,
      "loss_scale_window": 1000,
      "initial_scale_power": 16,
      "hysteresis": 2,
      "min_loss_scale": 1
  },

  "optimizer": {
      "type": "AdamW",
      "params": {
          "lr": "auto",
          "betas": "auto",
          "eps": "auto",
          "weight_decay": "auto"
      }
  },

  "scheduler": {
      "type": "WarmupLR",
      "params": {
          "warmup_min_lr": "auto",
          "warmup_max_lr": "auto",
          "warmup_num_steps": "auto"
      }
  },

  "zero_optimization": {
      "stage": 2,
      "offload_optimizer": {
          "device": "cpu",
          "pin_memory": true
      },
      "allgather_partitions": true,
      "allgather_bucket_size": 2e8,
      "overlap_comm": true,
      "reduce_scatter": true,
      "reduce_bucket_size": 2e8,
      "contiguous_gradients": true
  },

  "csv_monitor" : {
    "enabled": true,
    "job_name" : "stage2_test"
  },

  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 100,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

Apr 25 '23 03:04 luohuan02
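
As a side note on the `"auto"` batch-size fields in the config above: DeepSpeed requires the global train batch size to equal the per-GPU micro-batch size times the gradient accumulation steps times the number of GPUs, and the Hugging Face Trainer fills the `"auto"` values from the command-line arguments. A small sketch of that arithmetic, using the two launch commands from this thread (the helper name `effective_batch_size` is illustrative):

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_gpus):
    """Global batch size per optimizer step:
    micro-batch per GPU x accumulation steps x number of GPUs."""
    return per_device_batch * grad_accum_steps * num_gpus


# Single 3090 script above: batch 8, accumulation 8, 1 GPU.
single_3090 = effective_batch_size(8, 8, 1)

# Original four-3090 command: batch 4, accumulation 4, 4 GPUs.
four_3090 = effective_batch_size(4, 4, 4)

print(single_3090, four_3090)  # 64 64
```

Both runs take the same number of samples per optimizer step. Only the per-device micro-batch and the sequence lengths affect activation memory on each card, so accumulation steps can be raised to keep the global batch size while lowering the micro-batch.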

Has anyone solved the GPU out-of-memory problem?

Apr 27 '23 06:04 Kino521

OutOfMemoryError: CUDA out of memory. Tried to allocate 22.99 GiB (GPU 0; 39.56 GiB total capacity; 22.99 GiB already allocated; 15.27 GiB free; 23.00 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Apr 27 '23 06:04 Kino521

I set import os; os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:12560". It had no effect.

Apr 27 '23 06:04 Kino521
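
One way to see why max_split_size_mb may not help here is to read the numbers in the error message itself. The sketch below parses the OOM text quoted above and compares the request with the free memory and the reserved-versus-allocated gap. The helper names, the regular expression, and the 1 GiB `slack_gib` threshold are illustrative assumptions, not part of PyTorch.

```python
import re

# Matches the PyTorch 2.0-style CUDA OOM message quoted in this thread.
OOM_PATTERN = re.compile(
    r"Tried to allocate (?P<request>[\d.]+) GiB \(GPU \d+; "
    r"(?P<total>[\d.]+) GiB total capacity; "
    r"(?P<allocated>[\d.]+) GiB already allocated; "
    r"(?P<free>[\d.]+) GiB free; "
    r"(?P<reserved>[\d.]+) GiB reserved"
)


def parse_oom(message):
    """Extract the GiB figures from a CUDA OOM message as floats."""
    match = OOM_PATTERN.search(message)
    if match is None:
        raise ValueError("not a recognised CUDA OOM message")
    return {key: float(value) for key, value in match.groupdict().items()}


def diagnose(stats, slack_gib=1.0):
    """Rough heuristic: fragmentation is only suspected when PyTorch has
    reserved much more memory than it has actually allocated."""
    cached_unused = stats["reserved"] - stats["allocated"]
    return {
        "request_exceeds_free": stats["request"] > stats["free"],
        "cached_but_unused_gib": round(cached_unused, 2),
        "fragmentation_suspected": cached_unused > slack_gib,
    }


OOM_MESSAGE = (
    "OutOfMemoryError: CUDA out of memory. Tried to allocate 22.99 GiB "
    "(GPU 0; 39.56 GiB total capacity; 22.99 GiB already allocated; "
    "15.27 GiB free; 23.00 GiB reserved in total by PyTorch)"
)

stats = parse_oom(OOM_MESSAGE)
report = diagnose(stats)
print(stats)
print(report)
```

In this message, reserved memory (23.00 GiB) is almost equal to allocated memory (22.99 GiB), and the single 22.99 GiB request is larger than the 15.27 GiB that is free, so the failure looks like a genuine capacity shortfall rather than fragmentation. Separately, PyTorch reads PYTORCH_CUDA_ALLOC_CONF when its CUDA allocator initializes, so it is usually safest to export the variable in the shell before launching rather than setting it inside the script.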
