
AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size

Open xiangchen-zhao opened this issue 1 year ago • 6 comments

I am trying to run the script you provide in "Huggingface Accelerate Integration of Deepspeed":

accelerate launch --config_file configs/accelerate/dist_8gpus_zero3_offload.yaml main_accelerate.py --cfg configs/internimage_t_1k_224.yaml --data-path /mnt/lustre/share/images --batch-size 128 --accumulation-steps 4 --output output_zero3_offload

However, I get an AssertionError:

AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size
4096 != 128 * 4 * 1

I found that this error is raised by an assertion in the deepspeed package (deepspeed/runtime/config.py, line 691).
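For context, DeepSpeed checks that the global train batch size equals micro_batch_per_gpu * gradient_accumulation_steps * world_size, and the numbers in the message hint at the cause. Here is a rough arithmetic sketch (illustrative names only, not DeepSpeed's actual source):

```python
# Rough sketch of the consistency check behind the assertion (illustrative
# variable names, not DeepSpeed's code). The launch command above uses:
micro_batch_per_gpu = 128   # --batch-size
gradient_acc_steps = 4      # --accumulation-steps
num_gpus = 8                # dist_8gpus_zero3_offload.yaml launches 8 processes

# Global batch size implied by those arguments:
print(micro_batch_per_gpu * gradient_acc_steps * num_gpus)  # 4096

# The failing check compares 4096 against 128 * 4 * 1, i.e. world_size == 1,
# so DeepSpeed apparently does not see the 8 processes it should:
print(micro_batch_per_gpu * gradient_acc_steps * 1)  # 512, not 4096
```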

I'm wondering if it's a version issue. Could you share the versions of accelerate and deepspeed you used? Thanks!

My environment:

- CUDA: 11.3
- torch: 1.11.0+cu113
- Python: 3.7.16
- accelerate: 0.18.0
- deepspeed: 0.9.0
- Ubuntu: 18.04
- GPUs: 8 × NVIDIA A10G
- NVIDIA driver (nvidia-smi): 510.47.03

xiangchen-zhao avatar Apr 19 '23 00:04 xiangchen-zhao

Hello, we have tested deepspeed==0.9.0 and it throws the same error you are facing. We suspect it is a compatibility bug between the latest accelerate and deepspeed.

You could use deepspeed==0.8.3 with accelerate==0.18.0; this combination runs successfully on our GPU cluster and should also work on yours.
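If it helps, a quick way to confirm which versions are actually active in the environment after downgrading (both packages expose a `__version__` attribute):

```python
import accelerate
import deepspeed

# Confirm the pinned combination suggested above is what Python actually imports.
print("accelerate:", accelerate.__version__)  # expected: 0.18.0
print("deepspeed:", deepspeed.__version__)    # expected: 0.8.3
```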

Zeqiang-Lai avatar Apr 19 '23 06:04 Zeqiang-Lai

Thanks, it works

xiangchen-zhao avatar Apr 19 '23 20:04 xiangchen-zhao

Same error here.

gjm-anban avatar Sep 11 '23 07:09 gjm-anban

Same error here.

I tried this version, but it didn't work for me.

William9Baker avatar Sep 14 '23 08:09 William9Baker

In my case, initializing TrainingArguments() before calling from_pretrained() caused this error; swapping the order eliminated it.

accelerate 0.23.0 deepspeed 0.11.0
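For reference, a minimal sketch of the ordering described above, with placeholder model and config names ("my-model" and "ds_config.json" are not from this repo); whether the order matters may depend on the transformers/deepspeed versions in use:

```python
from transformers import AutoModelForCausalLM, TrainingArguments

# Ordering reported above: load the model first ...
model = AutoModelForCausalLM.from_pretrained("my-model")  # placeholder name

# ... then construct TrainingArguments with the DeepSpeed config.
training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    deepspeed="ds_config.json",  # placeholder path to a DeepSpeed JSON config
)
```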

lcw99 avatar Oct 22 '23 00:10 lcw99

In my case, I found that I had forgotten to set the parameter in the deepspeed_config to "auto".
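In case it is useful to others, a minimal sketch of the batch-size-related fields set to "auto" (written here as a Python dict; the surrounding fields are only illustrative). With the HuggingFace integration, "auto" lets the launcher fill these values in from the command-line arguments instead of a hard-coded number that can end up inconsistent:

```python
# Illustrative DeepSpeed config fragment with batch-size fields set to "auto"
# so they are derived from the launcher arguments rather than hard-coded.
ds_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {"stage": 3},  # illustrative; match your own setup
}
```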

liya2001 avatar Oct 25 '23 09:10 liya2001