
AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size

Open xiangchen-zhao opened this issue 1 year ago • 6 comments

I am trying to run the script you provide in "Huggingface Accelerate Integration of Deepspeed":

accelerate launch --config_file configs/accelerate/dist_8gpus_zero3_offload.yaml main_accelerate.py --cfg configs/internimage_t_1k_224.yaml --data-path /mnt/lustre/share/images --batch-size 128 --accumulation-steps 4 --output output_zero3_offload

However, I get an AssertionError:

AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size
4096 != 128 * 4 * 1

I found that this error is raised by an assertion in the deepspeed package (deepspeed/runtime/config.py, line 691).
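For context, DeepSpeed checks that the global train batch size equals micro_batch_per_gpu * gradient_accumulation_steps * world_size, and the numbers in the message hint at the cause. Here is a rough arithmetic sketch (illustrative names only, not DeepSpeed's actual source):

```python
# Rough sketch of the consistency check behind the assertion (illustrative
# variable names, not DeepSpeed's code). The launch command above uses:
micro_batch_per_gpu = 128   # --batch-size
gradient_acc_steps = 4      # --accumulation-steps
num_gpus = 8                # dist_8gpus_zero3_offload.yaml launches 8 processes

# Global batch size implied by those arguments:
print(micro_batch_per_gpu * gradient_acc_steps * num_gpus)  # 4096

# The failing check compares 4096 against 128 * 4 * 1, i.e. world_size == 1,
# so DeepSpeed apparently does not see the 8 processes it should:
print(micro_batch_per_gpu * gradient_acc_steps * 1)  # 512, not 4096
```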

I'm wondering if it's a version issue. Could you share the versions of accelerate and deepspeed you used? Thanks!

My environment:

- CUDA: 11.3
- torch: 1.11.0+cu113
- Python: 3.7.16
- accelerate: 0.18.0
- deepspeed: 0.9.0
- Ubuntu: 18.04
- GPUs: 8 × NVIDIA A10G
- NVIDIA driver (nvidia-smi): 510.47.03

xiangchen-zhao avatar Apr 19 '23 00:04 xiangchen-zhao

Hello, we have tested deepspeed==0.9.0 and it throws the same error you are facing. We suspect it is a compatibility bug between the latest accelerate and deepspeed.

You could use deepspeed==0.8.3 with accelerate==0.18.0; this combination runs successfully on our GPU cluster and should also work on yours.
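If it helps, a quick way to confirm which versions are actually active in the environment after downgrading (both packages expose a `__version__` attribute):

```python
import accelerate
import deepspeed

# Confirm the pinned combination suggested above is what Python actually imports.
print("accelerate:", accelerate.__version__)  # expected: 0.18.0
print("deepspeed:", deepspeed.__version__)    # expected: 0.8.3
```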

Zeqiang-Lai avatar Apr 19 '23 06:04 Zeqiang-Lai

Thanks, it works

xiangchen-zhao avatar Apr 19 '23 20:04 xiangchen-zhao

Same error here.

gjm-anban avatar Sep 11 '23 07:09 gjm-anban

Same error here.

I tried this version, but it didn't work for me.

William9Baker avatar Sep 14 '23 08:09 William9Baker

In my case, initializing TrainingArguments() before calling from_pretrained() caused this error; swapping the order eliminated it.

accelerate 0.23.0 deepspeed 0.11.0
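For reference, a minimal sketch of the ordering described above, with placeholder model and config names ("my-model" and "ds_config.json" are not from this repo); whether the order matters may depend on the transformers/deepspeed versions in use:

```python
from transformers import AutoModelForCausalLM, TrainingArguments

# Ordering reported above: load the model first ...
model = AutoModelForCausalLM.from_pretrained("my-model")  # placeholder name

# ... then construct TrainingArguments with the DeepSpeed config.
training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    deepspeed="ds_config.json",  # placeholder path to a DeepSpeed JSON config
)
```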

lcw99 avatar Oct 22 '23 00:10 lcw99

In my case, I found that I had forgotten to set the parameter in the deepspeed_config to "auto".
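In case it is useful to others, a minimal sketch of the batch-size-related fields set to "auto" (written here as a Python dict; the surrounding fields are only illustrative). With the HuggingFace integration, "auto" lets the launcher fill these values in from the command-line arguments instead of a hard-coded number that can end up inconsistent:

```python
# Illustrative DeepSpeed config fragment with batch-size fields set to "auto"
# so they are derived from the launcher arguments rather than hard-coded.
ds_config = {
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {"stage": 3},  # illustrative; match your own setup
}
```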

liya2001 avatar Oct 25 '23 09:10 liya2001