AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size
Hi,
I am working on fine-tuning StarCoder by following the README in the /chat directory, and I am running into the following assertion error:
AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 256 != 4 * 8 * 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 91809) of binary: /home/ubuntu/anaconda3/envs/chat/bin/python
This happens when I run:
TRANSFORMERS_VERBOSITY=info torchrun --nproc_per_node=8 train.py config.yaml --deepspeed=deepspeed_z3_config_bf16.json
System info:
- OS: Ubuntu 22.04.5
- GPU count and types: 8 X A100 (80GB) GPUs
- Python version: 3.10
- deepspeed: 0.9.2
- accelerate: 0.19.0
Has anyone else encountered this? It looks very similar to a previously reported issue; it seems the world_size seen by the DeepSpeed package is always 1.
Any pointers will be greatly appreciated. Thanks in advance.
Here are my notes from further investigating the issue.
The root cause of the train_batch_size != micro_batch_per_gpu * gradient_acc_step * world_size mismatch (256 != 4 * 8 * 1) is that the DeepSpeed distributed environment is never set up, so world_size defaults to 1.
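For context, DeepSpeed checks that the configured train_batch_size equals micro_batch_per_gpu * gradient_acc_step * world_size. The snippet below is only a simplified illustration of that arithmetic (not DeepSpeed's actual code), using the numbers from the error message; the batch values are assumed to come from deepspeed_z3_config_bf16.json:

```python
# Simplified illustration of DeepSpeed's batch-size consistency check,
# NOT the library's actual code. Numbers are taken from the error above.
train_batch_size = 256        # assumed to be set in deepspeed_z3_config_bf16.json
micro_batch_per_gpu = 4
gradient_acc_steps = 8

for world_size in (1, 8):
    product = micro_batch_per_gpu * gradient_acc_steps * world_size
    verdict = "OK" if product == train_batch_size else "AssertionError"
    print(f"world_size={world_size}: 4 * 8 * {world_size} = {product} -> {verdict}")

# world_size=1: 4 * 8 * 1 = 32  -> AssertionError  (distributed env not initialized)
# world_size=8: 4 * 8 * 8 = 256 -> OK              (all 8 torchrun ranks counted)
```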
A "DeepSpeed backend not set, please initialize it using init_process_group()" exception is caught in the except block, which is why world_size silently falls back to 1. One workaround is to call deepspeed.init_distributed() in the main function of train.py.
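A minimal sketch of that workaround, assuming train.py has a main() entry point (the surrounding structure below is hypothetical; only the deepspeed.init_distributed() call is the actual fix):

```python
import deepspeed

def main():
    # Workaround: explicitly initialize the distributed backend so that
    # world_size reflects the 8 ranks launched by torchrun instead of
    # silently falling back to 1.
    deepspeed.init_distributed()

    # ... the rest of train.py's main(): parse config.yaml, build the model,
    # and start training as before ...

if __name__ == "__main__":
    main()
```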
This workaround works for me.