AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size
Hi,
I am working on fine-tuning StarCoder by following the README in the /chat directory, and I am running into the following assertion error:
AssertionError: Check batch related parameters. train_batch_size is not equal to micro_batch_per_gpu * gradient_acc_step * world_size 256 != 4 * 8 * 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 91809) of binary: /home/ubuntu/anaconda3/envs/chat/bin/python
This happens when I run:
TRANSFORMERS_VERBOSITY=info torchrun --nproc_per_node=8 train.py config.yaml --deepspeed=deepspeed_z3_config_bf16.json
System info:
- OS: Ubuntu 22.04.5
- GPU count and types: 8 X A100 (80GB) GPUs
- Python version: 3.10
- deepspeed: 0.9.2
- accelerate: 0.19.0
Has anyone else encountered this? It looks very similar to a previously reported issue; it seems the world_size seen by the DeepSpeed package is always 1.
Any pointers will be greatly appreciated. Thanks in advance.
Here are my notes from further investigating the issue.
The root cause of the train_batch_size != micro_batch_per_gpu * gradient_acc_step * world_size mismatch (256 != 4 * 8 * 1) is that the DeepSpeed distributed environment is never set up, so world_size defaults to 1.
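For context, DeepSpeed checks that the configured train_batch_size equals micro_batch_per_gpu * gradient_acc_step * world_size. The snippet below is only a simplified illustration of that arithmetic (not DeepSpeed's actual code), using the numbers from the error message; the batch values are assumed to come from deepspeed_z3_config_bf16.json:

```python
# Simplified illustration of DeepSpeed's batch-size consistency check,
# NOT the library's actual code. Numbers are taken from the error above.
train_batch_size = 256        # assumed to be set in deepspeed_z3_config_bf16.json
micro_batch_per_gpu = 4
gradient_acc_steps = 8

for world_size in (1, 8):
    product = micro_batch_per_gpu * gradient_acc_steps * world_size
    verdict = "OK" if product == train_batch_size else "AssertionError"
    print(f"world_size={world_size}: 4 * 8 * {world_size} = {product} -> {verdict}")

# world_size=1: 4 * 8 * 1 = 32  -> AssertionError  (distributed env not initialized)
# world_size=8: 4 * 8 * 8 = 256 -> OK              (all 8 torchrun ranks counted)
```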
A "DeepSpeed backend not set, please initialize it using init_process_group()" exception is caught in the except block, which is why world_size silently falls back to 1. One workaround is to call deepspeed.init_distributed() in the main function of train.py.
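A minimal sketch of that workaround, assuming train.py has a main() entry point (the surrounding structure below is hypothetical; only the deepspeed.init_distributed() call is the actual fix):

```python
import deepspeed

def main():
    # Workaround: explicitly initialize the distributed backend so that
    # world_size reflects the 8 ranks launched by torchrun instead of
    # silently falling back to 1.
    deepspeed.init_distributed()

    # ... the rest of train.py's main(): parse config.yaml, build the model,
    # and start training as before ...

if __name__ == "__main__":
    main()
```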
This workaround works for me.