[BUG] assert all_groups_norm > 0 | Error seemingly related to the BF16 optimizer
Describe the bug
In many of my training runs with bf16 enabled in the config, training crashes after a few hundred steps with the assertion error:
"assert all_groups_norm > 0"
To Reproduce
Steps to reproduce the behavior: I am not entirely sure what causes it, aside from having bf16 enabled in some runs. Here is the config I have been using:
deepspeed:
  gradient_accumulation_steps: 1
  steps_per_print: 2000
  optimizer:
    type: "Adam"
    params:
      lr: 1.0e-4
      betas: [0.9, 0.985]
      eps: 1.0e-8
      weight_decay: 0.05
  scheduler:
    type: "WarmupDecayLR"
    params:
      warmup_min_lr: 0
      warmup_max_lr: ${deepspeed.optimizer.params.lr}
      warmup_num_steps: 250
      warmup_type: "linear"
      total_num_steps: 20000
  gradient_clipping: 1.0
  prescale_gradients: False
  bf16:
    enabled: True
  wall_clock_breakdown: True
  zero_optimization:
    stage: 0
    allgather_partitions: True
    allgather_bucket_size: 2e8
    overlap_comm: True
    reduce_scatter: True
    reduce_bucket_size: 2e8
    contiguous_gradients: True
    zero_quantized_nontrainable_weights: False
  flops_profiler:
    enabled: False
    profile_step: 1
    module_depth: -1
    top_modules: 1
    detailed: True
    output_file: null
  activation_checkpointing:
    partition_activation: False  # Enables partition activation when used with model parallelism
    cpu_checkpointing: False
    contiguous_memory_optimization: False
    number_checkpoints: None
    synchronize_checkpoint_boundary: False
    profile: False
  comms_logger:
    enabled: True
    verbose: False
    prof_all: True
    debug: False
Expected behavior
I would expect training not to crash, though I'm sure the assertion is there for a reason.
ds_report output
Please run ds_report to give us details about your setup.
Screenshots
If applicable, add screenshots to help explain your problem.
System info (please complete the following information):
- OS: Ubuntu 22.04
- GPU count and types: 2x A100 on an 8x node
- Interconnects (if applicable)
- Python version: 3.8.18
- Any other relevant info about your setup
Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
No, using srun torchrun train.py --deepspeed
Docker context Are you using a specific docker image that you can share?
N/a
Additional context Add any other context about the problem here.
It only happens with some of the models I've trained.
I also encountered this bug when using deepspeed==0.13.1 with the Hugging Face Transformers Trainer. However, after I upgraded DeepSpeed to 0.13.5, the bug disappeared. So maybe you can try:
pip3 install deepspeed==0.13.5
I still encountered this bug with 0.13.5 and 0.14.0 after switching to other machines.
I also encountered the error "assert all_groups_norm > 0". Does anybody know how to solve it?
I think the error suggests vanishing gradients, but it's strange that I don't see it when using fp16 or full precision.
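One possible contributing factor (a guess, not a confirmed diagnosis): bf16 has the same exponent range as fp32 but only 8 bits of mantissa, so small gradient contributions can be rounded away when accumulated into larger values, and bf16 training typically runs without the dynamic loss scaling used on the fp16 path. A quick illustration of the precision difference, plus a hypothetical helper for checking whether gradients really are all zero at the failing step:

```python
import torch

# bf16 keeps fp32's exponent range but only ~3 significant decimal digits,
# so a small increment added to a larger value can vanish entirely, while
# fp16 (more mantissa bits) still resolves it.
print(torch.tensor(1.0, dtype=torch.bfloat16) + torch.tensor(1e-3, dtype=torch.bfloat16))  # tensor(1., dtype=torch.bfloat16)
print(torch.tensor(1.0, dtype=torch.float16) + torch.tensor(1e-3, dtype=torch.float16))    # ~tensor(1.0010, dtype=torch.float16)

# Hypothetical diagnostic (not part of DeepSpeed): log per-parameter gradient
# norms in fp32 right before optimizer.step() to see whether they are all zero.
def log_grad_norms(model):
    for name, p in model.named_parameters():
        if p.grad is not None:
            print(name, p.grad.detach().float().norm().item())
```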
I encountered the same error "assert all_groups_norm > 0". Does anyone have a solution?
any solution?
I encountered the same issue. Is there any solution?
... /deepspeed/runtime/bf16_optimizer.py", line 312, in step
[rank0]:     assert all_groups_norm > 0.
[rank0]: AssertionError
deepspeed 0.15.0 transformers 4.44.2
I resolved this issue. In my case, the cause was not related to the versions of DeepSpeed, Transformers, or other dependencies. The problem was the model checkpoint for "clip-vit-large-patch14", which seemed to be corrupted, though I'm not sure why. After re-downloading it from Hugging Face, the issue was resolved.
The assertion error above occurred when I used DeepSpeed with zero0.json. When I used zero1 or zero2, the loss was zero and grad_norm was NaN instead.
All of these issues went away after re-downloading the CLIP model.
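For anyone hitting the same symptom, a minimal sketch of forcing a clean re-download with Transformers (this assumes the standard "openai/clip-vit-large-patch14" repo id; substitute whatever checkpoint your setup actually loads):

```python
from transformers import CLIPModel, CLIPProcessor

# force_download=True bypasses any (possibly corrupted) cached files and
# re-fetches the checkpoint from the Hugging Face Hub.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14", force_download=True)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14", force_download=True)
```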
What's your weight_decay setting? 1e-2 can be too large for certain tasks, especially in highly unbalanced classification tasks.