Question about fp16 dynamic loss scale overflow in DeepSpeed
Hi, I am running into an issue when using DeepSpeed for mixed precision training (fp16 with dynamic loss scaling), and I would really appreciate some help with it.
Currently, I am using DeepSpeed for mixed precision training (fp16 with dynamic loss scaling) to reproduce CodeBERT. With lr=1e-4 (a relatively small learning rate), everything goes well: the training loss curve is smooth and there are only a few fp16 dynamic loss scale overflows. However, with lr=5e-4 (a relatively large learning rate, following the original paper's setting), after training for about 10k steps the training loss rises dramatically, and after the spike the validation accuracy drops to a very low level (even 0). Before the spike, the dynamic loss scale gradually rises to 32768.0; after it, the loss scale is reduced step by step down to 1024.0, but the training loss remains abnormal.
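For context, this is my own simplified sketch of how a dynamic loss scaler typically behaves (not DeepSpeed's actual implementation; DeepSpeed additionally applies hysteresis, which I omit here). It is just to illustrate why the scale climbs to values like 32768 and then drops after overflows:

    # Toy sketch of dynamic loss scaling, for illustration only.
    class ToyDynamicLossScaler:
        def __init__(self, init_scale=2.0 ** 16, scale_window=1000, min_scale=1.0):
            self.scale = init_scale          # current loss scale
            self.scale_window = scale_window # clean steps required before growing
            self.min_scale = min_scale
            self.clean_steps = 0

        def update(self, found_overflow):
            if found_overflow:
                # On overflow the step is skipped and the scale is halved.
                self.scale = max(self.scale / 2.0, self.min_scale)
                self.clean_steps = 0
            else:
                self.clean_steps += 1
                if self.clean_steps >= self.scale_window:
                    # After scale_window consecutive clean steps, the scale doubles.
                    self.scale *= 2.0
                    self.clean_steps = 0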
This is my DeepSpeed configuration:
ds_config = {
    "train_batch_size": args.batch_size,
    "train_micro_batch_size_per_gpu": args.micro_batch_size,
    "steps_per_print": 10000,
    "gradient_clipping": 1.0,
    "wall_clock_breakdown": False,
    "fp16": {
        "enabled": args.fp16,
        "loss_scale": 0.0,
        "loss_scale_window": 1000,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "local_rank": args.local_rank,
    "zero_optimization": {
        "stage3_gather_16bit_weights_on_model_save": True
    }
}
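For completeness, this is roughly how I pass the config to the engine (a minimal sketch; args, model, and the training batch come from my training script and are placeholders here):

    import deepspeed

    # Build the DeepSpeed engine from the config dict above.
    model_engine, optimizer, _, _ = deepspeed.initialize(
        args=args,
        model=model,
        model_parameters=model.parameters(),
        config=ds_config,
    )

    # One training step; DeepSpeed applies the fp16 loss scaling inside
    # backward()/step(). Assumes the model's forward pass returns the loss.
    loss = model_engine(batch)
    model_engine.backward(loss)
    model_engine.step()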
This picture shows the training loss and the learning rate before and after the huge increase:

I've been working on this problem for a while but haven't been able to solve it. Could you please give me some advice on this issue? For example: what could be the cause of this problem? Would it help to enlarge loss_scale_window, cap the dynamic loss scale at a smaller value, or change other DeepSpeed settings (see the sketches below)?
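For concreteness, these are the kinds of fp16 adjustments I mean. Both are just sketches; the variable names and values are guesses on my part, not a tuned recipe:

    # Option A: keep dynamic scaling (loss_scale = 0.0) but widen the window,
    # so the scale grows back more slowly after each overflow.
    fp16_option_a = {
        "enabled": True,
        "loss_scale": 0.0,          # 0.0 means dynamic loss scaling
        "loss_scale_window": 2000,  # was 1000
        "hysteresis": 2,
        "min_loss_scale": 1
    }

    # Option B: a non-zero "loss_scale" switches DeepSpeed to static loss
    # scaling, which keeps the scale fixed at this value (1024 is a guess).
    fp16_option_b = {
        "enabled": True,
        "loss_scale": 1024
    }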
@natedingyifeng, apologies for the delayed response. Debugging this kind of issue is notoriously difficult because there are many factors and interactions involved. Large learning rates do tend to cause training instability, especially with mixed precision training. Configuring the loss scale parameters to address instability is often trial-and-error and thus difficult. Have you tried reaching out to the CodeBERT authors to find out whether they encountered this kind of problem and to ask them to share their loss scaling recipe for mixed precision training?
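One thing that is sometimes worth trying when a large learning rate destabilizes fp16 training is warming the learning rate up rather than starting at the full value. DeepSpeed has a built-in WarmupLR scheduler that can be configured from the same config; the numbers below are placeholders, not a tuned recommendation:

    # Sketch of DeepSpeed's WarmupLR scheduler section; step count and
    # learning rates are placeholders.
    scheduler_section = {
        "scheduler": {
            "type": "WarmupLR",
            "params": {
                "warmup_min_lr": 0,
                "warmup_max_lr": 5e-4,   # the target lr from the original recipe
                "warmup_num_steps": 1000
            }
        }
    }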
@natedingyifeng have you resolved this problem? We are encountering the same problem.
Any updates on this, or any suggestions @tjruwase? I am facing the same issue while training GPT-J from HuggingFace with DeepSpeed. I have tried with a learning rate of 5e-05 on two older Tesla V100 GPUs with 32GB each.
Same issue here with V100
@lucasjinreal, @SanchiMittal, this thread went cold due to lack of response from OP. Like I shared earlier, this is a very challenging problem to debug. It seems you are running a different model, so it is unclear whether your issues have the same root cause. Can you share more details of your own experience and whether you are following an existing training recipe?
For anyone's reference, I seem to be able to continue training with fp16 by lowering the learning rate.
I think the root cause was that fp16 training is very sensitive to the learning rate.
@lucasjinreal, thanks for the update. And that makes sense since hyperparameter tuning is well known to be tricky.