
loss stuck in overflow for RPE position embedding together with sparse attention

Open · sweinbach opened this issue 4 years ago • 1 comment

Describe the bug
The loss for the RPE position embedding does not go down; every iteration hits an fp16 gradient overflow and the optimizer step is skipped (see the log below).

```
[2021-05-04 15:45:14,710] [INFO] [unfused_optimizer.py:246:_update_scale] Grad overflow on iteration: 50
[2021-05-04 15:45:14,710] [INFO] [unfused_optimizer.py:246:_update_scale] Grad overflow on iteration: 50
[2021-05-04 15:45:14,710] [INFO] [unfused_optimizer.py:247:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-05-04 15:45:14,710] [INFO] [unfused_optimizer.py:247:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-05-04 15:45:14,710] [INFO] [unfused_optimizer.py:171:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 1, reducing to 1
[2021-05-04 15:45:14,710] [INFO] [unfused_optimizer.py:171:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 1, reducing to 1
[2021-05-04 15:45:15,204] [INFO] [unfused_optimizer.py:246:_update_scale] Grad overflow on iteration: 51
[2021-05-04 15:45:15,204] [INFO] [unfused_optimizer.py:247:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-05-04 15:45:15,204] [INFO] [unfused_optimizer.py:246:_update_scale] Grad overflow on iteration: 51
[2021-05-04 15:45:15,204] [INFO] [unfused_optimizer.py:171:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 1, reducing to 1
[2021-05-04 15:45:15,204] [INFO] [unfused_optimizer.py:247:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-05-04 15:45:15,204] [INFO] [unfused_optimizer.py:171:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 1, reducing to 1
[2021-05-04 15:45:15,698] [INFO] [unfused_optimizer.py:246:_update_scale] Grad overflow on iteration: 52
[2021-05-04 15:45:15,698] [INFO] [unfused_optimizer.py:247:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-05-04 15:45:15,699] [INFO] [unfused_optimizer.py:171:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 1, reducing to 1
[2021-05-04 15:45:15,699] [INFO] [unfused_optimizer.py:246:_update_scale] Grad overflow on iteration: 52
[2021-05-04 15:45:15,699] [INFO] [unfused_optimizer.py:247:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-05-04 15:45:15,699] [INFO] [unfused_optimizer.py:171:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 1, reducing to 1
```
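The telling line is "Reducing dynamic loss scale from 1 to 1": the dynamic loss scale has already hit its floor, so every overflowing step is skipped and the weights never update, which is why the loss looks stuck. For reference, DeepSpeed's dynamic loss scaling is controlled by the fp16 block of the training config; the snippet below is a minimal sketch using the standard DeepSpeed option names (the values shown are illustrative, not copied from small.yml):

```yaml
# Sketch of a DeepSpeed fp16 block with dynamic loss scaling enabled.
# Key names follow the DeepSpeed config schema; values are illustrative.
fp16:
  enabled: true            # train in fp16
  loss_scale: 0            # 0 selects dynamic loss scaling
  initial_scale_power: 16  # start the scale at 2**16
  loss_scale_window: 1000  # overflow-free steps before the scale is raised again
  hysteresis: 2            # delay before the scale is actually lowered after overflows
  min_loss_scale: 1        # floor; the log above shows the scale pinned here
```

Since the overflow persists even at the minimum scale of 1, tweaking these values alone is unlikely to fix it; the gradients themselves appear to become non-finite when RPE is combined with sparse attention.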

To Reproduce
Steps to reproduce the behavior:

  1. Use small.yml and sparse.yml
  2. Change the position embedding to "rpe" (see the config sketch after this list)
  3. Train using deepy.py
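For reference, a minimal sketch of the override from step 2 (the exact key spelling varies between gpt-neox versions, so treat it as an assumption rather than a verbatim config line):

```yaml
# Hypothetical override merged on top of small.yml and sparse.yml.
# The position-embedding key is assumed to be "pos-emb"; some versions
# of gpt-neox spell it "pos_emb" instead.
pos-emb: "rpe"
```

Training is then launched through the usual deepy.py wrapper with both config files plus this change; the exact entry-point script and flags depend on the checkout.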

sweinbach · May 04 '21 13:05

Note that this doesn't occur when running with dense attention as shown here

StellaAthena · May 04 '21 14:05