loss stuck in overflow for RPE position embedding together with sparse attention
**Describe the bug**
Loss for the RPE position embedding is not going down; every step is skipped because of an fp16 grad overflow:

```
[2021-05-04 15:45:14,710] [INFO] [unfused_optimizer.py:246:_update_scale] Grad overflow on iteration: 50
[2021-05-04 15:45:14,710] [INFO] [unfused_optimizer.py:246:_update_scale] Grad overflow on iteration: 50
[2021-05-04 15:45:14,710] [INFO] [unfused_optimizer.py:247:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-05-04 15:45:14,710] [INFO] [unfused_optimizer.py:247:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-05-04 15:45:14,710] [INFO] [unfused_optimizer.py:171:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 1, reducing to 1
[2021-05-04 15:45:14,710] [INFO] [unfused_optimizer.py:171:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 1, reducing to 1
[2021-05-04 15:45:15,204] [INFO] [unfused_optimizer.py:246:_update_scale] Grad overflow on iteration: 51
[2021-05-04 15:45:15,204] [INFO] [unfused_optimizer.py:247:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-05-04 15:45:15,204] [INFO] [unfused_optimizer.py:246:_update_scale] Grad overflow on iteration: 51
[2021-05-04 15:45:15,204] [INFO] [unfused_optimizer.py:171:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 1, reducing to 1
[2021-05-04 15:45:15,204] [INFO] [unfused_optimizer.py:247:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-05-04 15:45:15,204] [INFO] [unfused_optimizer.py:171:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 1, reducing to 1
[2021-05-04 15:45:15,698] [INFO] [unfused_optimizer.py:246:_update_scale] Grad overflow on iteration: 52
[2021-05-04 15:45:15,698] [INFO] [unfused_optimizer.py:247:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-05-04 15:45:15,699] [INFO] [unfused_optimizer.py:171:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 1, reducing to 1
[2021-05-04 15:45:15,699] [INFO] [unfused_optimizer.py:246:_update_scale] Grad overflow on iteration: 52
[2021-05-04 15:45:15,699] [INFO] [unfused_optimizer.py:247:_update_scale] Reducing dynamic loss scale from 1 to 1
[2021-05-04 15:45:15,699] [INFO] [unfused_optimizer.py:171:step] [deepspeed] fp16 dynamic loss scale overflow! Skipping step. Attempted loss scale: 1, reducing to 1
```
**To Reproduce**
Steps to reproduce the behavior:
- Use small.yml and sparse.yml
- Change the position embedding to "rpe" (see the config sketch below)
- Train using deepy.py
Note that this does not occur when running with dense attention, as shown here.
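For reference, the change in the second step amounts to overriding the position-embedding setting in the YAML config. This is only a sketch of what that override looks like; the exact key name (`pos-emb`) is an assumption based on the general gpt-neox config format and should be checked against small.yml:

```yaml
# Sketch of the override described above (key name assumed; verify against small.yml)
"pos-emb": "rpe"   # replaces the position-embedding setting used in the dense-attention run
```

deepy.py is then launched with both config files, exactly as in the working dense-attention setup.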