DeepSpeed
Unify DeepSpeed overflow check
Before: Overflow checks are scattered and duplicated in many places.
This PR:
- Single interface as the CheckOverflow class, which abstracts and unifies the overflow check across ZeRO, ZeRO-Offload, Pipeline Parallelism, and BF16_optimizer.
- Skip the step() operation if gradient overflow is detected in BF16_optimizer (avoids polluting checkpoints, etc.).
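
For illustration only, the two points above can be sketched in plain Python. This is not the code in this PR: `OverflowChecker` and `ToyOptimizer` are hypothetical names, and lists of floats stand in for gradient tensors. The idea is one shared check that every optimizer calls, and a step() that leaves parameters untouched when the check fires.

```python
import math
from typing import Iterable, List


class OverflowChecker:
    """Hypothetical shared overflow check (not DeepSpeed's CheckOverflow)."""

    @staticmethod
    def has_overflow(grads: Iterable[Iterable[float]]) -> bool:
        # A gradient counts as overflowed if any element is +inf, -inf, or NaN.
        return any(not math.isfinite(g) for grad in grads for g in grad)


class ToyOptimizer:
    """Hypothetical optimizer that skips step() when gradients overflow."""

    def __init__(self, params: List[float], lr: float = 0.5) -> None:
        self.params = params
        self.lr = lr
        self.skipped_steps = 0

    def step(self, grads: List[float]) -> bool:
        # Returns True if the update was applied, False if it was skipped.
        if OverflowChecker.has_overflow([grads]):
            self.skipped_steps += 1
            return False
        self.params = [p - self.lr * g for p, g in zip(self.params, grads)]
        return True


opt = ToyOptimizer([1.0, 2.0])
opt.step([0.5, float("inf")])  # overflow detected: parameters stay unchanged
opt.step([1.0, 1.0])           # finite gradients: update is applied
```

Because a skipped step never writes to the parameters, a checkpoint taken afterwards would not contain inf or NaN values from that iteration.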
cc @tjruwase
Why not use tensor.isnan() and tensor.isinf()?