
uniform deepspeed overflow check

Open. GuanhuaWang opened this issue 10 months ago • 1 comment

Before: Overflow checks are scattered and duplicated across the codebase.

This PR:

  • Add a single interface, the CheckOverflow class, which abstracts and unifies the overflow check across ZeRO, ZeRO-Offload, Pipeline Parallelism, and BF16_optimizer.
  • Skip the step() operation in BF16_optimizer when a gradient overflow is detected (to avoid polluting checkpoints, etc.).
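
To illustrate the two points above, here is a minimal sketch of a single overflow-check entry point plus a guarded step. It is not the code in this PR: `OverflowChecker`, `has_inf_or_nan`, `check`, and `step_if_finite` are hypothetical names, and plain Python lists of floats stand in for gradient tensors so the example runs without PyTorch.

```python
import math
from typing import Callable, Iterable, List


class OverflowChecker:
    """Hypothetical single entry point for overflow detection (not DeepSpeed's API)."""

    @staticmethod
    def has_inf_or_nan(values: Iterable[float]) -> bool:
        # A gradient counts as overflowed if any element is +/-inf or NaN.
        return any(math.isinf(v) or math.isnan(v) for v in values)

    def check(self, param_groups: Iterable[Iterable[List[float]]]) -> bool:
        # Scan every gradient in every parameter group; any() stops at the first hit.
        return any(
            self.has_inf_or_nan(grad)
            for group in param_groups
            for grad in group
        )


def step_if_finite(
    checker: OverflowChecker,
    param_groups: Iterable[Iterable[List[float]]],
    step_fn: Callable[[], None],
) -> bool:
    """Run step_fn only when no gradient overflowed; return True if the step ran."""
    if checker.check(param_groups):
        return False  # skip the update so non-finite values never reach the weights
    step_fn()
    return True
```

Every optimizer path would call the same `check` method, so the skip-on-overflow behavior only needs to be written once. In a distributed setting the local result would also have to be reduced across ranks before deciding to skip; that part is left out of this sketch.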

cc @tjruwase

GuanhuaWang, Apr 16 '24 22:04

Why not use tensor.isnan() and tensor.isinf()?

Anhelor, Apr 20 '24 07:04