
Fix `PipelineEngine.eval_batch` result

Open • nrailg opened this issue 1 year ago • 1 comment

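With fp16 enabled, `PipelineEngine.eval_batch` does not broadcast the loss correctly. On the last stage, `eval_batch` returns the f16 loss, while on every other stage it returns noise. The cause is in `_bcast_pipe_scalar`, where the source rank and the receiving ranks use tensors of different dtypes: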

def _bcast_pipe_scalar(self, data, src_rank=None, dtype=torch.float32):
    # Default to last stage (e.g., for broadcasting loss)
    if src_rank is None:
        src_rank = self.grid.stage_to_global(self.num_stages - 1)
    assert src_rank in self.grid.pp_group

    if self.global_rank == src_rank:
        result = data.clone().detach() # f16 tensor
    else:
        result = torch.Tensor([0.]).type(dtype).to(self.device) # f32 tensor

    # This broadcasts an f16 tensor into f32 tensors, so the receiving stages end up with noise.
    dist.broadcast(tensor=result, src=src_rank, group=self.mpu.get_pipe_parallel_group())

    return result
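
To illustrate why a dtype mismatch can turn a valid loss into noise, here is a minimal sketch using only the Python standard library. It is not DeepSpeed code: `receive_as_f32` is a hypothetical helper, and it assumes the receiver simply reinterprets whatever bytes arrive as a little-endian float32. The real behaviour of a collective with mismatched dtypes is undefined and may differ.

```python
import struct


def receive_as_f32(sent_bytes: bytes) -> float:
    """Hypothetical helper: read incoming bytes as one little-endian float32.

    Models a receiver whose buffer is a single zero-initialised float32,
    like the torch.Tensor([0.]) created on the non-source ranks.
    """
    buf = bytearray(4)                      # receiver's buffer: 4 zero bytes
    buf[:len(sent_bytes)] = sent_bytes[:4]  # payload overwrites the front of the buffer
    return struct.unpack("<f", bytes(buf))[0]


loss = 2.5  # exactly representable in both float16 and float32

# Mismatched: the sender ships float16 bytes, the receiver reads float32.
mismatched = receive_as_f32(struct.pack("<e", loss))

# Matched: the sender converts to float32 first, so both sides agree on the layout.
matched = receive_as_f32(struct.pack("<f", loss))

print("mismatched:", mismatched)
print("matched:   ", matched)
```

The matched case round-trips the value, which suggests the direction of a fix: make the tensor on the source rank use the same `dtype` as the receiving ranks before calling `dist.broadcast`. How this should be done inside `_bcast_pipe_scalar` is left open here.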

Environment:

  • torch 1.13.1
  • cuda 11.7
  • GPU A100 40GB + driver 450.80.02

nrailg • Apr 20 '23 08:04

We have been working on LM recently, and encountered this problem. I am trying to fix it. @ShadenSmith @duli2012

nrailg • Apr 21 '23 07:04