
[BUG] Does bf16 support Zero stage 1 with pipeline?

Open · zui-jiang opened this issue 1 year ago · 1 comment

Describe the bug
I'm using Megatron-DeepSpeed with pipeline parallelism, and setting

"bf16": {
    "enabled": "auto"
  }

triggers a NotImplementedError in

# /opt/conda/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py
def _exec_reduce_grads(self):
    self._force_grad_boundary = True
    if self.pipeline_enable_backward_allreduce:
        if self.bfloat16_enabled():
            if self.zero_optimization_stage() == 0:
                self._bf16_reduce_grads()
            else:
                assert self.zero_optimization_stage() == 1, "only bf16 + z1 are supported"
                raise NotImplementedError()
        else:
            self.allreduce_gradients(bucket_size=MEMORY_OPT_ALLREDUCE_SIZE)
    self._force_grad_boundary = False
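For reference, a minimal sketch that reaches this branch. The layer sizes, batch sizes, and config values below are illustrative assumptions, not taken from my actual script:

# Minimal repro sketch (illustrative values); run under the deepspeed
# launcher, e.g. `deepspeed --num_gpus 2 repro.py`.
import torch
import deepspeed
from deepspeed.pipe import PipelineModule

deepspeed.init_distributed()

# Wrapping the layers in a PipelineModule is what selects PipelineEngine
# and therefore the _exec_reduce_grads path quoted above.
layers = [torch.nn.Linear(128, 128) for _ in range(4)]
model = PipelineModule(layers=layers, num_stages=2)

ds_config = {
    "train_batch_size": 8,
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "bf16": {"enabled": True},          # bfloat16_enabled() -> True
    "zero_optimization": {"stage": 1},  # zero_optimization_stage() -> 1
}

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=[p for p in model.parameters() if p.requires_grad],
    config=ds_config,
)

# engine.train_batch(data_iter) then reaches _exec_reduce_grads() at the
# gradient boundary and lands on raise NotImplementedError().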

But when using Transformers-integrated DeepSpeed with ZeRO stage 1/2/3, it works fine. The only difference I found is that in Megatron-DeepSpeed the model is a subclass of PipelineModule, whereas in Transformers it is not.

I wonder whether DeepSpeed currently supports pipeline parallelism with bf16 and ZeRO stage 1, or whether this is a mistake in my code.
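(Judging purely from the branch quoted above, bf16 with ZeRO stage 0 takes the _bf16_reduce_grads() path instead of raising, so a config like the following might avoid the error in the meantime. This is an untested assumption on my side, not a confirmed workaround:)

"bf16": {
    "enabled": true
  },
"zero_optimization": {
    "stage": 0
  }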

zui-jiang · Mar 10, 2023, 07:03

@lyj201002, thanks for reporting this bug. We are working on a fix.

tjruwase · Mar 10, 2023, 08:03