[MiCS] [Fix] saving and loading model checkpoint logic for MiCS sharding

Open zarzen opened this issue 1 year ago • 2 comments

  • Only the first partition group saves the model checkpoints.
  • Avoid calling dist.barrier on the WORLD group.
  • Add support for loading the partitioned model checkpoint as ZeRO-3.
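
A minimal sketch of the "only the first partition group saves" rule, for illustration only. It assumes that partition groups are formed from contiguous ranks and that each group holds one complete set of shards; the helper names `partition_group_index` and `should_save_checkpoint` are hypothetical and are not taken from the DeepSpeed code in this PR.

```python
def partition_group_index(rank: int, partition_size: int) -> int:
    """Return the index of the partition group containing `rank`.

    Assumption: groups are contiguous blocks of `partition_size` ranks,
    e.g. with partition_size=4, ranks 0-3 form group 0 and ranks 4-7 form group 1.
    """
    if partition_size <= 0:
        raise ValueError("partition_size must be positive")
    if rank < 0:
        raise ValueError("rank must be non-negative")
    return rank // partition_size


def should_save_checkpoint(rank: int, partition_size: int) -> bool:
    """Return True only for ranks in the first partition group.

    Assumption: every group holds the same shards, so one group's
    files are enough to restore the model.
    """
    return partition_group_index(rank, partition_size) == 0


if __name__ == "__main__":
    world_size, partition_size = 8, 4
    savers = [r for r in range(world_size) if should_save_checkpoint(r, partition_size)]
    print("ranks that write checkpoint files:", savers)
```

Because only a subset of ranks takes the save path in a scheme like this, any synchronization after saving has to be scoped to a process group that all participating ranks belong to (for example by passing `group=...` to `torch.distributed.barrier`) rather than to the WORLD group.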

zarzen avatar May 04 '23 06:05 zarzen

Hi there, one of the unit tests failed, but I didn't see the corresponding error. Can someone help retry the nv-torch19-v100 / unit-tests job?

zarzen avatar May 05 '23 04:05 zarzen

Hi @samadejacobs @tjruwase, would you mind taking a look at this PR? It is an easy fix. The only change that can affect the current ZeRO-3 logic is in deepspeed/runtime/engine.py: _create_zero_checkpoint_files, where dist.barrier is now called on the ranks within self.optimizer.dp_process_group.

zarzen avatar May 08 '23 18:05 zarzen

Hi @tjruwase @jeffra @awan-10, would you mind taking a look at the modification? It is pretty minimal. Thanks!

zarzen avatar May 30 '23 17:05 zarzen

@zarzen, I am unable to rebase this PR. Did you set some restrictions on your branch?

tjruwase avatar Jun 02 '23 19:06 tjruwase

I think it is due to the permission policy on the dmlc account. I just rebased onto the master branch. I will take a look at the permission setup.

zarzen avatar Jun 02 '23 21:06 zarzen