DeepSpeed
[MiCS] [Fix] saving and loading model checkpoint logic for MiCS sharding
- Only the first partition group saves the model checkpoints.
- Avoid calling `dist.barrier` on the WORLD group.
- Add support for loading the partitioned model checkpoint as ZeRO-3.
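To illustrate the first point, here is a minimal sketch of how a rank could decide whether it belongs to the first partition group. The helper names `partition_groups` and `is_checkpoint_saver` are hypothetical and not part of DeepSpeed, and the sketch assumes partition groups are contiguous rank ranges of equal size, which may not match the actual MiCS group layout.

```python
def partition_groups(world_size, shard_size):
    """Split ranks 0..world_size-1 into contiguous groups of shard_size ranks.

    Assumption: groups are contiguous and equally sized (hypothetical layout).
    """
    if shard_size <= 0 or world_size % shard_size != 0:
        raise ValueError("world_size must be a positive multiple of shard_size")
    return [list(range(start, start + shard_size))
            for start in range(0, world_size, shard_size)]


def is_checkpoint_saver(rank, world_size, shard_size):
    """Return True only for ranks in the first partition group."""
    return rank in partition_groups(world_size, shard_size)[0]


if __name__ == "__main__":
    # With 8 ranks and groups of 4, only ranks 0-3 would write checkpoints.
    savers = [r for r in range(8) if is_checkpoint_saver(r, 8, 4)]
    print(savers)
```

Each partition group holds a full copy of the sharded model state, so under this assumption one group's shards are enough to reconstruct the checkpoint.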
Hi there, one of the unit tests failed, but I don't see the corresponding error.
Can someone help retry the nv-torch19-v100 / unit-tests job?
Hi @samadejacobs @tjruwase
would you mind taking a look at this PR? It is an easy fix.
The only change that can affect the current ZeRO-3 logic is in `_create_zero_checkpoint_files` in deepspeed/runtime/engine.py, where `dist.barrier` is now called only on the ranks within `self.optimizer.dp_process_group`.
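Why the barrier has to be scoped to a group can be shown with a standard-library analogy. The sketch below is not DeepSpeed code: it models ranks as threads and uses `threading.Barrier` in place of `dist.barrier`, and the function name `simulate_checkpoint_save` is hypothetical. Only the saving ranks reach the barrier, so the barrier must expect exactly that many participants.

```python
import threading


def simulate_checkpoint_save(world_size, group_ranks, timeout=1.0):
    """Model ranks as threads; only ranks in group_ranks save and synchronize.

    Returns the sorted list of ranks that passed the group-scoped barrier.
    """
    group = set(group_ranks)
    # The barrier is sized to the saving group, not to the whole world.
    group_barrier = threading.Barrier(len(group))
    finished = []
    lock = threading.Lock()

    def run(rank):
        if rank not in group:
            return  # non-saving ranks never reach the barrier
        # A real rank would write its checkpoint shard here.
        group_barrier.wait(timeout=timeout)
        with lock:
            finished.append(rank)

    threads = [threading.Thread(target=run, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(finished)


if __name__ == "__main__":
    print(simulate_checkpoint_save(world_size=8, group_ranks=[0, 1, 2, 3]))
```

If the barrier in this analogy were sized to all 8 ranks, the 4 saving threads would wait for participants that never arrive, which mirrors the hang that a barrier on the WORLD group would cause.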
Hi @tjruwase @jeffra @awan-10, would you mind taking a look at the modification? It is pretty minimal. Thanks!
@zarzen, I am unable to rebase this PR. Did you set some restrictions on your branch?
I think it is due to the permission policy on the dmlc account.
I just rebased onto the master branch. I will take a look at the permission setup.