
fix load_optimizer_states for MoE (#2737)

Open · clumsy opened this issue 2 years ago · 1 comment

When load_optimizer_states=False is passed to load_checkpoint for an MoE model, do not attempt to load the optimizer state files.

This currently fails because DeepSpeed still attempts to load those files, even though they are not used afterwards.
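For reference, a minimal sketch of the call path in question. The engine setup, checkpoint directory, and `model`/`config` names below are placeholders, not the actual test code:

```python
import deepspeed

# Placeholder setup: `model` is assumed to be an MoE model and `config`
# a DeepSpeed config dict, standing in for whatever the caller builds.
engine, _, _, _ = deepspeed.initialize(model=model, config=config)

# The behavior this fix targets: with load_optimizer_states=False,
# DeepSpeed should not even attempt to read the optimizer state files
# from disk, since their contents are discarded anyway.
load_path, client_state = engine.load_checkpoint(
    "checkpoints/moe",            # placeholder checkpoint directory
    load_optimizer_states=False,
    load_lr_scheduler_states=False,
)
```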

This change also adds parameterized unit tests covering the various cases (sketched below).
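A rough sketch of what that parameterization looks like; the parameter names and test body here are illustrative, the real cases live in tests/unit/checkpoint/test_moe_checkpoint.py:

```python
import pytest

# Illustrative shape only, not the actual DeepSpeed test code.
@pytest.mark.parametrize("zero_stage", [0, 1, 2])
@pytest.mark.parametrize("load_optimizer_states", [True, False])
def test_checkpoint_moe_and_zero(zero_stage, load_optimizer_states, tmp_path):
    # 1. Build an MoE model and wrap it with deepspeed.initialize(...).
    # 2. Run a training step and save_checkpoint(tmp_path).
    # 3. Reload into a fresh engine, passing
    #    load_optimizer_states=load_optimizer_states.
    # 4. Assert model parameters (and, when loaded, optimizer state) match.
    ...
```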

Verified via: pytest tests/unit/checkpoint/test_moe_checkpoint.py -k 'test_checkpoint_moe_and_zero'

= 6 passed, 1 deselected, 102 warnings in 156.67s (0:02:36) =

clumsy · Jan 23 '23 17:01

Hi @tjruwase, I almost got this to work, but for some reason when I suppress loading optimizer states for Stage 3, the tensor-correctness check fails for model parameters in the unit test. Do you have an idea why? Is there something besides optimizer states in the Stage 3 ZeRO optimizer state dictionary, or does loading the Stage 3 optimizer have a side effect on model parameters? It's surprising that stages 0, 1, and 2 work just fine.
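My current guess, which I haven't verified: in ZeRO Stage 3 the fp32 master copy of each parameter is saved alongside the optimizer state, and the fp16 module parameters get repopulated from that master copy on load, so skipping the optimizer load would leave them stale. A toy sketch of the suspected coupling (names and shapes here are illustrative, not DeepSpeed's actual internals):

```python
import torch

def restore_params_from_master(fp32_flat: torch.Tensor,
                               fp16_params: list[torch.Tensor]) -> None:
    """Toy illustration of the suspected Stage 3 coupling: module (fp16)
    parameters are repopulated from the fp32 master buffer that is stored
    with the optimizer state."""
    offset = 0
    for p in fp16_params:
        n = p.numel()
        p.data.copy_(fp32_flat[offset:offset + n].view_as(p).to(p.dtype))
        offset += n

# If load_optimizer_states=False skips the step that performs something
# like this, the fp16 params keep whatever values they had before the
# load, which would explain the Stage 3 tensor-correctness failure.
```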

clumsy · May 04 '23 21:05