
[BUG] Checkpoint saving is slow for zarr backend + distributed optimizer

Open chotzen opened this issue 1 year ago • 4 comments

Describe the bug The distributed optimizer state is saved in an inefficient way when zarr is used as the backend. This causes slowdowns like the following when writing a checkpoint to a local SSD (everything else held constant for a 12-layer x 128-head_dim x 12-head llama-style transformer):

  • Float16OptimizerWithFloat16Params: 10 seconds
  • MixedPrecisionOptimizer: 64 seconds

After profiling the checkpoint saving workload, it looks like this is what happens for each parameter being saved in the optimizer state:

  • the entire optimizer state is being fetched into memory (the getitem part)
  • the full optimizer state is modified in a small region corresponding to that parameter
  • the full optimizer state is saved

This full process takes 450 ms and is repeated many times, once per parameter in the distributed optimizer.
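For illustration, here is a minimal standalone sketch (not Megatron code; the path, shapes, and chunking are made up) of the read-modify-write pattern described above, next to the partial write that zarr supports natively:

    import numpy as np
    import zarr

    # A zarr array standing in for one rank's flattened optimizer state
    # (hypothetical path, shape, and chunking).
    state = zarr.open("opt_state.zarr", mode="a", shape=(10_000_000,),
                      chunks=(1_000_000,), dtype="float32")

    param_slice = slice(2_000_000, 2_050_000)  # region owned by one parameter
    new_values = np.random.rand(50_000).astype("float32")

    # Pattern observed in profiling: fetch the whole array, patch a small
    # region, write everything back; O(full state) work per parameter.
    full = state[:]                  # "getitem" of the entire optimizer state
    full[param_slice] = new_values   # modify a small region
    state[:] = full                  # rewrite the full state

    # Cheaper alternative: write only the region, so zarr touches just the
    # chunks that overlap it.
    state[param_slice] = new_values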

To Reproduce

Spawn a GPTModel (with e.g. 12 layers, 128 head dim, 12 heads) on 2 x 2 x 2 pipeline x tensor x data partitions, then try to save the distributed optimizer as a checkpoint with the zarr fully_sharded_bucket_space backend.
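For reference, a hedged sketch of the save path under these conditions; the exact Megatron-LM entry points, keyword names, and defaults may differ between versions, and model/optimizer construction is omitted:

    # Assumes `model` (a GPTModel) and `optimizer` (a DistributedOptimizer)
    # already exist on a 2 x 2 x 2 pipeline x tensor x data layout; names below
    # follow megatron.core.dist_checkpointing but may not match this exact commit.
    from megatron.core import dist_checkpointing

    sharded_state_dict = model.sharded_state_dict()
    sharded_state_dict['optimizer'] = optimizer.sharded_state_dict(
        sharded_state_dict, sharding_type='fully_sharded_bucket_space')

    # ('zarr', 1) selects the zarr save strategy that shows the slowdown;
    # ('torch_dist', 1) is the alternative suggested later in this thread.
    dist_checkpointing.save(sharded_state_dict, '/tmp/ckpt',
                            sharded_strategy=('zarr', 1))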

Expected behavior Saving the distributed optimizer is nearly as fast as saving the non-distributed optimizer, and the difference does not grow faster than the number of model parameters.

Environment (please complete the following information):

  • Megatron-LM commit ID 299f96ffe61a4bae9044a2082570b19b94d13335
  • PyTorch version 2.2.2
  • CUDA version 12.1.105
  • NCCL version 2.20.5

Proposed fix N/A

Additional context N/A

chotzen avatar May 22 '24 23:05 chotzen

Have you tried the torch_dist (https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/training/arguments.py#L1252) distributed checkpoint format?

deepakn94 avatar May 23 '24 00:05 deepakn94

Yes, I've tried it; we ran into another issue with saving the optimizer step when dp_partitions >= 2. I'll file a separate bug for that when I have a chance to reproduce it.

chotzen avatar May 23 '24 18:05 chotzen

Hi @deepakn94, which kinds of checkpoint resharding are meant to be supported for the torch_dist backend? I'm unable to load a (D, P, T) = (2, 2, 2) checkpoint into a (2, 1, 2) partitioning scheme with any combination of the (torch_dist, 1) strategy, with or without sharding_type="dp_zero_gather_scatter".

chotzen avatar May 23 '24 22:05 chotzen

Hi @chotzen, please use the recommended torch_dist backend, especially for the DistributedOptimizer: zarr backend saving is very slow for DistOpt-like sharding types.

I'm unable to load a (D, P, T) = (2, 2, 2) checkpoint into a (2, 1, 2) partitioning scheme

Changing TP x PP is not supported with DistOpt yet (only DP resharding for now); it will be supported in the near future (target: MCore v0.8).
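A minimal sketch of the currently supported case (TP and PP unchanged, DP size different); the megatron.core entry points and keyword names here are assumptions and may differ between MCore versions:

    # Hedged sketch of DP-only resharding at load time.
    from megatron.core import dist_checkpointing

    sharded_state_dict = model.sharded_state_dict()
    sharded_state_dict['optimizer'] = optimizer.sharded_state_dict(
        sharded_state_dict, is_loading=True,
        sharding_type='dp_zero_gather_scatter')
    state_dict = dist_checkpointing.load(sharded_state_dict, '/tmp/ckpt')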

mikolajblaz avatar May 24 '24 12:05 mikolajblaz

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Jul 23 '24 18:07 github-actions[bot]