
[BUG] Checkpoint saving is slow for zarr backend + distributed optimizer

Open chotzen opened this issue 1 year ago • 4 comments

Describe the bug The distributed optimizer state is saved in an inefficient way when zarr is used as the backend. This causes slowdowns like the following when writing a checkpoint to a local SSD (everything else held constant for a 12-layer x 128-head_dim x 12-head llama-style transformer):

  • Float16OptimizerWithFloat16Params: 10 seconds
  • MixedPrecisionOptimizer: 64 seconds

After profiling the checkpoint saving workload, it looks like this is what happens for each parameter being saved in the optimizer state:

  • the entire optimizer state is being fetched into memory (the getitem part)
  • the full optimizer state is modified in a small region corresponding to that parameter
  • the full optimizer state is saved

This full process takes 450 ms and is repeated many times, once per parameter in the distributed optimizer.
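For illustration, here is a minimal standalone sketch (not Megatron code; the path, shapes, and chunking are made up) of the read-modify-write pattern described above, next to the partial write that zarr supports natively:

    import numpy as np
    import zarr

    # A zarr array standing in for one rank's flattened optimizer state
    # (hypothetical path, shape, and chunking).
    state = zarr.open("opt_state.zarr", mode="a", shape=(10_000_000,),
                      chunks=(1_000_000,), dtype="float32")

    param_slice = slice(2_000_000, 2_050_000)  # region owned by one parameter
    new_values = np.random.rand(50_000).astype("float32")

    # Pattern observed in profiling: fetch the whole array, patch a small
    # region, write everything back; O(full state) work per parameter.
    full = state[:]                  # "getitem" of the entire optimizer state
    full[param_slice] = new_values   # modify a small region
    state[:] = full                  # rewrite the full state

    # Cheaper alternative: write only the region, so zarr touches just the
    # chunks that overlap it.
    state[param_slice] = new_values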

To Reproduce

Spawn a GPTModel (with e.g. 12 layers, 128 head dim, 12 heads) on 2 x 2 x 2 pipeline x tensor x data partitions, then try to save the distributed optimizer as a checkpoint with the zarr fully_sharded_bucket_space backend.
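For reference, a hedged sketch of the save path under these conditions; the exact Megatron-LM entry points, keyword names, and defaults may differ between versions, and model/optimizer construction is omitted:

    # Assumes `model` (a GPTModel) and `optimizer` (a DistributedOptimizer)
    # already exist on a 2 x 2 x 2 pipeline x tensor x data layout; names below
    # follow megatron.core.dist_checkpointing but may not match this exact commit.
    from megatron.core import dist_checkpointing

    sharded_state_dict = model.sharded_state_dict()
    sharded_state_dict['optimizer'] = optimizer.sharded_state_dict(
        sharded_state_dict, sharding_type='fully_sharded_bucket_space')

    # ('zarr', 1) selects the zarr save strategy that shows the slowdown;
    # ('torch_dist', 1) is the alternative suggested later in this thread.
    dist_checkpointing.save(sharded_state_dict, '/tmp/ckpt',
                            sharded_strategy=('zarr', 1))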

Expected behavior Saving the distributed optimizer is nearly as fast as saving the non-distributed optimizer, and the difference does not grow faster than the number of model parameters.

Environment (please complete the following information):

  • Megatron-LM commit ID 299f96ffe61a4bae9044a2082570b19b94d13335
  • PyTorch version 2.2.2
  • CUDA version 12.1.105
  • NCCL version 2.20.5

Proposed fix N/A

Additional context N/A

chotzen avatar May 22 '24 23:05 chotzen

Have you tried the torch_dist (https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/training/arguments.py#L1252) distributed checkpoint format?

deepakn94 avatar May 23 '24 00:05 deepakn94

Yes, I've tried it; we ran into another issue with saving the optimizer step when dp_partitions >= 2. I'll file a separate bug for that when I have a chance to reproduce it.

chotzen avatar May 23 '24 18:05 chotzen

Hi @deepakn94, which kinds of checkpoint resharding are meant to be supported for the torch_dist backend? I'm unable to load a (D, P, T) = (2, 2, 2) checkpoint into a (2, 1, 2) partitioning scheme with any combination of the (torch_dist, 1) strategy, with or without sharding_type="dp_zero_gather_scatter".

chotzen avatar May 23 '24 22:05 chotzen

Hi @chotzen, please use the recommended torch_dist backend, especially for the DistributedOptimizer: zarr backend saving is very slow for DistOpt-like sharding types.

I'm unable to load a (D, P, T) = (2, 2, 2) checkpoint into a (2, 1, 2) partitioning scheme

Changing TP x PP is not supported with DistOpt yet (only DP resharding for now); it will be supported in the near future (target: MCore v0.8).
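A minimal sketch of the currently supported case (TP and PP unchanged, DP size different); the megatron.core entry points and keyword names here are assumptions and may differ between MCore versions:

    # Hedged sketch of DP-only resharding at load time.
    from megatron.core import dist_checkpointing

    sharded_state_dict = model.sharded_state_dict()
    sharded_state_dict['optimizer'] = optimizer.sharded_state_dict(
        sharded_state_dict, is_loading=True,
        sharding_type='dp_zero_gather_scatter')
    state_dict = dist_checkpointing.load(sharded_state_dict, '/tmp/ckpt')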

mikolajblaz avatar May 24 '24 12:05 mikolajblaz

Marking as stale. No activity in 60 days.

github-actions[bot] avatar Jul 23 '24 18:07 github-actions[bot]