[BUG] zero3 memory leak on return from training loop

Open stas00 opened this issue 2 years ago • 4 comments

Describe the bug

Hmm, I thought I fixed the leak here https://github.com/microsoft/DeepSpeed/pull/2665 but there is one more of a similar nature, where the model gets gathered, but not ungathered when the weights are no longer needed.

When entering the training loop the weights are sharded. But on exit they are no longer sharded, so the memory isn't getting freed. Say you want to do something else with the GPU afterwards - e.g. additional eval - it will most likely OOM.

To Reproduce

$ cat test.py
import deepspeed
from deepspeed.ops.op_builder import CPUAdamBuilder

from unit.common import DistributedTest, DistributedFixture
from unit.simple_model import *
from unit.util import required_minimum_torch_version

from unit.checkpoint.common import *

import pytest

class TestZeROCheckpointNew(DistributedTest):
    world_size = 1

    def test_1_load_module_only_new(self):
        tmpdir = 'zero_stage'
        zero_stage = 3
        config_dict = {
            "train_batch_size": 2,
            "optimizer": {
                "type": 'Adam'
            },
            "fp16": {
                "enabled": True,
                "initial_scale_power": 8
            },
            "zero_optimization": {
                "stage": zero_stage,
            }
        }
        hidden_dim = 10

        with deepspeed.zero.Init(config_dict_or_path=config_dict):
            models = [SimpleModel(hidden_dim, empty_grad=False) for _ in range(2)]

        print("CREATE", models[0].linears[0].weight[:5])

        dtype = torch.half
        ds_model = create_deepspeed_model(config_dict=config_dict,
                                          model=models[0],
                                          base_optimizer=None)

        data_loader = random_dataloader(model=ds_model,
                                        total_samples=5,
                                        hidden_dim=hidden_dim,
                                        device=ds_model.device,
                                        dtype=dtype)

        print("ORIGIN 1", models[0].linears[0].weight)
        for _, batch in enumerate(data_loader):
            loss = ds_model(batch[0], batch[1])
            ds_model.backward(loss)
            ds_model.step()

        print("ORIGIN 2", models[0].linears[0].weight[0,:5])

$ PYTHONPATH=. pytest --torch_ver 1.13.1 --cuda_ver 11.7 test.py -sv

gives:

ORIGIN 1 Parameter containing:
tensor([], device='cuda:0', dtype=torch.float16, requires_grad=True)
ORIGIN 2 tensor([ 0.0699, -0.0974, -0.1665, -0.1251, -0.1195], device='cuda:0',
       dtype=torch.float16, grad_fn=<SliceBackward0>)

You can see that the model wasn't re-partitioned after the training loop, so its memory is leaked for the remainder of the program.

It should be tensor([], device='cuda:0', ...) after step maybe?

As for what the solution should be, I think step should call partition internally, similar to the fix in https://github.com/microsoft/DeepSpeed/pull/2665 - it scatters the updated weights anyway.
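
In the meantime, a crude client-side workaround seems to be a no-op GatheredParameters context right after the training loop, since it re-partitions the params on exit (just a sketch, reusing the names from the repro above):

    # force re-partitioning: GatheredParameters gathers on enter and
    # partitions on exit, so an empty body releases the cached full weights
    with deepspeed.zero.GatheredParameters(list(ds_model.module.parameters(recurse=True)),
                                           modifier_rank=0):
        pass

    # expect an empty placeholder again, e.g. tensor([], device='cuda:0', ...)
    print("AFTER RELEASE", models[0].linears[0].weight)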

@tjruwase

stas00 avatar Mar 15 '23 04:03 stas00

@stas00, @satpalsr this issue is a bit more nuanced. The intended behavior for training is to cache parameters in GPU memory in anticipation of the next forward pass. This is why we don't explicitly release the parameters after step. The cache size and behavior are controlled by configuration knobs like stage3_max_live_parameters, stage3_prefetch_bucket_size, and stage3_max_reuse_distance. In contrast, GatheredParameters is designed for one-time computation with these parameters and so implicitly releases them afterwards.
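
For reference, these knobs live under zero_optimization in the config; a sketch with purely illustrative values (not recommendations):

    config_dict = {
        "zero_optimization": {
            "stage": 3,
            # upper bound on the number of unsharded parameter elements kept resident
            "stage3_max_live_parameters": 1e9,
            # how many parameter elements to prefetch ahead of the forward pass
            "stage3_prefetch_bucket_size": 5e8,
            # keep a gathered param around if it will be reused within this distance
            "stage3_max_reuse_distance": 1e9,
        }
    }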

@stas00, do you have a specific need for this behavior? Perhaps the solution is to provide an API for the client to explicitly flush the parameter cache.

@satpalsr, apologies for not responding before your PR.

tjruwase avatar Mar 15 '23 21:03 tjruwase

@stas00, do you have a specific need for this behavior? Perhaps the solution is to provide an API for the client to explicitly flush the parameter cache.

Yes, I'm thinking of the case where the model needs to do something on the GPU after training has finished - it's now stuck with no free memory.

I think such an API would work well.
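
To illustrate, the usage pattern I have in mind (a sketch only - empty_partition_cache below is a hypothetical name for whatever the API ends up being called, and run_additional_eval is just a placeholder for post-training GPU work):

    for _, batch in enumerate(data_loader):
        loss = ds_model(batch[0], batch[1])
        ds_model.backward(loss)
        ds_model.step()

    # hypothetical API: release the cached (unsharded) params so the GPU
    # memory is free for whatever comes next, e.g. a separate eval pipeline
    ds_model.empty_partition_cache()
    run_additional_eval(ds_model)  # placeholder for post-training GPU work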

stas00 avatar Mar 15 '23 21:03 stas00

But on exit they are no longer sharded, so the memory isn't getting freed.

Sorry for the naive question - I was looking at this post/video, and after the optimizer step I see the fp16 weights being sharded across GPUs. Can we not cache parameters while they are sharded?

satpalsr avatar Mar 16 '23 04:03 satpalsr

@satpalsr, by caching the parameters I mean not sharding them, so that they are available to use immediately without an allgather. Hope that helps.

tjruwase avatar Mar 16 '23 04:03 tjruwase

@stas00 and @satpalsr, please can you see if #3060 meets your requirements?

tjruwase avatar Mar 20 '23 20:03 tjruwase

Thank you, Tunji. That works for me. Let's merge it.

So one problem this caching leads to is that the checkpoint save/load tests might be broken: with such small test models the weights are not partitioned at the point of saving, which makes the test logic invalid. I'm trying to analyze those now and add asserts.

stas00 avatar Mar 21 '23 19:03 stas00

@stas00, can you share more details about this checkpoint save/load breakage, as I don't expect such failures? For context, we enforce conditions to ensure correct checkpoint load and save.

tjruwase avatar Mar 22 '23 10:03 tjruwase

I'm still trying to sort it out. But the tests aren't testing the right thing in some situations.

When running this test:

tests/unit/checkpoint/test_zero_optimizer.py -k test_load_module_only[3]

If you add a debug print

    for n,p in trained_model.module.named_parameters():
        print(f"trained unpartitioned {n} {p}")

before this line:

https://github.com/microsoft/DeepSpeed/blob/1286e374aba8b4490a7f57f18b3a479753985277/tests/unit/checkpoint/common.py#L182

You will see that the trained model is not partitioned under zero3. So it saves the checkpoint with the real weights, and thus we don't know whether it can actually recover them from the optimizer states.

What we want to test here is that a partitioned model is saved and that it can recover its "normal" weights on load_checkpoint.

I forced partitioning with:

    # XXX: crude workaround: forces partitioning the model
    if 1:
        with deepspeed.zero.GatheredParameters(list(
                ds_model.module.parameters(recurse=True)),
                                               modifier_rank=0):
            pass

and it still recovered the weights correctly. I think the test needs to be fixed to ensure that whenever zero3 is tested no assumption is made about the model weights being partitioned, since at the small scale of these tests most of the time they aren't - in particular, they aren't partitioned at the end of the training loop.

In the real world this won't be the case. A few params will be cached after training, but most params will be partitioned. And we need to test that such a model can resume on load_checkpoint.

Moreover, unless I'm mistaken, this test:

https://github.com/microsoft/DeepSpeed/blob/1286e374aba8b4490a7f57f18b3a479753985277/tests/unit/checkpoint/common.py#L32-L44

will happily succeed if it compares two 0-sized tensors. I think it should assert if it sees placeholder tensors.
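
Something along these lines would catch it (a sketch - saved_model and loaded_model are stand-ins for whatever the helper actually compares):

    for p0, p1 in zip(saved_model.parameters(), loaded_model.parameters()):
        # zero3 placeholder tensors have numel() == 0 and would compare as
        # trivially "equal" - refuse to compare them
        assert p0.numel() > 0, f"got a placeholder tensor, gather params before comparing: {p0}"
        assert p1.numel() > 0, f"got a placeholder tensor, gather params before comparing: {p1}"
        assert torch.allclose(p0.float(), p1.float())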

stas00 avatar Mar 22 '23 19:03 stas00