[BUG] nested `zero.Init` in real models leads to an infinite recursion
Describe the bug
Repro
Nested zero.Init leads to an infinite recursion:
import torch
import deepspeed
ds_config = dict(train_batch_size=1, zero_optimization=dict(stage=3))
class MyModel(torch.nn.Module):
    def __init__(self, m1):
        super().__init__()
        self.m1 = m1

with deepspeed.zero.Init(config_dict_or_path=ds_config):
    with deepspeed.zero.Init(config_dict_or_path=ds_config):
        m1 = torch.nn.Linear(1,1)
    model = MyModel(m1)
deepspeed_engine, *_ = deepspeed.initialize(model=model, config_params=ds_config)
$ deepspeed --num_gpus 1 test2.py
Traceback (most recent call last):
File "test2.py", line 18, in <module>
deepspeed_engine, *_ = deepspeed.initialize(model=model, config_params=ds_config)
File "/mnt/nvme0/code/github/00optimize/DeepSpeed-optim-grad-accessors/deepspeed/__init__.py", line 125, in initialize
engine = DeepSpeedEngine(args=args,
File "/mnt/nvme0/code/github/00optimize/DeepSpeed-optim-grad-accessors/deepspeed/runtime/zero/partition_parameters.py", line 352, in wrapper
if not hasattr(module, "_ds_child_entered"):
File "/mnt/nvme0/code/github/00optimize/DeepSpeed-optim-grad-accessors/deepspeed/runtime/engine.py", line 490, in __getattr__
if name in dir(self):
File "/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2026, in __dir__
parameters = list(self._parameters.keys())
[...]
File "/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2026, in __dir__
parameters = list(self._parameters.keys())
File "/mnt/nvme0/code/github/00optimize/DeepSpeed-optim-grad-accessors/deepspeed/runtime/engine.py", line 490, in __getattr__
if name in dir(self):
File "/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2024, in __dir__
module_attrs = dir(self.__class__)
RecursionError: maximum recursion depth exceeded while calling a Python object
Now, who would write code like that, right?
Real examples
But modern models are becoming more complicated, and many CV and multi-modal models are instantiated from multiple other pretrained models.
For example, in this user report https://github.com/huggingface/transformers/issues/21538 the user tries to use a DONUT model, which under the hood is a VisionEncoderDecoderModel.
Here is the reduced version of their code that breaks:
from transformers import DonutProcessor, VisionEncoderDecoderModel
import torch
import deepspeed
from transformers.deepspeed import HfDeepSpeedConfig
ds_config = dict(train_batch_size=1, zero_optimization=dict(stage=3))
dschf = HfDeepSpeedConfig(ds_config) # keep this object alive
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
deepspeed_engine, *_ = deepspeed.initialize(model=model, config_params=ds_config)
As you can see, there is nothing wrong with this code.
When running this script, things go boom (infinite recursion), exactly as in the simple repro script at the beginning of this post.
What happens inside of it is this:
VisionEncoderDecoderModel.from_pretrained(...)
which calls this constructor inside zero.Init:
class VisionEncoderDecoderModel(...):
    def __init__(self):
        encoder = AutoModel.from_config(config.encoder)
        decoder = AutoModelForCausalLM.from_config(config.decoder)
where from_config calls zero.Init too! Nesting!
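For reference, here is a rough sketch (paraphrased, not the exact transformers source) of the pattern that produces the nesting: when ZeRO-3 is enabled, both from_pretrained and from_config build the model under zero.Init, so the encoder/decoder sub-model constructions open their own context inside the outer one. construct_under_zero_init is a hypothetical helper name.

import deepspeed
from transformers.deepspeed import is_deepspeed_zero3_enabled, deepspeed_config

def construct_under_zero_init(model_cls, config):
    # hypothetical helper standing in for what from_pretrained()/from_config() do
    if is_deepspeed_zero3_enabled():
        # outer call: VisionEncoderDecoderModel.__init__; inner calls:
        # AutoModel.from_config / AutoModelForCausalLM.from_config, each of which
        # opens a second zero.Init inside the first one
        with deepspeed.zero.Init(config_dict_or_path=deepspeed_config()):
            return model_cls(config)
    return model_cls(config)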
Here are 2 more reports of the same:
- https://github.com/microsoft/DeepSpeedExamples/issues/84
- https://github.com/huggingface/transformers/issues/21326
Fixing?
Now, can we fix this in deepspeed, or is the only way for the integrators to add some kind of tracking of whether these calls are getting stacked, and to ensure that there is only one zero.Init call? For example, if I change the repro to:
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    m1 = torch.nn.Linear(1,1)
    model = MyModel(m1)
deepspeed_engine, *_ = deepspeed.initialize(model=model, config_params=ds_config)
it no longer recurses infinitely. But I'm not certain it actually works later.
I'm pretty sure I saw this:
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    m1 = torch.nn.Linear(1,1)
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    m2 = torch.nn.Linear(1,1)
passing deepspeed.initialize OK, but then failing to work at train time - I need to write more tests to explore that use case.
So for now let's focus on just one thing - nested zero.Init.
The tricky part is that the user can't always guarantee they can control the various contexts, so ideally we want to make zero.Init impervious to nesting issues.
Now, even if I fix this problem manually and remove the nested zero.Init calls, I then immediately run into problem number 2 with this exact setup, and that problem is dynamic module importing inside the zero.Init context.
I broke the problem down here: https://github.com/microsoft/DeepSpeed/issues/2812
Thank you!
@tjruwase, @samyam
Hi @stas00, thank you for your report.
I was able to reproduce the error with the nested zero.Init and investigated what's happening.
deepspeed.zero.Init adds a wrapper to the constructor of each subclass of torch.nn.Module in order to partition parameters.
The wrapper is removed when exiting the with block of deepspeed.zero.Init.
When deepspeed.zero.Init is nested, however, the removal of the wrapper does not work properly, and the wrapper ends up being invoked for all constructor calls of subclasses of torch.nn.Module, even for DeepSpeedEngine (which is itself a subclass of torch.nn.Module). This causes the endless recursive calls.
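For illustration, here is a tiny standalone example (my own, not DeepSpeed code) of the same recursion pattern that the traceback shows: __getattr__ is only invoked for missing attributes, so when it consults dir(self) and __dir__ in turn touches another missing attribute, the two keep calling each other until the recursion limit is hit.

class Recurses:
    def __getattr__(self, name):
        # called only for attributes that normal lookup cannot find
        if name in dir(self):   # dir() calls __dir__
            return None
        raise AttributeError(name)

    def __dir__(self):
        # touches an attribute that doesn't exist yet -> __getattr__ again
        return list(self._missing)

try:
    Recurses().anything
except RecursionError as e:
    print("RecursionError:", e)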
The straightforward way is to track the reentrant calls and disable them after the first call. I will try to fix this with that approach, but let me think about it more after I investigate #2812.
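For what it's worth, a minimal sketch of that approach (hypothetical, not DeepSpeed's actual code): a class-level nesting counter so that only the outermost enter/exit performs the patching/unpatching.

class ReentrantInit:
    _nest_level = 0  # shared across all instances in the process

    def __enter__(self):
        if ReentrantInit._nest_level == 0:
            print("patching nn.Module constructors")    # placeholder for the real work
        ReentrantInit._nest_level += 1
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        ReentrantInit._nest_level -= 1
        if ReentrantInit._nest_level == 0:
            print("unpatching nn.Module constructors")  # placeholder for the real work

with ReentrantInit():
    with ReentrantInit():
        pass  # the nested context no longer removes the wrappers prematurely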
I could also cause an error with this code:
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    m1 = torch.nn.Linear(1,1)
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    m2 = torch.nn.Linear(1,1)
I encountered the following error when I ran a forward pass with m2:
RuntimeError: mat2 must be a matrix, got 1-D tensor
Please let me know if this is not the one you encountered.
This error seems to have a different cause than the nested Init. I will take a look now.
Hi @tohtana,
Thank you for investigating this issue.
I think there are multiple issues with nested zero.Init, so yes, I was planning to write yet another report for 2 adjacent (but not nested) calls just like you did, but didn't have the time to properly investigate.
So yes, we have at least 2 problems to fix here:
- nested zero.Init - this OP
- multiple zero.Init calls - your 2 calls example in your comment above
Perhaps you could file a report about your bug so that it doesn't fall through the cracks?
And yes, making zero.Init re-entrant should fix the 1st issue.
https://github.com/microsoft/DeepSpeed/issues/2812 ties closely into this group of problems related to multiple model inits at once, but it stands apart. If you simply disable the re-entrant calls, zero.Init won't know to inject the wrappers for the new models that get zero.Init-wrapped in a nested fashion, as I have presented in https://github.com/microsoft/DeepSpeed/issues/2812. This means that effectively those models will be loaded on cpu and are likely to cpu-oom for users with little cpu memory (more so if each process on a multi-gpu setup does it in parallel).
Therefore, wrt #2812, I think a possible solution would be not to skip zero.Init's functionality when it's re-entrant, but to check which new nn.Module subclasses were added and, if they don't have the DS hooks yet, insert those hooks too.
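Something along these lines, perhaps (a rough sketch with made-up names, not DeepSpeed's actual implementation): on re-entry, re-scan the nn.Module subclasses and hook only the ones that aren't hooked yet, e.g. classes imported dynamically after the outer zero.Init was entered.

import functools
import torch

_DS_PATCHED = "_sketch_ds_patched"  # hypothetical marker attribute

def patch_unpatched_subclasses(partition_after_init):
    # partition_after_init stands in for whatever hook zero.Init normally installs
    def _subclasses(cls):
        for sub in cls.__subclasses__():
            yield sub
            yield from _subclasses(sub)

    for cls in _subclasses(torch.nn.Module):
        init = cls.__dict__.get("__init__")
        if init is None or getattr(init, _DS_PATCHED, False):
            continue  # class has no __init__ of its own, or is already hooked

        @functools.wraps(init)
        def wrapped(self, *args, _orig=init, **kwargs):
            _orig(self, *args, **kwargs)
            partition_after_init(self)

        setattr(wrapped, _DS_PATCHED, True)
        cls.__init__ = wrapped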
Hello @stas00,
Thank you for the inputs, they are very helpful. As for the following pattern, the code I wrote for reproducing the issue was wrong. Sorry for that.
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    m1 = torch.nn.Linear(1,1)
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    m2 = torch.nn.Linear(1,1)
I ran forward and backward passes with the models, but I didn't see any error this time. You wrote that this code failed at training; I would appreciate it if you could give me more details.
I'm also working on #2812. I will report back as soon as I make some progress.
Ah, OK, thank you for retesting, @tohtana
I'm pretty sure the above didn't work with 2 real models - perhaps try 2 real models from transformers? Or we can just wait until I get enough time to dive into the 3rd issue.
Fixed by #2989. Will open new issue for #3202 as needed.