
Deepspeed Stage3 using trainer and base DONUT model results in RecursionError.

Open dennisbakhuis opened this issue 2 years ago • 19 comments

System Info

  • Running on AzureML Standard_NC6S_V3 with curated environment: AzureML-ACPT-pytorch-1.12-py39-cuda11.6-gpu
  • transformers version: 4.26.0
  • Platform: Linux-5.0.0-1036-azure-x86_64-with-glibc2.31
  • Python version: 3.9.15
  • Huggingface_hub version: 0.12.0
  • PyTorch version (GPU?): 1.12.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Through trainer
  • Using distributed or parallel set-up in script?: Through deepspeed/trainer

Who can help?

I am using a base DONUT model. The error only happens with DeepSpeed stage 3: @stas00

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

I am fine-tuning a DONUT-based model on an Azure Standard_NC6S_V3 (1 x V100 (16GB)) using AzureML. Below is a minimal example that reproduces the recursion error.

# Train script
import transformers
from transformers import (
    DonutProcessor,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    VisionEncoderDecoderModel,
)
from PIL import Image
import datasets


base_model = "naver-clova-ix/donut-base"
image_size = { "width": 680, "height": 960 }


def main():
    # Training arguments: fp16 with the DeepSpeed stage 3 config below
    training_args = Seq2SeqTrainingArguments(
        output_dir='./output',
        num_train_epochs=1,
        per_device_train_batch_size=2,
        fp16=True,
        deepspeed='deepspeed_config.json',
    )

    model = VisionEncoderDecoderModel.from_pretrained(base_model)
    processor = DonutProcessor.from_pretrained(base_model)
    
    # Resize image size in model/processor
    processor.image_processor.size = image_size
    model.config.encoder.image_size = tuple(processor.image_processor.size.values())[::-1]
    model.config.hidden_size = model.config.encoder.hidden_size  # Deepspeed needs this fix


    # Generate bogus dataset
    image = Image.new('RGB', (image_size['width'], image_size['height']))
    text = '{"great_key": "great_value"}'
    N = 16
    data = [{'image': image, 'text': text} for _ in range(N)]
    dataset = datasets.Dataset.from_list(data)

    # Tokenize bogus dataset
    def tokenize(example, processor):
        pixel_values = processor(
            example["image"],
            random_padding=True,
            return_tensors="pt",
        ).pixel_values.squeeze()

        input_ids = processor.tokenizer(  # type: ignore
            example["text"],
            add_special_tokens=False,
            max_length=512,
            padding="max_length",
            truncation=True,
            return_tensors="pt",
        )["input_ids"].squeeze(0)

        labels = input_ids.clone()

        return {
            "pixel_values": pixel_values,
            "labels": labels,
            "target_sequence": example["text"],
        }

    input_dataset = dataset.map(
        lambda x: tokenize(x, processor),
        remove_columns=['image', 'text'],
    )

    # Train
    trainer = Seq2SeqTrainer(
        model=model,
        args=training_args,
        train_dataset=input_dataset,
    )

    trainer.remove_callback(transformers.integrations.AzureMLCallback)

    trainer.train()

if __name__ == "__main__":
    main()

The deepspeed_config.json referenced above:

{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "train_batch_size": "auto", 
  "fp16": {
        "enabled": "auto"
  }
}

Probably not relevant, but here is the job submission script.

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.ai.ml import command

compute_name = ""
environment_name = ""


ml_client = MLClient.from_config(
    credential=DefaultAzureCredential(),
    path='/', 
)
environment = ml_client.environments.get(environment_name, label="latest")

fail_job = command(
    code='./fail_train',
    command="transformers-cli env && deepspeed --num_gpus 1 failure_train_script.py",
    compute=compute_name,
    environment=environment,
)

job = ml_client.jobs.create_or_update(
    fail_job,
    experiment_name="testing",
)

Expected behavior

When using DeepSpeed stage 2 everything works, but for large images I get an OOM on the 16GB V100 GPU. Therefore, I want to try DeepSpeed stage 3, but this results in the maximum recursion error.

From what I have read, the recursion error is due to DeepSpeed's zero initialisation; however, these bits are somewhat hidden when using the Trainer and I am not sure where to look. I am more than happy to investigate, but I definitely need some guidance (-:
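For reference, a quick way to confirm that the Trainer integration has switched on ZeRO-3 (and therefore zero.Init inside from_pretrained) is the is_deepspeed_zero3_enabled() helper. This is only a diagnostic sketch using the stage 3 config shown above:

# Diagnostic sketch: TrainingArguments(deepspeed=...) registers the DeepSpeed config,
# and from_pretrained later consults is_deepspeed_zero3_enabled() to decide whether
# to wrap model construction in deepspeed.zero.Init.
from transformers import Seq2SeqTrainingArguments
from transformers.deepspeed import is_deepspeed_zero3_enabled

args = Seq2SeqTrainingArguments(
    output_dir="./output",
    fp16=True,
    deepspeed="deepspeed_config.json",  # the stage 3 config shown above
)
print(is_deepspeed_zero3_enabled())  # True -> from_pretrained will use zero.Init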

I expect training to start, hopefully with enough memory savings that I can train a DONUT-based model on a V100 or a smaller GPU.

dennisbakhuis avatar Feb 09 '23 12:02 dennisbakhuis

Hi @dennisbakhuis

In a bit we will move this to https://github.com/microsoft/DeepSpeed/issues as this is not an integration problem.

As I discovered recently when trying to build a multi-modal model based on 2 pre-trained models, you can only use zero.Init once. If you use it again, it breaks with infinite recursion in deepspeed.initialize.

p.s. I edited the OP to remove other maintainers since this ticket is mine ;)

But let's try to unravel it here first and I think I have a workaround for you as well.

stas00 avatar Feb 10 '23 02:02 stas00

Meanwhile, the workaround I used was this: since one of the models was much smaller than the other, I initialized the smaller one without zero.Init and the other normally with zero.Init, and it worked. Is this a similar situation here, where the processor is much smaller than the CV model?
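In code, that pattern looks roughly like this (a toy sketch with stand-in nn.Linear modules rather than real models; run it under the deepspeed launcher as in the repros below):

# Toy sketch of the "only one zero.Init" workaround: construct the small model
# normally, and build only the large model inside deepspeed.zero.Init so that its
# parameters are partitioned at creation time.
import torch
import deepspeed

ds_config = dict(train_batch_size=1, zero_optimization=dict(stage=3))

small_model = torch.nn.Linear(16, 16)  # plain init, fully materialized on each rank

with deepspeed.zero.Init(config_dict_or_path=ds_config):
    big_model = torch.nn.Linear(4096, 4096)  # partitioned across ranks on creation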

I could rig up a workaround for you. But your situation is different - you have 2 separate models. Let me think.

OK, the processor is not a model, so it shouldn't even be loaded under zero.Init in the first place. This is interesting!

stas00 avatar Feb 10 '23 02:02 stas00

The minimal repro is just this:

from transformers import DonutProcessor, VisionEncoderDecoderModel
import torch
import deepspeed
from transformers.deepspeed import HfDeepSpeedConfig

ds_config = dict(train_batch_size=1, zero_optimization=dict(stage=3))

dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

deepspeed_engine, *_ = deepspeed.initialize(model=model, config_params=ds_config)

$ deepspeed --num_gpus 1 test.py
Traceback (most recent call last):
  File "test.py", line 13, in <module>
    deepspeed_engine, *_ = deepspeed.initialize(model=model, config_params=ds_config)
  File "/mnt/nvme0/code/github/00optimize/DeepSpeed-optim-grad-accessors/deepspeed/__init__.py", line 125, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/mnt/nvme0/code/github/00optimize/DeepSpeed-optim-grad-accessors/deepspeed/runtime/zero/partition_parameters.py", line 350, in wrapper
    if not hasattr(module, "_ds_child_entered"):
  File "/mnt/nvme0/code/github/00optimize/DeepSpeed-optim-grad-accessors/deepspeed/runtime/engine.py", line 490, in __getattr__
[...]
    if name in dir(self):
  File "/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2026, in __dir__
    parameters = list(self._parameters.keys())
  File "/mnt/nvme0/code/github/00optimize/DeepSpeed-optim-grad-accessors/deepspeed/runtime/engine.py", line 490, in __getattr__
    if name in dir(self):
  File "/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2026, in __dir__
    parameters = list(self._parameters.keys())
  File "/mnt/nvme0/code/github/00optimize/DeepSpeed-optim-grad-accessors/deepspeed/runtime/engine.py", line 490, in __getattr__
    if name in dir(self):
  File "/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2024, in __dir__
    module_attrs = dir(self.__class__)
RecursionError: maximum recursion depth exceeded while calling a Python object

The initial thought that the 2 from_pretrained calls caused it isn't the case; the problem is somewhere in the from_pretrained of this model.

stas00 avatar Feb 10 '23 03:02 stas00

The cause proved to be two from_config calls, each invoking the zero.Init context internally:

https://github.com/huggingface/transformers/blob/97d3390fc8edb210fcf0aad6a079406b018655b9/src/transformers/models/vision_encoder_decoder/modeling_vision_encoder_decoder.py#L191-L195

stas00 avatar Feb 10 '23 03:02 stas00

BTW, do you have enough cpu memory to load this model?

In this case a temporary hack is very simple: just disable the zero.Init contexts directly:

diff --git a/src/transformers/modeling_utils.py b/src/transformers/modeling_utils.py
index c9d304f25..c2e530275 100644
--- a/src/transformers/modeling_utils.py
+++ b/src/transformers/modeling_utils.py
@@ -1085,7 +1085,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
         if torch_dtype is not None:
             dtype_orig = cls._set_default_torch_dtype(torch_dtype)

-        if is_deepspeed_zero3_enabled():
+        if 0: # is_deepspeed_zero3_enabled():
             import deepspeed

             logger.info("Detected DeepSpeed ZeRO-3: activating zero.init() for this model")
@@ -2453,7 +2453,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
         # Instantiate model.
         init_contexts = [no_init_weights(_enable=_fast_init)]

-        if is_deepspeed_zero3_enabled():
+        if 0: #is_deepspeed_zero3_enabled():
             import deepspeed

             logger.info("Detected DeepSpeed ZeRO-3: activating zero.init() for this model")

This should unblock you. Let me know if it does.
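If editing the installed files is awkward (for example inside a container), an equivalent, untested runtime-only hack would be to temporarily monkeypatch the check that those two code paths consult:

# Untested sketch: make the ZeRO-3 check in modeling_utils return False while the
# model is being built, so from_pretrained/from_config skip zero.Init, then restore
# the original function before handing the model to the Trainer.
from transformers import VisionEncoderDecoderModel
import transformers.modeling_utils as modeling_utils

_orig_check = modeling_utils.is_deepspeed_zero3_enabled
modeling_utils.is_deepspeed_zero3_enabled = lambda: False
try:
    model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")
finally:
    modeling_utils.is_deepspeed_zero3_enabled = _orig_check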

The accelerate DeepSpeed integration has separated zero3 and zero.Init, which was a smart move, so with a single flag you can disable zero.Init while still using zero3. When designing the HF Trainer integration I made the wrong assumption that someone wanting to use zero3 would always want to use zero.Init, but as you can see there are rare cases where that isn't the case.
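For readers using the accelerate integration directly, that flag is the zero3_init_flag argument of DeepSpeedPlugin; a rough sketch (the config path here is illustrative):

# Rough sketch: run ZeRO-3 through accelerate while disabling zero.Init
# (zero3_init_flag=False); the json file is the same kind of stage 3 config as above.
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

ds_plugin = DeepSpeedPlugin(
    hf_ds_config="deepspeed_config.json",  # illustrative path to a stage 3 config
    zero3_init_flag=False,                 # keep ZeRO-3, skip zero.Init
)
accelerator = Accelerator(deepspeed_plugin=ds_plugin)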

Meanwhile I will try to reduce this to a simple test case that we can present to the Deepspeed team to make it work.

stas00 avatar Feb 10 '23 03:02 stas00

OK, I reduced the problem to this repro:

import torch
import deepspeed

ds_config = dict(train_batch_size=1, zero_optimization=dict(stage=3))

class MyModel(torch.nn.Module):
    def __init__(self, m1):
        super().__init__()
        self.m1 = m1

with deepspeed.zero.Init(config_dict_or_path=ds_config):

    with deepspeed.zero.Init(config_dict_or_path=ds_config):
        m1 = torch.nn.Linear(1,1)

    model = MyModel(m1)

deepspeed_engine, *_ = deepspeed.initialize(model=model, config_params=ds_config)

stas00 avatar Feb 10 '23 06:02 stas00

OK, I filed the report here: https://github.com/microsoft/DeepSpeed/issues/2811

stas00 avatar Feb 10 '23 06:02 stas00

Hi @stas00,

Thanks for the elaborate answers and way of thought.

Let me rephrase what I understood: deepspeed.zero.Init should only be called once. This is something I have seen mentioned in other issues in the DeepSpeed repo as well. As we have an encoder + decoder, we practically have two models, each of which does a deepspeed.zero.Init during the .from_config method.

What is unclear to me is who to "blame" (in a positive sense (-;). If we are only supposed to call deepspeed.zero.Init once, something in transformers should be fixed, whereas if nested deepspeed.zero.Init should be allowed (as in your minimal example), DeepSpeed needs a fix.

Just thinking out loud.

I will try your suggested hacky fix and will report later.

dennisbakhuis avatar Feb 10 '23 08:02 dennisbakhuis

> deepspeed.zero.Init should only be called once

at the moment, yes

> What is unclear to me is who to "blame" (in a positive sense (-;). ...

If you read my bug report https://github.com/microsoft/DeepSpeed/issues/2811 you will see it already asks exactly those questions.

And there is a second problem that will emerge once the first one is fixed, see https://github.com/microsoft/DeepSpeed/issues/2812. I discovered it some months back, and again yesterday when I was hoping to give you a simpler hack, specifically disabling zero.Init only for from_config in the diff I shared. I have some hacky ideas to solve it, but not yet an elegant solution.

Meanwhile I will ponder how we can fix this on the integration side. This should be totally doable; I just need to find an elegant way of doing it.

Mind you, composed models are a new thing, so a new need calls for a new solution.

stas00 avatar Feb 10 '23 17:02 stas00

I can confirm that with the hacky solution as shown in https://github.com/huggingface/transformers/issues/21538#issuecomment-1425138938, the recursion error is gone. It took a bit longer as I had to patch the files from within a container on Azure, something I do not do every day.

Unfortunately, you were also right that I still get an OOM on the 16GB V100 from Azure. I was hoping that with parameter offloading the model would fit. I will try to fiddle a bit with the DeepSpeed parameters, but I guess I will have to use gradient accumulation.
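For example, something along these lines (illustrative values only):

# Illustrative only: trade micro-batch size for gradient accumulation, and enable
# gradient checkpointing (if the model supports it) to lower peak GPU memory.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,   # smaller micro-batch
    gradient_accumulation_steps=8,   # effective batch size of 8
    gradient_checkpointing=True,     # recompute activations to save memory
    fp16=True,
    deepspeed="deepspeed_config.json",
)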

dennisbakhuis avatar Feb 21 '23 07:02 dennisbakhuis

Thank you for doing the experiment, Dennis. Glad to hear it worked.

The Deepspeed team are actively working on resolving these 2 issues: https://github.com/microsoft/DeepSpeed/issues/2811, https://github.com/microsoft/DeepSpeed/issues/2812 so hopefully we should have a working solution soon, which would require just updating your deepspeed version.

stas00 avatar Feb 21 '23 19:02 stas00

How about this one: https://github.com/microsoft/DeepSpeed/issues/2637? It seems the only option is to disable zero.Init with accelerate.

dumpmemory avatar Mar 10 '23 01:03 dumpmemory

This issue is being addressed in:

  • https://github.com/microsoft/DeepSpeed/issues/2811
  • https://github.com/microsoft/DeepSpeed/issues/2812

which I think should resolve the leak as well. The Deepspeed team are actively working on solving both.

stas00 avatar Mar 10 '23 01:03 stas00

Actually @tohtana has just created a PR that is supposed to fix both issues: https://github.com/microsoft/DeepSpeed/pull/2989

I will be able to try it probably tomorrow, but please go ahead and try it and send any yay/nay feedback to that PR if you do. Thank you!

stas00 avatar Mar 10 '23 01:03 stas00

> Actually @tohtana has just created a PR that is supposed to fix both issues: microsoft/DeepSpeed#2989
>
> I will be able to try it probably tomorrow, but please go ahead and try it and send any yay/nay feedback to that PR if you do. Thank you!

I will, thanks. If I get a result, I will post it here.

dumpmemory avatar Mar 10 '23 03:03 dumpmemory

https://github.com/microsoft/DeepSpeed/issues/2637 still occurs with https://github.com/microsoft/DeepSpeed/pull/2989.

My setup is described here: https://github.com/huggingface/peft/issues/161

dumpmemory avatar Mar 10 '23 04:03 dumpmemory

@dennisbakhuis, please try this PR: https://github.com/microsoft/DeepSpeed/pull/2989 - I tested it and your repro now works.

You will also need to add the following at the top level of ds_config.json (this is an unrelated change):

  "zero_force_ds_cpu_optimizer": false,

Please let me know if it works for you.

stas00 avatar Mar 13 '23 04:03 stas00

Thank you for testing https://github.com/microsoft/DeepSpeed/pull/2989, @dumpmemory - sorry to hear it didn't resolve the leak. Perhaps file a new issue in DS; for the one I posted I couldn't provide a repro script since it was part of a complex system, but perhaps you can. That should help a lot with solving it.

stas00 avatar Mar 13 '23 05:03 stas00

> Thank you for testing microsoft/DeepSpeed#2989, @dumpmemory - sorry to hear it didn't resolve the leak. Perhaps file a new issue in DS; for the one I posted I couldn't provide a repro script since it was part of a complex system, but perhaps you can. That should help a lot with solving it.

Thanks for your response. I have posted an issue there. Thanks again.

dumpmemory avatar Mar 15 '23 15:03 dumpmemory

https://github.com/microsoft/DeepSpeed/pull/2989 has been merged, so closing this Issue.

To verify the solution, please use deepspeed@master until the next release (0.9.1?) is made.

stas00 avatar Apr 14 '23 18:04 stas00