DeepSpeed Stage 3 using Trainer and a base DONUT model results in RecursionError.
System Info
- Running on AzureML Standard_NC6S_V3 with curated environment: AzureML-ACPT-pytorch-1.12-py39-cuda11.6-gpu
- transformers version: 4.26.0
- Platform: Linux-5.0.0-1036-azure-x86_64-with-glibc2.31
- Python version: 3.9.15
- Huggingface_hub version: 0.12.0
- PyTorch version (GPU?): 1.12.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Through trainer
- Using distributed or parallel set-up in script?: Through deepspeed/trainer
Who can help?
I am using a base DONUT model. The error only happens with DeepSpeed Stage 3: @stas00
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
I am fine-tuning a DONUT-based model on an Azure Standard_NC6S_V3 (1 x V100 (16GB)) using AzureML. Below is a minimal example that reproduces the recursion error.
# Train script
import transformers
from transformers import (
DonutProcessor,
Seq2SeqTrainer,
Seq2SeqTrainingArguments,
VisionEncoderDecoderModel,
)
from PIL import Image
import datasets
base_model = "naver-clova-ix/donut-base"
image_size = { "width": 680, "height": 960 }
def main():
# Main
training_args = Seq2SeqTrainingArguments(
output_dir='./output',
num_train_epochs=1,
per_device_train_batch_size=2,
fp16=True,
deepspeed='deepspeed_config.json',
)
model = VisionEncoderDecoderModel.from_pretrained(base_model)
processor = DonutProcessor.from_pretrained(base_model)
# Resize image size in model/processor
processor.image_processor.size = image_size
model.config.encoder.image_size = tuple(processor.image_processor.size.values())[::-1]
model.config.hidden_size = model.config.encoder.hidden_size # Deepspeed needs this fix
# Generate bogus dataset
image = Image.new('RGB', (image_size['width'], image_size['height']))
text = '{"great_key": "great_value"}'
N = 16
data = [{'image': image, 'text': text} for _ in range(N)]
dataset = datasets.Dataset.from_list(data)
# Tokenize bogus dataset
def tokenize(example, processor):
pixel_values = processor(
example["image"],
random_padding=True,
return_tensors="pt",
).pixel_values.squeeze()
input_ids = processor.tokenizer( # type: ignore
example["text"],
add_special_tokens=False,
max_length=512,
padding="max_length",
truncation=True,
return_tensors="pt",
)["input_ids"].squeeze(0)
labels = input_ids.clone()
return {
"pixel_values": pixel_values,
"labels": labels,
"target_sequence": example["text"],
}
input_dataset = dataset.map(
lambda x: tokenize(x, processor),
remove_columns=['image', 'text'],
)
# Train
trainer = Seq2SeqTrainer(
model=model,
args=training_args,
train_dataset=input_dataset,
)
trainer.remove_callback(transformers.integrations.AzureMLCallback)
trainer.train()
if __name__ == "__main__":
main()
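The deepspeed_config.json referenced in the training arguments above: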
{
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"train_batch_size": "auto",
"fp16": {
"enabled": "auto"
}
}
Probably not relevant, but here is the job submission script.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from azure.ai.ml import command
compute_name = ""
environment_name = ""
ml_client = MLClient.from_config(
credential=DefaultAzureCredential(),
path='/',
)
environment = ml_client.environments.get(environment_name, label="latest")
fail_job = command(
code='./fail_train',
command="transformers-cli env && deepspeed --num_gpus 1 failure_train_script.py",
compute=compute_name,
environment=environment,
)
job = ml_client.jobs.create_or_update(
fail_job,
experiment_name="testing",
)
Expected behavior
With DeepSpeed Stage 2 everything works, but for large images I get an OOM on the 16GB V100. Therefore, I want to try DeepSpeed Stage 3, but this results in the maximum recursion error.
From what I have read, the recursion error is due to DeepSpeed's ZeRO initialization; however, these bits are somewhat hidden when using the Trainer, and I am not sure where to look. I am more than happy to investigate, but I definitely need some guidance (-:
I expect training to start, hopefully with some memory savings, such that I can train a DONUT-based model on a V100 or a smaller GPU.
Hi @dennisbakhuis
In a bit we will move this to https://github.com/microsoft/DeepSpeed/issues, as this is not an integration problem.
As I discovered recently when trying to build a multi-modal model based on 2 pre-trained models, you can only use zero.Init once. If you use it again, it breaks (infinite recursion) in deepspeed.initialize.
p.s. I edited the OP to remove other maintainers since this ticket is mine ;)
But let's try to unravel it here first and I think I have a workaround for you as well.
Meanwhile, the workaround I did is this: as one of the models was much smaller than the other, I initialized the smaller one w/o zero.Init and the other normally w/ zero.Init, and it worked. Is this a similar situation here, i.e. is the processor much smaller than the CV model?
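Roughly, the pattern was the following - a minimal sketch with placeholder modules standing in for the real models, just to show the shape of the workaround:
import deepspeed
import torch

ds_config = dict(train_batch_size=1, zero_optimization=dict(stage=3))

# Large model: constructed inside zero.Init, so its parameters are
# partitioned at construction time.
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    big = torch.nn.Linear(4096, 4096)

# Small model: constructed normally, outside any zero.Init context.
small = torch.nn.Linear(8, 8)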
I could rig up a workaround for you. But your situation is different - you have 2 separate models. Let me think.
OK, the processor is not a model, so it shouldn't even be instantiated under zero.Init in the first place. This is interesting!
The minimal repro is just this:
from transformers import DonutProcessor, VisionEncoderDecoderModel
import torch
import deepspeed
from transformers.deepspeed import HfDeepSpeedConfig
ds_config = dict(train_batch_size=1, zero_optimization=dict(stage=3))
dschf = HfDeepSpeedConfig(ds_config) # keep this object alive
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
deepspeed_engine, *_ = deepspeed.initialize(model=model, config_params=ds_config)
$ deepspeed --num_gpus 1 test.py
Traceback (most recent call last):
File "test.py", line 13, in <module>
deepspeed_engine, *_ = deepspeed.initialize(model=model, config_params=ds_config)
File "/mnt/nvme0/code/github/00optimize/DeepSpeed-optim-grad-accessors/deepspeed/__init__.py", line 125, in initialize
engine = DeepSpeedEngine(args=args,
File "/mnt/nvme0/code/github/00optimize/DeepSpeed-optim-grad-accessors/deepspeed/runtime/zero/partition_parameters.py", line 350, in wrapper
if not hasattr(module, "_ds_child_entered"):
File "/mnt/nvme0/code/github/00optimize/DeepSpeed-optim-grad-accessors/deepspeed/runtime/engine.py", line 490, in __getattr__
[...]
if name in dir(self):
File "/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2026, in __dir__
parameters = list(self._parameters.keys())
File "/mnt/nvme0/code/github/00optimize/DeepSpeed-optim-grad-accessors/deepspeed/runtime/engine.py", line 490, in __getattr__
if name in dir(self):
File "/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2026, in __dir__
parameters = list(self._parameters.keys())
File "/mnt/nvme0/code/github/00optimize/DeepSpeed-optim-grad-accessors/deepspeed/runtime/engine.py", line 490, in __getattr__
if name in dir(self):
File "/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2024, in __dir__
module_attrs = dir(self.__class__)
RecursionError: maximum recursion depth exceeded while calling a Python object
The initial thought that 2 from_pretrained calls caused it isn't the case; the problem is somewhere in the from_pretrained of this model.
The cause proved to be 2 from_config calls, each invoking a zero.Init context internally:
https://github.com/huggingface/transformers/blob/97d3390fc8edb210fcf0aad6a079406b018655b9/src/transformers/models/vision_encoder_decoder/modeling_vision_encoder_decoder.py#L191-L195
BTW, do you have enough CPU memory to load this model?
In this case a temporary hack would be very simple: just disable the zero.Init contexts directly:
diff --git a/src/transformers/modeling_utils.py b/src/transformers/modeling_utils.py
index c9d304f25..c2e530275 100644
--- a/src/transformers/modeling_utils.py
+++ b/src/transformers/modeling_utils.py
@@ -1085,7 +1085,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
if torch_dtype is not None:
dtype_orig = cls._set_default_torch_dtype(torch_dtype)
- if is_deepspeed_zero3_enabled():
+ if 0: # is_deepspeed_zero3_enabled():
import deepspeed
logger.info("Detected DeepSpeed ZeRO-3: activating zero.init() for this model")
@@ -2453,7 +2453,7 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, GenerationMixin, PushToHubMix
# Instantiate model.
init_contexts = [no_init_weights(_enable=_fast_init)]
- if is_deepspeed_zero3_enabled():
+ if 0: #is_deepspeed_zero3_enabled():
import deepspeed
logger.info("Detected DeepSpeed ZeRO-3: activating zero.init() for this model")
This should unblock you. Let me know if it does.
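If editing the installed library files is awkward (e.g. inside a container), roughly the same effect can be had from the training script itself. This is just a sketch of the idea, not an official API: it assumes modeling_utils looks up is_deepspeed_zero3_enabled as a module-level name (which it does in 4.26), and it disables zero.Init for all from_pretrained/from_config calls, while the Trainer's DeepSpeed side should keep running ZeRO-3 since it does not go through this name.
import transformers.modeling_utils as modeling_utils

# Shadow the check so model construction skips the zero.Init contexts.
modeling_utils.is_deepspeed_zero3_enabled = lambda: False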
The accelerate DeepSpeed integration has separated ZeRO-3 and zero.Init, which was a smart move, so with a single flag you can disable zero.Init while still using ZeRO-3. When designing the HF Trainer integration I made the wrong assumption that someone wanting to use ZeRO-3 would always want to use zero.Init, but as you can see there are rare cases when that's not the case.
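For reference, on the accelerate side that flag looks roughly like this (a sketch of the accelerate path, which is not the Trainer path used in this issue):
from accelerate import Accelerator, DeepSpeedPlugin

# ZeRO-3 still partitions parameters at deepspeed.initialize time, but the
# model is constructed without zero.Init.
ds_plugin = DeepSpeedPlugin(zero_stage=3, zero3_init_flag=False)
accelerator = Accelerator(mixed_precision="fp16", deepspeed_plugin=ds_plugin)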
Meanwhile I will try to reduce this to a simple test case that we can present to the Deepspeed team to make it work.
OK, I reduced the problem to this repro:
import torch
import deepspeed
ds_config = dict(train_batch_size=1, zero_optimization=dict(stage=3))
class MyModel(torch.nn.Module):
def __init__(self, m1):
super().__init__()
self.m1 = m1
with deepspeed.zero.Init(config_dict_or_path=ds_config):
with deepspeed.zero.Init(config_dict_or_path=ds_config):
m1 = torch.nn.Linear(1,1)
model = MyModel(m1)
deepspeed_engine, *_ = deepspeed.initialize(model=model, config_params=ds_config)
OK, I filed the report here: https://github.com/microsoft/DeepSpeed/issues/2811
Hi @stas00,
Thanks for the elaborate answers and the detailed line of thought.
Let me rephrase what I understood: deepspeed.zero.Init should only be called once. This is something I have seen mentioned in other issues in the DeepSpeed repo as well. As we have an encoder + decoder, we practically have two models, each of which does a deepspeed.zero.Init during the .from_config method.
What is unclear to me is whom to "blame" (in a positive sense (-;). If we are only supposed to call deepspeed.zero.Init once, something in transformers should be fixed, while if nested deepspeed.zero.Init should be allowed (as in your minimal example), DeepSpeed needs a fix.
Just thinking out loud.
I will try your suggested hacky fix and will report later.
> deepspeed.zero.Init should only be called once
at the moment, yes
> What is unclear to me is whom to "blame" (in a positive sense (-;). ...
If you read my bug report https://github.com/microsoft/DeepSpeed/issues/2811, you will see that it already asks your exact questions.
And there is a 2nd problem that will emerge once the first one is fixed, see https://github.com/microsoft/DeepSpeed/issues/2812 - I discovered it some months back, and ran into it again yesterday when I was hoping to give you a simpler hack - specifically, disabling zero.Init only for from_config in the diff I shared. I have some hacky ideas for solving it, but not yet an elegant solution.
Meanwhile I will ponder how we can fix this on the integration side. This should be totally doable; I just need to find an elegant way of doing it.
Mind you, composed models are a new thing, so a new need calls for a new solution.
I can confirm that with the hacky solution shown in https://github.com/huggingface/transformers/issues/21538#issuecomment-1425138938, the recursion error is gone. It took a bit longer, as I had to patch the files from within a container on Azure, something I do not do every day.
Unfortunately, you were also right that I still get an OOM on the 16GB V100 from Azure. I was hoping that with parameter offloading the model would possibly fit. I will try to fiddle a bit with the DeepSpeed parameters, but I guess I have to use gradient accumulation.
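For example, something along these lines - just a sketch, the numbers are arbitrary, and the effective batch size stays per_device_train_batch_size * gradient_accumulation_steps:
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir='./output',
    num_train_epochs=1,
    per_device_train_batch_size=1,   # smaller micro-batch to reduce activation memory
    gradient_accumulation_steps=4,   # keep a useful effective batch size
    fp16=True,
    deepspeed='deepspeed_config.json',
)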
Thank you for doing the experiment, Dennis. Glad to hear it worked.
The Deepspeed team are actively working on resolving these 2 issues: https://github.com/microsoft/DeepSpeed/issues/2811, https://github.com/microsoft/DeepSpeed/issues/2812 so hopefully we should have a working solution soon, which would require just updating your deepspeed version.
How about this one: https://github.com/microsoft/DeepSpeed/issues/2637? It seems the only option is to disable zero.Init with accelerate.
This issue is being addressed in:
- https://github.com/microsoft/DeepSpeed/issues/2811
- https://github.com/microsoft/DeepSpeed/issues/2812
which I think should resolve the leak as well. The Deepspeed team are actively working on solving both.
Actually @tohtana has just created a PR that is supposed to fix both issues: https://github.com/microsoft/DeepSpeed/pull/2989
I will be able to try it probably tomorrow, but please go ahead and try it and send any yay/nay feedback to that PR if you do. Thank you!
I will, thanks. If I get a result, I will update it here.
https://github.com/microsoft/DeepSpeed/issues/2637 still exists with https://github.com/microsoft/DeepSpeed/pull/2989.
My setup is here: https://github.com/huggingface/peft/issues/161
@dennisbakhuis, please try with this PR https://github.com/microsoft/DeepSpeed/pull/2989 - I tested and your repro now works.
You will also need to add the following at the top level of ds_config.json (this is an unrelated change):
"zero_force_ds_cpu_optimizer": false,
Please let me know if it works for you.
Thank you for testing https://github.com/microsoft/DeepSpeed/pull/2989, @dumpmemory - sorry to hear it didn't resolve the leak. Perhaps file a new issue in DS: for the one I posted I couldn't provide a repro script, as it was part of a complex system, but perhaps you can. That should help a lot with solving it.
Thanks for your response. I have posted an issue there. Thanks again.
https://github.com/microsoft/DeepSpeed/pull/2989 has been merged, so closing this Issue.
To verify the solution, please use deepspeed@master until the next release (0.9.1?) is made.