Stas Bekman
For the timeline questions we need to ask @sgugger
Hi @dennisbakhuis, in a bit we will move this to https://github.com/microsoft/DeepSpeed/issues as this is not an integration problem. As I discovered this recently when trying to build a multi-modal model...
Meanwhile, the workaround I used is this: since one of the models was much smaller than the other, I initialized the smaller one w/o `zero.Init` and the other normally w/...
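For illustration, here is a simplified sketch of that idea (the model names are placeholders and this is not the exact code from my setup); it relies on the fact that `from_pretrained` only enters `zero.Init` while an `HfDeepSpeedConfig` object is alive:
```
from transformers import AutoModel
from transformers.deepspeed import HfDeepSpeedConfig

ds_config = dict(train_batch_size=1, zero_optimization=dict(stage=3))

# load the small model first, before ZeRO-3 is announced to transformers,
# so from_pretrained does NOT wrap its construction in zero.Init
small_model = AutoModel.from_pretrained("small-model-name")  # placeholder

# now enable the ZeRO-3 integration; the big model is built under zero.Init as usual
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive
big_model = AutoModel.from_pretrained("big-model-name")  # placeholder

# the two sub-models are then combined into the multi-modal model
# and the result is passed to deepspeed.initialize
```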
The minimal repro is just this:
```
from transformers import DonutProcessor, VisionEncoderDecoderModel
import torch
import deepspeed
from transformers.deepspeed import HfDeepSpeedConfig

ds_config = dict(train_batch_size=1, zero_optimization=dict(stage=3))
dschf = HfDeepSpeedConfig(ds_config)  # keep this...
```
The cause proved to be two `from_config` calls, each invoking the `zero.Init` context internally: https://github.com/huggingface/transformers/blob/97d3390fc8edb210fcf0aad6a079406b018655b9/src/transformers/models/vision_encoder_decoder/modeling_vision_encoder_decoder.py#L191-L195
BTW, do you have enough CPU memory to load this model? If so, a temporary hack would be very simple: just disable the `zero.Init` contexts directly:
```
diff --git...
```
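The actual diff got cut off above, so here is a rough runtime-level illustration of the same idea (not the original patch, a different route to the same effect): avoid entering `zero.Init` altogether by not having an `HfDeepSpeedConfig` object alive while the model is loaded, and let `deepspeed.initialize` partition the already-materialized weights afterwards. The checkpoint name is a placeholder:
```
import deepspeed
from transformers import VisionEncoderDecoderModel
from transformers.deepspeed import HfDeepSpeedConfig

ds_config = dict(train_batch_size=1, zero_optimization=dict(stage=3))

# no HfDeepSpeedConfig exists yet, so from_pretrained loads the full model on cpu
# without entering any zero.Init context (this is where the cpu memory is needed)
model = VisionEncoderDecoderModel.from_pretrained("some-donut-checkpoint")  # placeholder

# re-enable the HF side of the ZeRO-3 integration only now, if the rest of the code needs it
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

# ZeRO-3 partitioning of the already-loaded weights happens inside deepspeed.initialize
engine, *_ = deepspeed.initialize(model=model, config=ds_config)
```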
OK, I reduced the problem to this repro:
```
import torch
import deepspeed

ds_config = dict(train_batch_size=1, zero_optimization=dict(stage=3))

class MyModel(torch.nn.Module):
    def __init__(self, m1):
        super().__init__()
        self.m1 = m1

with deepspeed.zero.Init(config_dict_or_path=ds_config):
    with deepspeed.zero.Init(config_dict_or_path=ds_config):
        ...
```
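Since the repro above is truncated, here is a self-contained sketch of the same nested-context pattern (a reconstruction for illustration, not necessarily identical to the code that was filed); it is meant to be run with the deepspeed launcher, e.g. `deepspeed --num_gpus 1 repro.py`:
```
import torch
import deepspeed

ds_config = dict(train_batch_size=1, zero_optimization=dict(stage=3))

class MyModel(torch.nn.Module):
    def __init__(self, m1):
        super().__init__()
        self.m1 = m1

# the outer context mimics from_pretrained, the inner one mimics from_config
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    with deepspeed.zero.Init(config_dict_or_path=ds_config):
        m1 = torch.nn.Linear(10, 10)
    model = MyModel(m1)

# gather the ZeRO-3-partitioned weight back to inspect what the nesting did to it
with deepspeed.zero.GatheredParameters(model.m1.weight):
    print(model.m1.weight)
```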
OK, I filed the report here: https://github.com/microsoft/DeepSpeed/issues/2811
> deepspeed.zero.Init should only be called once at the moment

yes

> What is unclear to me is who to "blame" (in a positive sense (-;).

... If you read...
Thank you for doing the experiment, Dennis. Glad to hear it worked. The DeepSpeed team is actively working on resolving these two issues: https://github.com/microsoft/DeepSpeed/issues/2811 and https://github.com/microsoft/DeepSpeed/issues/2812, so hopefully we should have...