Stas Bekman
For the timeline questions we need to ask @sgugger
Hi @dennisbakhuis, in a bit we will move this to https://github.com/microsoft/DeepSpeed/issues as this is not an integration problem. As I discovered this recently when trying to build a multi-modal model...
Meanwhile, the workaround I used is this: since one of the models was much smaller than the other, I initialized the smaller one w/o `zero.Init` and the other normally w/...
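For illustration, here is a simplified sketch of that idea (the model names are placeholders and this is not the exact code from my setup); it relies on the fact that `from_pretrained` only enters `zero.Init` while an `HfDeepSpeedConfig` object is alive:
```
from transformers import AutoModel
from transformers.deepspeed import HfDeepSpeedConfig

ds_config = dict(train_batch_size=1, zero_optimization=dict(stage=3))

# load the small model first, before ZeRO-3 is announced to transformers,
# so from_pretrained does NOT wrap its construction in zero.Init
small_model = AutoModel.from_pretrained("small-model-name")  # placeholder

# now enable the ZeRO-3 integration; the big model is built under zero.Init as usual
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive
big_model = AutoModel.from_pretrained("big-model-name")  # placeholder

# the two sub-models are then combined into the multi-modal model
# and the result is passed to deepspeed.initialize
```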
The minimal repro is just this:
```
from transformers import DonutProcessor, VisionEncoderDecoderModel
import torch
import deepspeed
from transformers.deepspeed import HfDeepSpeedConfig

ds_config = dict(train_batch_size=1, zero_optimization=dict(stage=3))
dschf = HfDeepSpeedConfig(ds_config)  # keep this...
```
The cause proved to be two `from_config` calls, each invoking the `zero.Init` context internally: https://github.com/huggingface/transformers/blob/97d3390fc8edb210fcf0aad6a079406b018655b9/src/transformers/models/vision_encoder_decoder/modeling_vision_encoder_decoder.py#L191-L195
BTW, do you have enough CPU memory to load this model? If so, a temporary hack would be very simple: just disable the `zero.Init` contexts directly:
```
diff --git...
```
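The actual diff got cut off above, so here is a rough runtime-level illustration of the same idea (not the original patch, a different route to the same effect): avoid entering `zero.Init` altogether by not having an `HfDeepSpeedConfig` object alive while the model is loaded, and let `deepspeed.initialize` partition the already-materialized weights afterwards. The checkpoint name is a placeholder:
```
import deepspeed
from transformers import VisionEncoderDecoderModel
from transformers.deepspeed import HfDeepSpeedConfig

ds_config = dict(train_batch_size=1, zero_optimization=dict(stage=3))

# no HfDeepSpeedConfig exists yet, so from_pretrained loads the full model on cpu
# without entering any zero.Init context (this is where the cpu memory is needed)
model = VisionEncoderDecoderModel.from_pretrained("some-donut-checkpoint")  # placeholder

# re-enable the HF side of the ZeRO-3 integration only now, if the rest of the code needs it
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

# ZeRO-3 partitioning of the already-loaded weights happens inside deepspeed.initialize
engine, *_ = deepspeed.initialize(model=model, config=ds_config)
```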
OK, I reduced the problem to this repro:
```
import torch
import deepspeed

ds_config = dict(train_batch_size=1, zero_optimization=dict(stage=3))

class MyModel(torch.nn.Module):
    def __init__(self, m1):
        super().__init__()
        self.m1 = m1

with deepspeed.zero.Init(config_dict_or_path=ds_config):
    with deepspeed.zero.Init(config_dict_or_path=ds_config):
        ...
```
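Since the repro above is truncated, here is a self-contained sketch of the same nested-context pattern (a reconstruction for illustration, not necessarily identical to the code that was filed); it is meant to be run with the deepspeed launcher, e.g. `deepspeed --num_gpus 1 repro.py`:
```
import torch
import deepspeed

ds_config = dict(train_batch_size=1, zero_optimization=dict(stage=3))

class MyModel(torch.nn.Module):
    def __init__(self, m1):
        super().__init__()
        self.m1 = m1

# the outer context mimics from_pretrained, the inner one mimics from_config
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    with deepspeed.zero.Init(config_dict_or_path=ds_config):
        m1 = torch.nn.Linear(10, 10)
    model = MyModel(m1)

# gather the ZeRO-3-partitioned weight back to inspect what the nesting did to it
with deepspeed.zero.GatheredParameters(model.m1.weight):
    print(model.m1.weight)
```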
OK, I filed the report here: https://github.com/microsoft/DeepSpeed/issues/2811
> deepspeed.zero.Init should only be called once at the moment

yes

> What is unclear to me is who to "blame" (in a positive sense (-;).

... If you read...
Thank you for doing the experiment, Dennis. Glad to hear it worked. The DeepSpeed team is actively working on resolving these two issues: https://github.com/microsoft/DeepSpeed/issues/2811 and https://github.com/microsoft/DeepSpeed/issues/2812, so hopefully we should have...