
[BUG] state dict loading issue when running an example in https://github.com/huggingface/transformers-bloom-inference/tree/main/bloom-inference-scripts#run

Open liligwu opened this issue 2 years ago • 1 comments

Describe the bug
The following error appears when running the example in https://github.com/huggingface/transformers-bloom-inference/tree/main/bloom-inference-scripts#run with this command:

deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-fp16

│ /opt/conda/lib/python3.8/site-packages/deepspeed/inference/engine.py:388 in _load_checkpoint     │
│                                                                                                  │
│   385 │   │   else:                                                                              │
│   386 │   │   │   mp_rank = 0 if self.mpu is None else self.mpu.get_model_parallel_rank()        │
│   387 │   │   │                                                                                  │
│ ❱ 388 │   │   │   load_path, checkpoint, quantize_config = sd_loader.load(self._config.tensor_   │
│   389 │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   mp_rank,               │
│   390 │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   is_pipe_parallel=is_   │
│   391 │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   │   quantize=(self._conf   │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'dict' object has no attribute 'load'

To Reproduce
Steps to reproduce the behavior:

  1. Switch kernel_inject to False in https://github.com/huggingface/transformers-bloom-inference/blob/7bea3526d8270b4aeeefecc57d7d7d638e2bbe0e/bloom-inference-scripts/bloom-ds-inference.py#L122
  2. Run deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-fp16 as described in https://github.com/huggingface/transformers-bloom-inference/tree/main/bloom-inference-scripts

Docker context rocm/pytorch:latest

Additional context
Running on ROCm, but this issue shouldn't be hardware-dependent. According to https://github.com/microsoft/DeepSpeed/blob/5f5cc82415b11ab2c9bf85969deb510fe8631446/deepspeed/inference/engine.py#L433 and https://github.com/microsoft/DeepSpeed/blob/5f5cc82415b11ab2c9bf85969deb510fe8631446/deepspeed/runtime/state_dict_factory.py#L37, sd_loader is a dict and has no load method.
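
For readers unfamiliar with this failure mode, here is a minimal sketch in plain Python of what the report describes: a factory hands back the parsed checkpoint description (a dict) where the caller expects an object with a load method. The names get_loader_sketch and checkpoint_json and the sample keys are hypothetical stand-ins for illustration, not DeepSpeed's API; only the resulting AttributeError matches the traceback above.

```python
# Hypothetical stand-in for a loader factory; this is NOT DeepSpeed code.
def get_loader_sketch(checkpoint_json):
    # Assumption for illustration: on some code path the factory returns the
    # parsed JSON dict unchanged instead of wrapping it in a loader object.
    return checkpoint_json


# Hypothetical checkpoint description; the keys are illustrative only.
sd_loader = get_loader_sketch({"type": "example", "checkpoints": []})

# The caller assumes a loader object and calls .load(), which a dict lacks.
try:
    sd_loader.load(1, 0)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)

# A defensive check a caller could make before invoking .load().
has_load = callable(getattr(sd_loader, "load", None))
print("loader exposes load():", has_load)
```

A check like the last one only makes the mismatch visible earlier; whether the dict return is intended for this checkpoint type is something the linked source lines would have to confirm.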

liligwu avatar Apr 07 '23 20:04 liligwu

@jeffra, would you please look at this issue? Thank you.

liligwu avatar Apr 13 '23 19:04 liligwu

@liligwu - can you confirm if you are still hitting this issue with the latest DeepSpeed/ROCm/transformers?

loadams avatar Aug 14 '23 19:08 loadams

Hi @liligwu - following up on this issue, are you still hitting this?

loadams avatar Aug 30 '23 21:08 loadams

Hi @loadams, it has been a while. Please give me some time to confirm whether the issue persists.

liligwu avatar Aug 31 '23 14:08 liligwu

Thanks @liligwu - no rush. I just wanted to make sure that, if this still needs review, we get it completed and merged in, or otherwise closed. Thanks again!

loadams avatar Aug 31 '23 15:08 loadams

Hi @liligwu - following up on this: have you had time to look at it?

loadams avatar Oct 20 '23 16:10 loadams

@liligwu - closing this for now - if you have updates on this, please comment and we will re-open it. Or if anyone else has a similar issue and sees this, please open a new issue and link this one and we would be happy to take a look.

loadams avatar Jan 05 '24 22:01 loadams