DeepSpeed
[BUG] state dict loading issue when running an example in https://github.com/huggingface/transformers-bloom-inference/tree/main/bloom-inference-scripts#run
Describe the bug
see the following error message when running the example in https://github.com/huggingface/transformers-bloom-inference/tree/main/bloom-inference-scripts#run
deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-fp16
│ /opt/conda/lib/python3.8/site-packages/deepspeed/inference/engine.py:388 in _load_checkpoint │
│ │
│ 385 │ │ else: │
│ 386 │ │ │ mp_rank = 0 if self.mpu is None else self.mpu.get_model_parallel_rank() │
│ 387 │ │ │ │
│ ❱ 388 │ │ │ load_path, checkpoint, quantize_config = sd_loader.load(self._config.tensor_ │
│ 389 │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ mp_rank, │
│ 390 │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ is_pipe_parallel=is_ │
│ 391 │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ quantize=(self._conf │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'dict' object has no attribute 'load'
To Reproduce
Steps to reproduce the behavior:
- switch `kernel_inject` to `False` in https://github.com/huggingface/transformers-bloom-inference/blob/7bea3526d8270b4aeeefecc57d7d7d638e2bbe0e/bloom-inference-scripts/bloom-ds-inference.py#L122
- run `deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py --name microsoft/bloom-deepspeed-inference-fp16` in https://github.com/huggingface/transformers-bloom-inference/tree/main/bloom-inference-scripts
Docker context: rocm/pytorch:latest
Additional context
Running on ROCm, but this issue shouldn't be hardware-dependent. According to https://github.com/microsoft/DeepSpeed/blob/5f5cc82415b11ab2c9bf85969deb510fe8631446/deepspeed/inference/engine.py#L433 and https://github.com/microsoft/DeepSpeed/blob/5f5cc82415b11ab2c9bf85969deb510fe8631446/deepspeed/runtime/state_dict_factory.py#L37, `sd_loader` is a dict and has no `load` method.
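The failure mode can be sketched in isolation: a value that is sometimes a loader object and sometimes a plain dict will raise `AttributeError` as soon as `.load()` is called on the dict variant. The sketch below is hypothetical (`SDLoader` and `load_checkpoint` are stand-ins, not DeepSpeed's actual classes); it just illustrates the mismatch and the kind of `isinstance` guard that would turn it into a clearer error.

```python
class SDLoader:
    """Stand-in for a state-dict loader object (hypothetical)."""

    def load(self, mp_world_size, mp_rank):
        # A real loader would read and merge checkpoint shards here.
        return f"loaded shard {mp_rank}/{mp_world_size}"


def load_checkpoint(sd_loader, mp_world_size, mp_rank):
    # Guard: only call .load() when we actually hold a loader object.
    # Without this check, a checkpoint-description dict slipping through
    # produces: AttributeError: 'dict' object has no attribute 'load'
    if isinstance(sd_loader, dict):
        raise TypeError(
            "sd_loader is a checkpoint dict, not a loader object; "
            "got keys: %s" % sorted(sd_loader)
        )
    return sd_loader.load(mp_world_size, mp_rank)


# The happy path works with a loader object...
print(load_checkpoint(SDLoader(), 8, 0))
# ...while a dict now fails with an explicit TypeError instead of the
# opaque AttributeError from the traceback above.
```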
@jeffra, would you please look at this issue? Thank you.
@liligwu - can you confirm if you are still hitting this issue with the latest DeepSpeed/ROCm/transformers?
Hi @liligwu - following up on this issue, are you still hitting this?
Hi @loadams, it has been a while. Please give me some time to confirm whether the issue persists.
Thanks @liligwu - no rush, just wanted to check whether this still needs review and completion so we can get it merged in, or otherwise closed. Thanks again!
Hi @liligwu - following up to see whether you've had time to look at this.
@liligwu - closing this for now - if you have updates on this, please comment and we will re-open it. Or if anyone else has a similar issue and sees this, please open a new issue and link this one and we would be happy to take a look.