[Bug] DeepSpeed Inference Does Not Work with LLaMA (Latest Version)
Version
deepspeed: 0.13.4
transformers: 4.38.1
Python: 3.10
PyTorch: 2.1.2+cu121
CUDA: 12.1
Error in Example (To reproduce)
Simply run this script: https://github.com/microsoft/DeepSpeedExamples/blob/master/inference/huggingface/text-generation/inference-test.py
deepspeed --num_gpus 8 inference-test.py --model meta-llama/Llama-2-7b-hf --use_kernel
It will show the following error:
Traceback (most recent call last):
File "/root/DeepSpeedExamples/inference/huggingface/text-generation/inference-test.py", line 82, in <module>
outputs = pipe(inputs,
File "/root/DeepSpeedExamples/inference/huggingface/text-generation/utils.py", line 71, in __call__
outputs = self.generate_outputs(input_list, num_tokens=num_tokens, do_sample=do_sample)
File "/root/DeepSpeedExamples/inference/huggingface/text-generation/utils.py", line 119, in generate_outputs
outputs = self.model.generate(input_tokens.input_ids, **generate_kwargs)
File "/miniconda/envs/py310/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 636, in _generate
return self.module.generate(*inputs, **kwargs)
File "/miniconda/envs/py310/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/miniconda/envs/py310/lib/python3.10/site-packages/transformers/generation/utils.py", line 1592, in generate
return self.sample(
File "/miniconda/envs/py310/lib/python3.10/site-packages/transformers/generation/utils.py", line 2693, in sample
model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
File "/miniconda/envs/py310/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1256, in prepare_inputs_for_generation
if past_key_value := getattr(self.model.layers[0].self_attn, "past_key_value", None):
File "/miniconda/envs/py310/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'DeepSpeedGPTInference' object has no attribute 'self_attn'
Potential bug?
I suspect it did not select the right inference module: it should be DeepSpeedLlamaInference, not DeepSpeedGPTInference.
It would be nice if someone could tell me which version works.
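For anyone else debugging this, here is a minimal sketch to inspect which container DeepSpeed injected into the model (assuming the deepspeed 0.13.x init_inference keyword API; tp_size=1 is just for illustration):

import deepspeed
import torch
from transformers import AutoModelForCausalLM

# Load the model and apply kernel injection, roughly as inference-test.py does.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16
)
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 1},
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

# Print the module that replaced the decoder layers; per the traceback above,
# this shows DeepSpeedGPTInference rather than a LLaMA-specific container.
print(type(engine.module.model.layers[0]))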
Hi @allanj I don't think we have kernel injection support for Llama-2 models. If you remove the --use_kernel flag, does the script work?
Additionally, what kind of GPUs are you using? You may be able to utilize DeepSpeed-MII to run the llama-2 model and get significant improvements to inference performance if you have GPUs with compute capability >=8.0:
import mii

# Start a persistent FastGen server, sharding the model across 8 GPUs.
client = mii.serve("meta-llama/Llama-2-7b-hf", tensor_parallel=8)

# Query the server and get the generated text back.
response = client("test prompt")
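To check whether your GPUs clear that threshold, a quick PyTorch check:

import torch

# (8, 0) or higher (e.g. A100, H100) meets the compute-capability requirement.
print(torch.cuda.get_device_capability(0))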
Yes, removing the --use_kernel flag makes it work.
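That is, the same invocation without kernel injection runs:

deepspeed --num_gpus 8 inference-test.py --model meta-llama/Llama-2-7b-hf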
Yeah, I'm aware of DeepSpeed FastGen. I'm wondering how it handles batching: does it support batched prompts, or should I simply loop over them?
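For reference, the MII client's generate call accepts a list of prompts, so a manual loop shouldn't be necessary. A sketch based on the MII README (max_new_tokens is an assumed example value):

import mii

client = mii.serve("meta-llama/Llama-2-7b-hf", tensor_parallel=8)

# Pass a list of prompts; the FastGen server batches them internally
# (continuous batching), so no client-side loop is required.
responses = client.generate(
    ["test prompt one", "test prompt two"], max_new_tokens=64
)
print(responses)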