text-generation-inference
Cannot serve llama-30b on T4 GPUs, but can serve llama-7b with the same code
System Info
CentOS 7, Docker 23.0.5, 8x T4 GPUs, NVIDIA driver 515.65.1.
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
First, I save the pretrained checkpoint locally:
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
from transformers import LlamaTokenizer

model_name = 'decapoda-research/llama-30b-hf'
save_path = '/data/llama_30b'

# Download the config, weights, and tokenizer from the Hub
config = AutoConfig.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype='auto', config=config)
tokenizer = LlamaTokenizer.from_pretrained(model_name)

# Save everything to the local directory that will be mounted into the container
config.save_pretrained(save_path)
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

# Reload and re-save the tokenizer from the local path as a sanity check
tokenizer = AutoTokenizer.from_pretrained(save_path)
tokenizer.save_pretrained(save_path)
tokenizer = AutoTokenizer.from_pretrained(save_path)
Then I launch the server:
docker run --gpus all --shm-size 64g -p 8080:80 -v /data:/data ghcr.io/huggingface/text-generation-inference:0.7 --num-shard 8 --model-id llama_30b
It starts correctly, and then I call it locally:
curl 127.0.0.1:8080/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' -H 'Content-Type: application/json'
I get an error message like this:
Expected behavior
I expect it to return a result, as it does when I use the llama-7b model.
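For reference, a successful call to /generate with llama-7b returns a JSON body of roughly this shape (a sketch of the response format; the exact generated text varies with the model and sampling settings):
{"generated_text":" Deep Learning is a subset of machine learning that uses neural networks..."}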
We encountered exactly the same problem and reproduced it in three different environments. The issue seems to appear only with the llama-30b model:
Ubuntu 20.04 / Nvidia Driver 525.105.17 / Image Version 0.8.0 / A100 * 8
Ubuntu 20.04 / Nvidia Driver 525.105.17 / Image Version 0.8.0 / T4 * 8
CentOS 7 / Nvidia Driver 470.82.01 / Image Version 0.8.0 / T4 * 8
I think I found the problem. The llama-30b model has num_heads = 52, which is not divisible by 8, so it naturally cannot use shard = 8 for tensor-parallel inference.
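As a quick sanity check before launching, one could verify that the shard count divides the model's attention-head count. This is a minimal sketch, assuming the standard Hugging Face LLaMA config field num_attention_heads and the local checkpoint path from the reproduction above:
from transformers import AutoConfig

config = AutoConfig.from_pretrained('/data/llama_30b')  # local checkpoint saved earlier
num_heads = config.num_attention_heads  # 52 for llama-30b

for num_shard in (2, 4, 8):
    divisible = num_heads % num_shard == 0
    print(f"num_shard={num_shard}: {'ok' if divisible else 'not divisible'} "
          f"({num_heads} % {num_shard} = {num_heads % num_shard})")
# For llama-30b this prints ok for 2 and 4 but not for 8.
Relaunching with a shard count that divides 52 (e.g. --num-shard 4) should therefore avoid this particular failure.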
Thanks for the investigation! Very weird that this doesn't crash earlier; I will look into it a bit.
Solved with the new loading logic.