text-generation-inference
Cannot serve llama-30b on T4 GPUs, but can serve llama-7b with the same code
System Info
CentOS 7, Docker 23.0.5, 8x T4 GPUs, NVIDIA driver 515.65.1.
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
First, I save the pretrained checkpoint locally:
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
from transformers import LlamaTokenizer

model_name = 'decapoda-research/llama-30b-hf'
save_path = '/data/llama_30b'

# Download the config, weights, and tokenizer from the Hub
config = AutoConfig.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype='auto', config=config)
tokenizer = LlamaTokenizer.from_pretrained(model_name)

# Save everything to the local directory that will be mounted into the container
config.save_pretrained(save_path)
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

# Reload and re-save the tokenizer from the local path as a sanity check
tokenizer = AutoTokenizer.from_pretrained(save_path)
tokenizer.save_pretrained(save_path)
tokenizer = AutoTokenizer.from_pretrained(save_path)
Then I launch the server:
docker run --gpus all --shm-size 64g -p 8080:80 -v /data:/data ghcr.io/huggingface/text-generation-inference:0.7 --num-shard 8 --model-id llama_30b
It starts correctly, and then I call it locally:
curl 127.0.0.1:8080/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' -H 'Content-Type: application/json'
I get an error message like this:
Expected behavior
I expect it to return a result, as it does when I use the llama-7b model.
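For reference, a successful call to /generate with llama-7b returns a JSON body of roughly this shape (a sketch of the response format; the exact generated text varies with the model and sampling settings):
{"generated_text":" Deep Learning is a subset of machine learning that uses neural networks..."}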
We encountered exactly the same problem and reproduced it in three different environments. The issue seems to appear only with the llama-30b model:
Ubuntu 20.04 / Nvidia Driver 525.105.17 / Image Version 0.8.0 / A100 * 8
Ubuntu 20.04 / Nvidia Driver 525.105.17 / Image Version 0.8.0 / T4 * 8
CentOS 7 / Nvidia Driver 470.82.01 / Image Version 0.8.0 / T4 * 8
I think I found the problem. The llama-30b model has num_heads = 52, which is not divisible by 8, so it naturally cannot use shard = 8 for tensor-parallel inference.
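As a quick sanity check before launching, one could verify that the shard count divides the model's attention-head count. This is a minimal sketch, assuming the standard Hugging Face LLaMA config field num_attention_heads and the local checkpoint path from the reproduction above:
from transformers import AutoConfig

config = AutoConfig.from_pretrained('/data/llama_30b')  # local checkpoint saved earlier
num_heads = config.num_attention_heads  # 52 for llama-30b

for num_shard in (2, 4, 8):
    divisible = num_heads % num_shard == 0
    print(f"num_shard={num_shard}: {'ok' if divisible else 'not divisible'} "
          f"({num_heads} % {num_shard} = {num_heads % num_shard})")
# For llama-30b this prints ok for 2 and 4 but not for 8.
Relaunching with a shard count that divides 52 (e.g. --num-shard 4) should therefore avoid this particular failure.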
Thanks for the investigation! Very weird that this doesn't crash earlier; I will look into it a bit.
Solved with the new loading logic.