
Cannot serve llama-30b on T4 GPUs, but can serve llama-7b with the same code

CoinCheung opened this issue 2 years ago • 3 comments

System Info

CentOS 7 with Docker 23.0.5; 8× T4 GPUs with driver 515.65.1.

Information

  • [X] Docker
  • [ ] The CLI directly

Tasks

  • [X] An officially supported command
  • [ ] My own modifications

Reproduction

First, I download the pretrained checkpoint and save it locally:

from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM, LlamaTokenizer

model_name = 'decapoda-research/llama-30b-hf'
save_path = '/data/llama_30b'

# Download the model, config, and tokenizer from the Hub
config = AutoConfig.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype='auto', config=config)
tokenizer = LlamaTokenizer.from_pretrained(model_name)

# Save everything locally so the container can mount it from /data
config.save_pretrained(save_path)
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)

# Round-trip the tokenizer through AutoTokenizer so the saved files
# resolve to the right tokenizer class
tokenizer = AutoTokenizer.from_pretrained(save_path)
tokenizer.save_pretrained(save_path)
tokenizer = AutoTokenizer.from_pretrained(save_path)
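
As a quick sanity check before serving (a minimal sketch, not part of the original report), the saved checkpoint can be reloaded and queried once; this needs enough CPU RAM for the 30B weights:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

save_path = '/data/llama_30b'

# Reload the locally saved checkpoint to confirm it is self-contained
tokenizer = AutoTokenizer.from_pretrained(save_path)
model = AutoModelForCausalLM.from_pretrained(save_path, torch_dtype='auto')

# One short generation confirms the weights and tokenizer agree
inputs = tokenizer("What is Deep Learning?", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=17)
print(tokenizer.decode(out[0], skip_special_tokens=True))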

Then I launch the server:

docker run --gpus all --shm-size 64g -p 8080:80 -v /data:/data ghcr.io/huggingface/text-generation-inference:0.7 --num-shard 8 --model-id llama_30b

It starts correctly, and then I call it locally:

curl 127.0.0.1:8080/generate -X POST -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' -H 'Content-Type: application/json'

I got an error message like this: [error screenshot]

Expected behavior

I expect it to return a result, as it does when I use the llama-7b model.
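
For reference, here is the same request from Python, as a minimal sketch assuming the server above is listening on port 8080; a successful call returns JSON with a generated_text field:

import requests

# Same payload as the curl command above; a healthy server answers
# with JSON like {"generated_text": "..."}
resp = requests.post(
    "http://127.0.0.1:8080/generate",
    json={"inputs": "What is Deep Learning?", "parameters": {"max_new_tokens": 17}},
)
resp.raise_for_status()
print(resp.json()["generated_text"])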

CoinCheung, Jun 02 '23 01:06

We also encountered exactly the same problem and tested it in three different environments. The issue seems to appear only with the llama 30B model:

  • Ubuntu 20.04 / Nvidia Driver 525.105.17 / Image Version 0.8.0 / A100 * 8
  • Ubuntu 20.04 / Nvidia Driver 525.105.17 / Image Version 0.8.0 / T4 * 8
  • CentOS 7 / Nvidia Driver 470.82.01 / Image Version 0.8.0 / T4 * 8

pandazki, Jun 02 '23 03:06

I think I found the problem. The Llama 30B model has num_heads = 52, which is not divisible by 8, so the attention heads cannot be split evenly across 8 shards for tensor-parallel inference.
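
As a minimal sketch of that check (assuming the checkpoint saved to /data/llama_30b above), the head count can be read from the model config before choosing a shard count:

from transformers import AutoConfig

# Pre-flight check: attention heads must split evenly across shards
config = AutoConfig.from_pretrained('/data/llama_30b')
num_heads = config.num_attention_heads  # 52 for llama-30b
for num_shard in (8, 4, 2):
    ok = num_heads % num_shard == 0
    print(f"num_shard={num_shard}: {'ok' if ok else 'not divisible'}")

Until the loader supports uneven splits, a shard count that divides 52 (e.g. --num-shard 4) should sidestep the failure.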

pandazki, Jun 02 '23 04:06

The Llama 30B model has num_heads = 52, which is not divisible by 8, so the attention heads cannot be split evenly across 8 shards for tensor-parallel inference.

Thanks for the investigation! Very weird that this doesn't crash earlier; I will look into it a bit.

OlivierDehaene, Jun 02 '23 08:06

Solved with the new loading logic.

OlivierDehaene, Jun 13 '23 15:06