text-generation-inference
TGI consumes all available GPU memory when loading a model
System Info
Environments
>> cat docker-compose.yml
version: '3.8'
services:
  llama2_api:
    image: ghcr.io/huggingface/text-generation-inference:1.4
    container_name: llama2_api
    command: --model-id /data/llama2/llama2-chat-13b-hf
    volumes:
      - /data/wanghui01/models/:/data/
    ports:
      - "8081:80"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: "0,1,2,3,4,5,6,7"
    shm_size: 1g
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
>> curl 127.0.0.1:8081/info | jq
{
  "model_id": "/data/llama2/llama2-chat-13b-hf",
  "model_sha": null,
  "model_dtype": "torch.float16",
  "model_device_type": "cuda",
  "model_pipeline_tag": null,
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_length": 1024,
  "max_total_tokens": 2048,
  "waiting_served_ratio": 1.2,
  "max_batch_total_tokens": 342560,
  "max_waiting_tokens": 20,
  "max_batch_size": null,
  "validation_workers": 2,
  "version": "1.4.4",
  "sha": "6c4496a1a30f119cebd3afbfedd847039325dbc9",
  "docker_label": "sha-6c4496a"
}
>> docker exec f4f ls -lh /data/llama2/llama2-chat-13b-hf
total 49G
-rw-r--r-- 1 root root 638 Feb 5 01:49 config.json
-rw-r--r-- 1 root root 111 Feb 5 01:49 generation_config.json
-rw-r--r-- 1 root root 4.7G Mar 27 07:11 model-00001-of-00006.safetensors
-rw-r--r-- 1 root root 4.7G Mar 27 07:12 model-00002-of-00006.safetensors
-rw-r--r-- 1 root root 4.7G Mar 27 07:12 model-00003-of-00006.safetensors
-rw-r--r-- 1 root root 4.6G Mar 27 07:11 model-00004-of-00006.safetensors
-rw-r--r-- 1 root root 4.6G Mar 27 07:11 model-00005-of-00006.safetensors
-rw-r--r-- 1 root root 1.2G Mar 27 07:12 model-00006-of-00006.safetensors
-rw-r--r-- 1 root root 4.7G Feb 5 01:50 pytorch_model-00001-of-00006.bin
-rw-r--r-- 1 root root 4.7G Feb 5 01:50 pytorch_model-00002-of-00006.bin
-rw-r--r-- 1 root root 4.7G Feb 5 01:50 pytorch_model-00003-of-00006.bin
-rw-r--r-- 1 root root 4.6G Feb 5 01:50 pytorch_model-00004-of-00006.bin
-rw-r--r-- 1 root root 4.6G Feb 5 01:50 pytorch_model-00005-of-00006.bin
-rw-r--r-- 1 root root 1.2G Feb 5 01:50 pytorch_model-00006-of-00006.bin
-rw-r--r-- 1 root root 30K Feb 5 01:50 pytorch_model.bin.index.json
-rw-r--r-- 1 root root 414 Feb 5 01:48 special_tokens_map.json
-rw-r--r-- 1 root root 1.8M Feb 5 01:48 tokenizer.json
-rw-r--r-- 1 root root 489K Feb 5 01:48 tokenizer.model
-rw-r--r-- 1 root root 932 Feb 5 01:48 tokenizer_config.json
>> nvidia-smi
Sun Apr 28 13:51:36 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06 Driver Version: 545.29.06 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-PCIE-40GB Off | 00000000:0B:00.0 Off | 0 |
| N/A 31C P0 35W / 250W | 37625MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-PCIE-40GB Off | 00000000:0C:00.0 Off | 0 |
| N/A 31C P0 34W / 250W | 37633MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-PCIE-40GB Off | 00000000:0F:00.0 Off | 0 |
| N/A 31C P0 34W / 250W | 37633MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-PCIE-40GB Off | 00000000:14:00.0 Off | 0 |
| N/A 29C P0 36W / 250W | 37633MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-PCIE-40GB Off | 00000000:15:00.0 Off | 0 |
| N/A 30C P0 35W / 250W | 37633MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-PCIE-40GB Off | 00000000:18:00.0 Off | 0 |
| N/A 31C P0 36W / 250W | 37633MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-PCIE-40GB Off | 00000000:1C:00.0 Off | 0 |
| N/A 31C P0 38W / 250W | 37633MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-PCIE-40GB Off | 00000000:24:00.0 Off | 0 |
| N/A 30C P0 37W / 250W | 37593MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 303244 C /opt/conda/bin/python3.10 37612MiB |
| 1 N/A N/A 303245 C /opt/conda/bin/python3.10 37620MiB |
| 2 N/A N/A 303248 C /opt/conda/bin/python3.10 37620MiB |
| 3 N/A N/A 303252 C /opt/conda/bin/python3.10 37620MiB |
| 4 N/A N/A 303251 C /opt/conda/bin/python3.10 37620MiB |
| 5 N/A N/A 303256 C /opt/conda/bin/python3.10 37620MiB |
| 6 N/A N/A 303254 C /opt/conda/bin/python3.10 37620MiB |
| 7 N/A N/A 303260 C /opt/conda/bin/python3.10 37580MiB |
+---------------------------------------------------------------------------------------+
When I load the same model with transformers alone, GPU memory usage looks normal.
>> cat demo.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig
model_path = "/data/wanghui01/models/llama2/llama2-chat-13b-hf/"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
model.generation_config = GenerationConfig.from_pretrained(model_path)
input("press any key to continue...")
>> python demo.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:05<00:00, 1.03it/s]
press any key to continue...
>> nvidia-smi
Sun Apr 28 13:56:31 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06 Driver Version: 545.29.06 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-PCIE-40GB Off | 00000000:0B:00.0 Off | 0 |
| N/A 30C P0 35W / 250W | 3179MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-PCIE-40GB Off | 00000000:0C:00.0 Off | 0 |
| N/A 30C P0 34W / 250W | 4083MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-PCIE-40GB Off | 00000000:0F:00.0 Off | 0 |
| N/A 31C P0 34W / 250W | 4083MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-PCIE-40GB Off | 00000000:14:00.0 Off | 0 |
| N/A 29C P0 36W / 250W | 4083MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-PCIE-40GB Off | 00000000:15:00.0 Off | 0 |
| N/A 30C P0 35W / 250W | 4083MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-PCIE-40GB Off | 00000000:18:00.0 Off | 0 |
| N/A 30C P0 36W / 250W | 4083MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-PCIE-40GB Off | 00000000:1C:00.0 Off | 0 |
| N/A 31C P0 38W / 250W | 4083MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-PCIE-40GB Off | 00000000:24:00.0 Off | 0 |
| N/A 30C P0 37W / 250W | 741MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 338969 C python 3166MiB |
| 1 N/A N/A 338969 C python 4070MiB |
| 2 N/A N/A 338969 C python 4070MiB |
| 3 N/A N/A 338969 C python 4070MiB |
| 4 N/A N/A 338969 C python 4070MiB |
| 5 N/A N/A 338969 C python 4070MiB |
| 6 N/A N/A 338969 C python 4070MiB |
| 7 N/A N/A 338969 C python 728MiB |
+---------------------------------------------------------------------------------------+
I'm confused about what could be causing this. Any advice would be appreciated.
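One possible explanation can be checked with arithmetic on the numbers already posted: the `/info` output reports `max_batch_total_tokens: 342560`, and if TGI reserves key/value cache for that many tokens at startup, the cache alone would account for most of the memory. The sketch below is only an estimate under stated assumptions. The layer count and hidden size are the commonly published values for Llama-2-13B and do not appear in this issue, and the even split across 8 shards is inferred from the 8 separate processes in `nvidia-smi`.

```python
# Rough per-GPU memory estimate for TGI, assuming the KV cache is
# pre-allocated for max_batch_total_tokens and split evenly across shards.

NUM_LAYERS = 40        # ASSUMPTION: Llama-2-13B num_hidden_layers (not in the issue)
HIDDEN_SIZE = 5120     # ASSUMPTION: Llama-2-13B hidden_size (not in the issue)
BYTES_PER_VALUE = 2    # float16, per "model_dtype" in the /info output
NUM_GPUS = 8           # 8 TGI processes appear in nvidia-smi

MAX_BATCH_TOTAL_TOKENS = 342560         # from the /info output
SHARD_GIB = [4.7, 4.7, 4.7, 4.6, 4.6, 1.2]  # safetensors sizes from `ls -lh`


def kv_bytes_per_token(num_layers: int, hidden_size: int, bytes_per_value: int) -> int:
    """One key vector and one value vector are cached per layer per token."""
    return 2 * num_layers * hidden_size * bytes_per_value


def estimate_per_gpu_mib(tokens: int, weights_gib: float, num_gpus: int) -> float:
    """KV cache plus weights, divided evenly over the GPUs, in MiB."""
    kv_mib = tokens * kv_bytes_per_token(NUM_LAYERS, HIDDEN_SIZE, BYTES_PER_VALUE) / 2**20
    weights_mib = weights_gib * 1024
    return (kv_mib + weights_mib) / num_gpus


if __name__ == "__main__":
    estimate = estimate_per_gpu_mib(MAX_BATCH_TOTAL_TOKENS, sum(SHARD_GIB), NUM_GPUS)
    print(f"estimated per-GPU usage: {round(estimate)} MiB")  # about 36589 MiB
```

The estimate lands within roughly 1 GiB of the 37612 to 37620 MiB reported per process, with the remainder plausibly being CUDA context and workspace memory. If this explanation is right, the usage is a deliberate reservation rather than a leak, and the launcher's `--cuda-memory-fraction` and `--max-batch-total-tokens` options would be the settings to look at.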
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
- Create docker-compose.yml
>> cat docker-compose.yml
version: '3.8'
services:
  llama2_api:
    image: ghcr.io/huggingface/text-generation-inference:1.4
    container_name: llama2_api
    command: --model-id /data/llama2/llama2-chat-13b-hf
    volumes:
      - /data/wanghui01/models/:/data/
    ports:
      - "8081:80"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: "0,1,2,3,4,5,6,7"
    shm_size: 1g
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
- Run the container and check the memory
>> docker compose up llama2_api -d
[+] Running 1/1
✔ Container llama2_api Started
>> nvidia-smi
Expected behavior
- GPU memory usage should be similar to that of loading the model with transformers.
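To put a number on this expectation, the figures posted in the System Info section can be totaled. The sketch below uses only the per-process values from the two `nvidia-smi` tables and the safetensors shard sizes from `ls -lh`. The variable and function names are illustrative, and treating the `ls -lh` sizes as GiB is an assumption.

```python
# Compare total GPU memory under transformers and under TGI with the
# on-disk size of the weights. All figures are copied from the issue.

TRANSFORMERS_MIB = [3166, 4070, 4070, 4070, 4070, 4070, 4070, 728]
TGI_MIB = [37612, 37620, 37620, 37620, 37620, 37620, 37620, 37580]
SHARD_GIB = [4.7, 4.7, 4.7, 4.6, 4.6, 1.2]  # ASSUMPTION: `ls -lh` "G" means GiB


def total_gib(per_gpu_mib: list[int]) -> float:
    """Sum per-GPU usage in MiB and convert to GiB."""
    return sum(per_gpu_mib) / 1024


if __name__ == "__main__":
    weights = sum(SHARD_GIB)
    print(f"weights on disk:    {weights:6.1f} GiB")
    print(f"transformers total: {total_gib(TRANSFORMERS_MIB):6.1f} GiB")
    print(f"TGI total:          {total_gib(TGI_MIB):6.1f} GiB")
```

With transformers the total is only a few GiB above the weights themselves, so the expected baseline is essentially "weights plus per-GPU CUDA overhead". The TGI total is roughly ten times larger, which suggests that the difference is memory reserved for something other than weights.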
Facing the same issue. Any update on why this is happening?
Facing the same issue. This comment helped: https://github.com/huggingface/text-generation-inference/issues/1300#issuecomment-1859867587
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.