text-generation-inference: TGI consumes all available GPU memory when loading a model

System Info

Environments

>> cat docker-compose.yml 
version: '3.8'
services:
  llama2_api:
    image: ghcr.io/huggingface/text-generation-inference:1.4
    container_name: llama2_api
    command: --model-id /data/llama2/llama2-chat-13b-hf 
    volumes:
      - /data/wanghui01/models/:/data/
    ports:
      - "8081:80"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: "0,1,2,3,4,5,6,7"
    shm_size: 1g
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

>> curl 127.0.0.1:8081/info | jq
{
  "model_id": "/data/llama2/llama2-chat-13b-hf",
  "model_sha": null,
  "model_dtype": "torch.float16",
  "model_device_type": "cuda",
  "model_pipeline_tag": null,
  "max_concurrent_requests": 128,
  "max_best_of": 2,
  "max_stop_sequences": 4,
  "max_input_length": 1024,
  "max_total_tokens": 2048,
  "waiting_served_ratio": 1.2,
  "max_batch_total_tokens": 342560,
  "max_waiting_tokens": 20,
  "max_batch_size": null,
  "validation_workers": 2,
  "version": "1.4.4",
  "sha": "6c4496a1a30f119cebd3afbfedd847039325dbc9",
  "docker_label": "sha-6c4496a"
}

>> docker exec f4f ls -lh /data/llama2/llama2-chat-13b-hf
total 49G
-rw-r--r-- 1 root root  638 Feb  5 01:49 config.json
-rw-r--r-- 1 root root  111 Feb  5 01:49 generation_config.json
-rw-r--r-- 1 root root 4.7G Mar 27 07:11 model-00001-of-00006.safetensors
-rw-r--r-- 1 root root 4.7G Mar 27 07:12 model-00002-of-00006.safetensors
-rw-r--r-- 1 root root 4.7G Mar 27 07:12 model-00003-of-00006.safetensors
-rw-r--r-- 1 root root 4.6G Mar 27 07:11 model-00004-of-00006.safetensors
-rw-r--r-- 1 root root 4.6G Mar 27 07:11 model-00005-of-00006.safetensors
-rw-r--r-- 1 root root 1.2G Mar 27 07:12 model-00006-of-00006.safetensors
-rw-r--r-- 1 root root 4.7G Feb  5 01:50 pytorch_model-00001-of-00006.bin
-rw-r--r-- 1 root root 4.7G Feb  5 01:50 pytorch_model-00002-of-00006.bin
-rw-r--r-- 1 root root 4.7G Feb  5 01:50 pytorch_model-00003-of-00006.bin
-rw-r--r-- 1 root root 4.6G Feb  5 01:50 pytorch_model-00004-of-00006.bin
-rw-r--r-- 1 root root 4.6G Feb  5 01:50 pytorch_model-00005-of-00006.bin
-rw-r--r-- 1 root root 1.2G Feb  5 01:50 pytorch_model-00006-of-00006.bin
-rw-r--r-- 1 root root  30K Feb  5 01:50 pytorch_model.bin.index.json
-rw-r--r-- 1 root root  414 Feb  5 01:48 special_tokens_map.json
-rw-r--r-- 1 root root 1.8M Feb  5 01:48 tokenizer.json
-rw-r--r-- 1 root root 489K Feb  5 01:48 tokenizer.model
-rw-r--r-- 1 root root  932 Feb  5 01:48 tokenizer_config.json

>> nvidia-smi 
Sun Apr 28 13:51:36 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off | 00000000:0B:00.0 Off |                    0 |
| N/A   31C    P0              35W / 250W |  37625MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          Off | 00000000:0C:00.0 Off |                    0 |
| N/A   31C    P0              34W / 250W |  37633MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCIE-40GB          Off | 00000000:0F:00.0 Off |                    0 |
| N/A   31C    P0              34W / 250W |  37633MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PCIE-40GB          Off | 00000000:14:00.0 Off |                    0 |
| N/A   29C    P0              36W / 250W |  37633MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-PCIE-40GB          Off | 00000000:15:00.0 Off |                    0 |
| N/A   30C    P0              35W / 250W |  37633MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-PCIE-40GB          Off | 00000000:18:00.0 Off |                    0 |
| N/A   31C    P0              36W / 250W |  37633MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-PCIE-40GB          Off | 00000000:1C:00.0 Off |                    0 |
| N/A   31C    P0              38W / 250W |  37633MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-PCIE-40GB          Off | 00000000:24:00.0 Off |                    0 |
| N/A   30C    P0              37W / 250W |  37593MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    303244      C   /opt/conda/bin/python3.10                 37612MiB |
|    1   N/A  N/A    303245      C   /opt/conda/bin/python3.10                 37620MiB |
|    2   N/A  N/A    303248      C   /opt/conda/bin/python3.10                 37620MiB |
|    3   N/A  N/A    303252      C   /opt/conda/bin/python3.10                 37620MiB |
|    4   N/A  N/A    303251      C   /opt/conda/bin/python3.10                 37620MiB |
|    5   N/A  N/A    303256      C   /opt/conda/bin/python3.10                 37620MiB |
|    6   N/A  N/A    303254      C   /opt/conda/bin/python3.10                 37620MiB |
|    7   N/A  N/A    303260      C   /opt/conda/bin/python3.10                 37580MiB |
+---------------------------------------------------------------------------------------+
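
A back-of-the-envelope check suggests that most of this usage is a preallocated KV cache rather than model weights. As far as is known, TGI runs a warmup at startup and sizes max_batch_total_tokens to fill the GPU memory that remains after the weights are loaded. The sketch below estimates the per-GPU memory needed for the 342560 tokens reported by /info. The layer count and hidden size are assumptions taken from the standard Llama-2-13B configuration (they are not shown in this post), and the even split of cache and weights across 8 shards is also an assumption.

```python
# Values reported in this post.
MAX_BATCH_TOTAL_TOKENS = 342560                      # from the /info output
SAFETENSORS_GIB = [4.7, 4.7, 4.7, 4.6, 4.6, 1.2]     # from the ls -lh output
NUM_GPUS = 8                                         # one shard per visible GPU (assumption)

# Assumed Llama-2-13B architecture (standard config, not shown in the post).
NUM_LAYERS = 40
HIDDEN_SIZE = 5120       # 40 heads x 128 head dim, no grouped-query attention
BYTES_PER_VALUE = 2      # float16, matching "model_dtype" in /info

MIB = 1024 ** 2

# Each token stores one key vector and one value vector in every layer.
kv_bytes_per_token = 2 * NUM_LAYERS * HIDDEN_SIZE * BYTES_PER_VALUE

# With tensor parallelism, each shard holds its share of the attention heads.
kv_per_gpu_mib = kv_bytes_per_token * MAX_BATCH_TOTAL_TOKENS / NUM_GPUS / MIB

# Weights are split across shards as well.
weights_per_gpu_mib = sum(SAFETENSORS_GIB) * 1024 / NUM_GPUS

estimated_per_gpu_mib = kv_per_gpu_mib + weights_per_gpu_mib

print(f"KV cache per token: {kv_bytes_per_token} bytes")
print(f"KV cache per GPU:   {kv_per_gpu_mib:.0f} MiB")
print(f"Weights per GPU:    {weights_per_gpu_mib:.0f} MiB")
print(f"Estimated per GPU:  {estimated_per_gpu_mib:.0f} MiB")
```

Under these assumptions the cache alone comes to about 32.7 GiB per GPU, and the estimate with weights is roughly 36,600 MiB per GPU against the 37,620 MiB observed. The remainder is plausibly CUDA context and workspace memory.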

When I load the same model with transformers instead, GPU memory usage looks normal.

>> cat demo.py 
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_path = "/data/wanghui01/models/llama2/llama2-chat-13b-hf/"

# Load the tokenizer, then spread the bf16 weights across all visible GPUs.
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
model.generation_config = GenerationConfig.from_pretrained(model_path)
# Keep the process alive so GPU memory can be inspected with nvidia-smi.
input("press any key to continue...")

>> python demo.py 
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:05<00:00,  1.03it/s]
press any key to continue...

>> nvidia-smi 
Sun Apr 28 13:56:31 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.06              Driver Version: 545.29.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          Off | 00000000:0B:00.0 Off |                    0 |
| N/A   30C    P0              35W / 250W |   3179MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          Off | 00000000:0C:00.0 Off |                    0 |
| N/A   30C    P0              34W / 250W |   4083MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-PCIE-40GB          Off | 00000000:0F:00.0 Off |                    0 |
| N/A   31C    P0              34W / 250W |   4083MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-PCIE-40GB          Off | 00000000:14:00.0 Off |                    0 |
| N/A   29C    P0              36W / 250W |   4083MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-PCIE-40GB          Off | 00000000:15:00.0 Off |                    0 |
| N/A   30C    P0              35W / 250W |   4083MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-PCIE-40GB          Off | 00000000:18:00.0 Off |                    0 |
| N/A   30C    P0              36W / 250W |   4083MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-PCIE-40GB          Off | 00000000:1C:00.0 Off |                    0 |
| N/A   31C    P0              38W / 250W |   4083MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-PCIE-40GB          Off | 00000000:24:00.0 Off |                    0 |
| N/A   30C    P0              37W / 250W |    741MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    338969      C   python                                     3166MiB |
|    1   N/A  N/A    338969      C   python                                     4070MiB |
|    2   N/A  N/A    338969      C   python                                     4070MiB |
|    3   N/A  N/A    338969      C   python                                     4070MiB |
|    4   N/A  N/A    338969      C   python                                     4070MiB |
|    5   N/A  N/A    338969      C   python                                     4070MiB |
|    6   N/A  N/A    338969      C   python                                     4070MiB |
|    7   N/A  N/A    338969      C   python                                      728MiB |
+---------------------------------------------------------------------------------------+
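
Summing the per-process figures from the two nvidia-smi snapshots makes the gap concrete. The numbers below are copied from the tables in this post; nothing else is assumed. Note that the transformers script only loads the weights and never calls generate, so no KV cache exists at this point, whereas the TGI container is already set up to serve batches.

```python
# Per-GPU process memory in MiB, copied from the two nvidia-smi process tables.
tgi_mib = [37612, 37620, 37620, 37620, 37620, 37620, 37620, 37580]
hf_mib = [3166, 4070, 4070, 4070, 4070, 4070, 4070, 728]

tgi_total = sum(tgi_mib)
hf_total = sum(hf_mib)
gap = tgi_total - hf_total

print(f"TGI total:          {tgi_total} MiB ({tgi_total / 1024:.1f} GiB)")
print(f"transformers total: {hf_total} MiB ({hf_total / 1024:.1f} GiB)")
print(f"gap:                {gap} MiB ({gap / 1024:.1f} GiB)")
```

The transformers total of about 28 GiB is close to the size of the safetensors weights plus a CUDA context on each GPU, while the TGI total is about 294 GiB. The extra memory is therefore not weights.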

I'm confused about what could be causing this. Can anyone give me some advice?

Information

[X] Docker
[ ] The CLI directly

Tasks

[X] An officially supported command
[ ] My own modifications

Reproduction

Create docker-compose.yml

>> cat docker-compose.yml 
version: '3.8'
services:
  llama2_api:
    image: ghcr.io/huggingface/text-generation-inference:1.4
    container_name: llama2_api
    command: --model-id /data/llama2/llama2-chat-13b-hf 
    volumes:
      - /data/wanghui01/models/:/data/
    ports:
      - "8081:80"
    environment:
      NVIDIA_VISIBLE_DEVICES: all
      CUDA_VISIBLE_DEVICES: "0,1,2,3,4,5,6,7"
    shm_size: 1g
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Run the container and check GPU memory usage

>> docker compose up llama2_api -d 
[+] Running 1/1
 ✔ Container llama2_api  Started  

>> nvidia-smi

Expected behavior

GPU memory usage should be similar to what transformers uses when loading the same model.

Apr 28 '24 06:04 IdleIdiot

Facing the same issue. Any update on why this is happening?

May 13 '24 08:05 canamika27

Facing the same issue. This comment helped: https://github.com/huggingface/text-generation-inference/issues/1300#issuecomment-1859867587
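
The linked comment is not reproduced here, so the following is a separate sketch rather than a summary of it. One commonly used way to stop TGI from reserving all GPU memory is the launcher's --cuda-memory-fraction option, optionally combined with --num-shard to use fewer GPUs. The helper name build_tgi_command and the values 0.5 and 2 are hypothetical and chosen only for illustration; the result is meant to replace the string in the command field of docker-compose.yml.

```python
def build_tgi_command(model_id, cuda_memory_fraction=None, num_shard=None):
    """Build the launcher argument string for the compose `command:` field.

    Hypothetical helper: only the flag names come from the TGI launcher.
    """
    args = ["--model-id", model_id]
    if num_shard is not None:
        if num_shard < 1:
            raise ValueError("num_shard must be at least 1")
        args += ["--num-shard", str(num_shard)]
    if cuda_memory_fraction is not None:
        # The fraction applies to the memory of each visible GPU.
        if not 0.0 < cuda_memory_fraction <= 1.0:
            raise ValueError("cuda_memory_fraction must be in (0, 1]")
        args += ["--cuda-memory-fraction", str(cuda_memory_fraction)]
    return " ".join(args)


# Illustrative values only: two shards, each limited to half of its GPU.
command = build_tgi_command(
    "/data/llama2/llama2-chat-13b-hf",
    cuda_memory_fraction=0.5,
    num_shard=2,
)
print(command)
```

A smaller fraction leaves less room for the KV cache, so the maximum batch size the server can handle shrinks accordingly. Whether these options behave the same way in image 1.4 should be checked against the launcher's --help output.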

May 30 '24 13:05 pfan94
