Constrained system/cpu RAM prohibits loading even with enough GPU Memory

Open ptschandl opened this issue 1 year ago • 2 comments

System Info

The issue occurred with OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5 on a machine with 32GB RAM and a single RTX 6000 Ada (48GB). Shard loading aborts under text-generation-inference, whereas loading the model directly with Hugging Face transformers works, even without 8-bit quantization:

Does not work

$ model=OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5
$ num_shard=1
$ volume=$PWD/data
$ docker run --gpus all --shm-size 2g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id $model --num-shard $num_shard --env

Works

Python 3.8.11 (default, Aug  3 2021, 15:09:35)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from transformers import AutoModelForCausalLM
>>> model_name = "OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5"
>>> model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
Loading checkpoint shards: 100%|██████████| 3/3 [00:59<00:00, 19.76s/it]
>>> model
GPTNeoXForCausalLM(
  (gpt_neox): GPTNeoXModel(
   ...

$ nvidia-smi
Mon Jun  5 18:07:48 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX 6000...  Off  | 00000000:01:00.0  On |                  Off |
| 30%   45C    P8    26W / 300W |  47325MiB / 49140MiB |     15%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2680      G   /usr/lib/xorg/Xorg                 69MiB |
|    0   N/A  N/A      5034      C   python                          47252MiB |
+-----------------------------------------------------------------------------+
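
As a rough sanity check on the numbers above, the weight footprint can be estimated from the parameter count and the bytes per parameter. The sketch below assumes roughly 12e9 parameters (read off the "12b" in the model name, not measured) and ignores activations, the CUDA context, and any loader overhead. The helper name and the dtype table are illustrative, not part of transformers or text-generation-inference.

```python
# Bytes needed to store one parameter for a few common dtypes.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_footprint_gib(num_params: float, dtype: str) -> float:
    """Return the approximate size of the raw weights in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    assumed_params = 12e9  # assumption: taken from the "12b" in the model name
    for dtype in BYTES_PER_PARAM:
        size = weight_footprint_gib(assumed_params, dtype)
        print(f"{dtype}: {size:.1f} GiB")
```

Under these assumptions, float32 weights come to about 44.7 GiB, which is in the same range as the 47252MiB held by the python process in the nvidia-smi output, so the transformers session was presumably loading in full precision. The same estimate for float16 is about half that, which would nominally fit in 32GB of system RAM only if nothing else needs a second copy of the weights during loading.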

Information

[X] Docker
[ ] The CLI directly

Tasks

[X] An officially supported command
[ ] My own modifications

Reproduction

To make the problem easier to reproduce, the following commands use bigscience/bloom-560m with an artificially constrained container memory limit (--memory):

Works

$ model=bigscience/bloom-560m
$ num_shard=1
$ volume=$PWD/data

$ docker run --memory=16g --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id $model --num-shard $num_shard --env
2023-06-05T16:09:59.218696Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.69.0
Commit sha: e7248fe90e27c7c8e39dd4cac5874eb9f96ab182
Docker label: sha-e7248fe
nvidia-smi:
Mon Jun  5 16:09:59 2023       
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  NVIDIA RTX 6000...  Off  | 00000000:01:00.0  On |                  Off |
   | 30%   44C    P8    26W / 300W |     70MiB / 49140MiB |      3%      Default |
   |                               |                      |                  N/A |
   +-------------------------------+----------------------+----------------------+
                                                                                  
   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   +-----------------------------------------------------------------------------+
2023-06-05T16:09:59.218719Z  INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, sharded: None, num_shard: Some(1), quantize: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: true }
2023-06-05T16:09:59.218845Z  INFO text_generation_launcher: Starting download process.
2023-06-05T16:10:02.259356Z  INFO download: text_generation_launcher: Files are already present on the host. Skipping download.

2023-06-05T16:10:02.824149Z  INFO text_generation_launcher: Successfully downloaded weights.
2023-06-05T16:10:02.824265Z  INFO text_generation_launcher: Starting shard 0
2023-06-05T16:10:12.835066Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:10:22.844677Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:10:32.853804Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:10:42.864204Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:10:52.874863Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:10:55.290781Z  INFO shard-manager: text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
 rank=0
2023-06-05T16:10:55.377836Z  INFO text_generation_launcher: Shard 0 ready in 52.553095127s
2023-06-05T16:10:55.471066Z  INFO text_generation_launcher: Starting Webserver
2023-06-05T16:10:56.788354Z  INFO text_generation_router: router/src/main.rs:178: Connected

$ docker stats
CONTAINER ID   NAME                CPU %     MEM USAGE / LIMIT   MEM %     NET I/O         BLOCK I/O   PIDS
de4cafd33d62   distracted_banach   0.00%     2.652GiB / 16GiB    16.58%    15MB / 81.2kB   0B / 0B     34

Does not work

$ model=bigscience/bloom-560m
$ num_shard=1
$ volume=$PWD/data

$ docker run --memory=1g --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id $model --num-shard $num_shard --env
2023-06-05T16:13:41.108681Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.69.0
Commit sha: e7248fe90e27c7c8e39dd4cac5874eb9f96ab182
Docker label: sha-e7248fe
nvidia-smi:
Mon Jun  5 16:13:40 2023       
   +-----------------------------------------------------------------------------+
   | NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
   |-------------------------------+----------------------+----------------------+
   | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
   | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
   |                               |                      |               MIG M. |
   |===============================+======================+======================|
   |   0  NVIDIA RTX 6000...  Off  | 00000000:01:00.0  On |                  Off |
   | 30%   46C    P8    26W / 300W |     70MiB / 49140MiB |      3%      Default |
   |                               |                      |                  N/A |
   +-------------------------------+----------------------+----------------------+
                                                                                  
   +-----------------------------------------------------------------------------+
   | Processes:                                                                  |
   |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
   |        ID   ID                                                   Usage      |
   |=============================================================================|
   +-----------------------------------------------------------------------------+
2023-06-05T16:13:41.108699Z  INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, sharded: None, num_shard: Some(1), quantize: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: true }
2023-06-05T16:13:41.108774Z  INFO text_generation_launcher: Starting download process.
2023-06-05T16:13:44.665356Z  INFO download: text_generation_launcher: Files are already present on the host. Skipping download.

2023-06-05T16:13:44.913616Z  INFO text_generation_launcher: Successfully downloaded weights.
2023-06-05T16:13:44.913968Z  INFO text_generation_launcher: Starting shard 0
2023-06-05T16:13:54.925231Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:14:04.936122Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:14:14.945694Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:14:24.957642Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-05T16:14:31.054288Z ERROR text_generation_launcher: Shard 0 failed to start:

2023-06-05T16:14:31.054824Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart

$ docker stats
CONTAINER ID   NAME             CPU %     MEM USAGE / LIMIT   MEM %     NET I/O           BLOCK I/O   PIDS
7f0768e9f43a   modest_vaughan   0.00%     1023MiB / 1GiB      99.91%    20.8kB / 3.74kB   0B / 0B     6
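
The two docker stats readings can be compared by converting the MEM USAGE / LIMIT column to a ratio. This is a minimal sketch using only the standard library; the function names are made up for illustration, and it only handles the binary units that appear in this report.

```python
import re

# Binary unit multipliers as printed by `docker stats`.
_UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30}


def to_bytes(text: str) -> float:
    """Convert a value such as '1023MiB' or '2.652GiB' to bytes."""
    match = re.fullmatch(r"\s*([\d.]+)\s*(B|KiB|MiB|GiB)\s*", text)
    if match is None:
        raise ValueError(f"unrecognised memory value: {text!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def memory_pressure(field: str) -> float:
    """Return used/limit for a 'MEM USAGE / LIMIT' field."""
    used, limit = (to_bytes(part) for part in field.split("/"))
    return used / limit


if __name__ == "__main__":
    for field in ("2.652GiB / 16GiB", "1023MiB / 1GiB"):
        print(f"{field}: {memory_pressure(field):.4f}")
```

For the runs above this gives roughly 0.17 for the 16g container and about 0.999 for the 1g container, which suggests the failing shard ran into the container memory limit rather than anything GPU-related.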

Expected behavior

The model should load directly onto the GPU, which has enough memory for it, even when system (CPU) RAM is constrained.

Jun 05 '23 16:06 ptschandl
