
Deepseek R1 fails to start on Gaudi 2

danielfleischer opened this issue 7 months ago · 1 comment

System Info

+-----------------------------------------------------------------------------+
| HL-SMI Version:                              hl-1.20.0-fw-58.1.1.1          |
| Driver Version:                                     1.20.0-bd87f71          |
| Nic Driver Version:                                 1.20.0-e4fe12d          |
|-------------------------------+----------------------+----------------------+
| AIP  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncor-Events|
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | AIP-Util  Compute M. |
|===============================+======================+======================|
|   0  HL-225              N/A  | 0000:33:00.0     N/A |                   0  |
| N/A   28C   P0   84W /  600W  |   768MiB /  98304MiB |     0%            0% |
|-------------------------------+----------------------+----------------------+
|   1  HL-225              N/A  | 0000:9a:00.0     N/A |                   0  |
| N/A   25C   P0   76W /  600W  |   768MiB /  98304MiB |     0%            0% |
|-------------------------------+----------------------+----------------------+
|   2  HL-225              N/A  | 0000:9b:00.0     N/A |                   0  |
| N/A   33C   P0   88W /  600W  |   768MiB /  98304MiB |     0%            0% |
|-------------------------------+----------------------+----------------------+
|   3  HL-225              N/A  | 0000:34:00.0     N/A |                   0  |
| N/A   34C   P0  100W /  600W  |   768MiB /  98304MiB |     0%            0% |
|-------------------------------+----------------------+----------------------+
|   4  HL-225              N/A  | 0000:b3:00.0     N/A |                   0  |
| N/A   29C   P0   96W /  600W  |   768MiB /  98304MiB |     0%            0% |
|-------------------------------+----------------------+----------------------+
|   5  HL-225              N/A  | 0000:4d:00.0     N/A |                   0  |
| N/A   28C   P0   89W /  600W  |   768MiB /  98304MiB |     0%            0% |
|-------------------------------+----------------------+----------------------+
|   6  HL-225              N/A  | 0000:b4:00.0     N/A |                   0  |
| N/A   26C   P0   83W /  600W  |   768MiB /  98304MiB |     0%            0% |
|-------------------------------+----------------------+----------------------+
|   7  HL-225              N/A  | 0000:4e:00.0     N/A |                   0  |
| N/A   26C   P0   85W /  600W  |   768MiB /  98304MiB |     0%            0% |
|-------------------------------+----------------------+----------------------+
| Compute Processes:                                               AIP Memory |
|  AIP       PID   Type   Process name                             Usage      |
|=============================================================================|
|   0        N/A   N/A    N/A                                      N/A        |
|   1        N/A   N/A    N/A                                      N/A        |
|   2        N/A   N/A    N/A                                      N/A        |
|   3        N/A   N/A    N/A                                      N/A        |
|   4        N/A   N/A    N/A                                      N/A        |
|   5        N/A   N/A    N/A                                      N/A        |
|   6        N/A   N/A    N/A                                      N/A        |
|   7        N/A   N/A    N/A                                      N/A        |
+=============================================================================+

Python: 3.10.12

Command:

model=deepseek-ai/DeepSeek-R1

sudo docker run -p 8080:80 \
    --runtime=habana \
    --cap-add=sys_nice \
    --ipc=host \
    -v $HF_HOME:/data \
    ghcr.io/huggingface/text-generation-inference:3.2.1-gaudi \
    --model-id $model \
    --sharded true --num-shard 8 --max-batch-size 4 \
    --quantize fp8

Error message:

2025-03-26T12:08:41.546970Z  INFO hf_hub: Using token file found "/data/token"
2025-03-26T12:08:42.789349Z  WARN text_generation_launcher::gpu: Cannot determine GPU compute capability: AssertionError: Torch not compiled with CUDA enabled
2025-03-26T12:08:42.789369Z  INFO text_generation_launcher: Using attention default - Prefix caching 0
2025-03-26T12:08:42.789375Z  INFO text_generation_launcher: Sharding model on 8 processes
2025-03-26T12:08:42.789892Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4096
2025-03-26T12:08:42.789897Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2025-03-26T12:08:42.789972Z  INFO download: text_generation_launcher: Starting check and download process for deepseek-ai/DeepSeek-R1
2025-03-26T12:08:47.065399Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2025-03-26T12:08:47.702335Z  INFO download: text_generation_launcher: Successfully downloaded weights for deepseek-ai/DeepSeek-R1
2025-03-26T12:08:47.702547Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2025-03-26T12:08:49.123272Z  INFO text_generation_launcher: Running on HPU, the launcher will not do any sharding as actual sharding is done in the server
2025-03-26T12:08:51.600009Z  INFO text_generation_launcher: Using prefix caching = False
2025-03-26T12:08:51.600027Z  INFO text_generation_launcher: Using Attention = default
2025-03-26T12:08:54.592281Z  WARN text_generation_launcher: FBGEMM fp8 kernels are not installed.
2025-03-26T12:08:54.597014Z  INFO text_generation_launcher: quantize=fp8
2025-03-26T12:08:55.515925Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

2025-03-26 12:08:48.984 | INFO     | text_generation_server.utils.import_utils:<module>:75 - Detected system cpu
/usr/local/lib/python3.10/dist-packages/text_generation_server/utils/sgmv.py:18: UserWarning: Could not import SGMV kernel from Punica, falling back to loop.
  warnings.warn("Could not import SGMV kernel from Punica, falling back to loop.")
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
Traceback (most recent call last):

  File "/usr/local/bin/text-generation-server", line 8, in <module>
    sys.exit(app())

  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 109, in serve
    raise RuntimeError(

RuntimeError: Only 1 can be set between `dtype` and `quantize`, as they both decide how goes the final model.
 rank=0
2025-03-26T12:08:55.527268Z ERROR text_generation_launcher: Shard 0 failed to start
2025-03-26T12:08:55.527283Z  INFO text_generation_launcher: Shutting down shards
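The RuntimeError indicates that both `dtype` and `--quantize` end up set when the server starts; on HPU the backend appears to pick a default dtype on its own, which then collides with `--quantize fp8`. A possible workaround sketch (an assumption, not a confirmed fix: it simply drops the conflicting flag and assumes the unquantized weights fit across the 8 cards):

```shell
# Untested sketch: same command as above without --quantize fp8,
# so only the backend's default dtype is in effect.
model=deepseek-ai/DeepSeek-R1

sudo docker run -p 8080:80 \
    --runtime=habana \
    --cap-add=sys_nice \
    --ipc=host \
    -v $HF_HOME:/data \
    ghcr.io/huggingface/text-generation-inference:3.2.1-gaudi \
    --model-id $model \
    --sharded true --num-shard 8 --max-batch-size 4
```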

Information

  • [x] Docker
  • [ ] The CLI directly

Tasks

  • [x] An officially supported command
  • [ ] My own modifications

Reproduction

Download the model using huggingface-cli download deepseek-ai/DeepSeek-R1.

Run the Docker TGI command shown above.

Expected behavior

The TGI server initializes correctly.

danielfleischer · Mar 26 '25 12:03