text-generation-inference
DeepSeek-R1 fails to start on Gaudi 2
System Info
```
+-----------------------------------------------------------------------------+
| HL-SMI Version: hl-1.20.0-fw-58.1.1.1 |
| Driver Version: 1.20.0-bd87f71 |
| Nic Driver Version: 1.20.0-e4fe12d |
|-------------------------------+----------------------+----------------------+
| AIP Name Persistence-M| Bus-Id Disp.A | Volatile Uncor-Events|
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | AIP-Util Compute M. |
|===============================+======================+======================|
| 0 HL-225 N/A | 0000:33:00.0 N/A | 0 |
| N/A 28C P0 84W / 600W | 768MiB / 98304MiB | 0% 0% |
|-------------------------------+----------------------+----------------------+
| 1 HL-225 N/A | 0000:9a:00.0 N/A | 0 |
| N/A 25C P0 76W / 600W | 768MiB / 98304MiB | 0% 0% |
|-------------------------------+----------------------+----------------------+
| 2 HL-225 N/A | 0000:9b:00.0 N/A | 0 |
| N/A 33C P0 88W / 600W | 768MiB / 98304MiB | 0% 0% |
|-------------------------------+----------------------+----------------------+
| 3 HL-225 N/A | 0000:34:00.0 N/A | 0 |
| N/A 34C P0 100W / 600W | 768MiB / 98304MiB | 0% 0% |
|-------------------------------+----------------------+----------------------+
| 4 HL-225 N/A | 0000:b3:00.0 N/A | 0 |
| N/A 29C P0 96W / 600W | 768MiB / 98304MiB | 0% 0% |
|-------------------------------+----------------------+----------------------+
| 5 HL-225 N/A | 0000:4d:00.0 N/A | 0 |
| N/A 28C P0 89W / 600W | 768MiB / 98304MiB | 0% 0% |
|-------------------------------+----------------------+----------------------+
| 6 HL-225 N/A | 0000:b4:00.0 N/A | 0 |
| N/A 26C P0 83W / 600W | 768MiB / 98304MiB | 0% 0% |
|-------------------------------+----------------------+----------------------+
| 7 HL-225 N/A | 0000:4e:00.0 N/A | 0 |
| N/A 26C P0 85W / 600W | 768MiB / 98304MiB | 0% 0% |
|-------------------------------+----------------------+----------------------+
| Compute Processes: AIP Memory |
| AIP PID Type Process name Usage |
|=============================================================================|
| 0 N/A N/A N/A N/A |
| 1 N/A N/A N/A N/A |
| 2 N/A N/A N/A N/A |
| 3 N/A N/A N/A N/A |
| 4 N/A N/A N/A N/A |
| 5 N/A N/A N/A N/A |
| 6 N/A N/A N/A N/A |
| 7 N/A N/A N/A N/A |
+=============================================================================+
```
Python: 3.10.12
Command:
```shell
# Set the model id first, on its own line, so that $model is already
# assigned when the docker run command below expands it.
model=deepseek-ai/DeepSeek-R1

sudo docker run -p 8080:80 \
    --runtime=habana \
    --cap-add=sys_nice \
    --ipc=host \
    -v $HF_HOME:/data \
    ghcr.io/huggingface/text-generation-inference:3.2.1-gaudi \
    --model-id $model \
    --sharded true --num-shard 8 --max-batch-size 4 \
    --quantize fp8
```
Error message:
```
2025-03-26T12:08:41.546970Z  INFO hf_hub: Using token file found "/data/token"
2025-03-26T12:08:42.789349Z  WARN text_generation_launcher::gpu: Cannot determine GPU compute capability: AssertionError: Torch not compiled with CUDA enabled
2025-03-26T12:08:42.789369Z  INFO text_generation_launcher: Using attention default - Prefix caching 0
2025-03-26T12:08:42.789375Z  INFO text_generation_launcher: Sharding model on 8 processes
2025-03-26T12:08:42.789892Z  INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4096
2025-03-26T12:08:42.789897Z  INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2025-03-26T12:08:42.789972Z  INFO download: text_generation_launcher: Starting check and download process for deepseek-ai/DeepSeek-R1
2025-03-26T12:08:47.065399Z  INFO text_generation_launcher: Files are already present on the host. Skipping download.
2025-03-26T12:08:47.702335Z  INFO download: text_generation_launcher: Successfully downloaded weights for deepseek-ai/DeepSeek-R1
2025-03-26T12:08:47.702547Z  INFO shard-manager: text_generation_launcher: Starting shard rank=0
2025-03-26T12:08:49.123272Z  INFO text_generation_launcher: Running on HPU, the launcher will not do any sharding as actual sharding is done in the server
2025-03-26T12:08:51.600009Z  INFO text_generation_launcher: Using prefix caching = False
2025-03-26T12:08:51.600027Z  INFO text_generation_launcher: Using Attention = default
2025-03-26T12:08:54.592281Z  WARN text_generation_launcher: FBGEMM fp8 kernels are not installed.
2025-03-26T12:08:54.597014Z  INFO text_generation_launcher: quantize=fp8
2025-03-26T12:08:55.515925Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
2025-03-26 12:08:48.984 | INFO | text_generation_server.utils.import_utils:<module>:75 - Detected system cpu
/usr/local/lib/python3.10/dist-packages/text_generation_server/utils/sgmv.py:18: UserWarning: Could not import SGMV kernel from Punica, falling back to loop.
  warnings.warn("Could not import SGMV kernel from Punica, falling back to loop.")
/usr/local/lib/python3.10/dist-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
Traceback (most recent call last):
  File "/usr/local/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 109, in serve
    raise RuntimeError(
RuntimeError: Only 1 can be set between `dtype` and `quantize`, as they both decide how goes the final model.
 rank=0
2025-03-26T12:08:55.527268Z ERROR text_generation_launcher: Shard 0 failed to start
2025-03-26T12:08:55.527283Z  INFO text_generation_launcher: Shutting down shards
```
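For context, the traceback shows the `RuntimeError` being raised by an argument-validation step in `text_generation_server/cli.py` (line 109 of `serve`): `dtype` and `quantize` are treated as mutually exclusive because both decide the final weight representation. The following is a minimal, hypothetical sketch of that kind of check, not the actual TGI source:

```python
from typing import Optional


def check_dtype_and_quantize(dtype: Optional[str], quantize: Optional[str]) -> None:
    """Reject launches that set both `dtype` and `quantize`.

    Both options decide the final weight representation, so at most one
    may be given. Mirrors the error message seen in the log above.
    """
    if dtype is not None and quantize is not None:
        raise RuntimeError(
            "Only 1 can be set between `dtype` and `quantize`, "
            "as they both decide how goes the final model."
        )


# Only one of the two options set: passes the check.
check_dtype_and_quantize(dtype=None, quantize="fp8")

# Both set (e.g. a dtype supplied elsewhere plus an explicit
# --quantize fp8): reproduces the reported failure.
try:
    check_dtype_and_quantize(dtype="bfloat16", quantize="fp8")
except RuntimeError as e:
    print(e)
```

If this sketch reflects the real check, the launch above fails because a `dtype` is already set somewhere by the time `--quantize fp8` is processed, even though the command only passes `--quantize`.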
Information
- [x] Docker
- [ ] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
1. Download the model with `huggingface-cli download deepseek-ai/DeepSeek-R1`.
2. Run the TGI Docker command shown above.
Expected behavior
The TGI server initializes correctly and starts serving the model.