Model warmup fails after adding Triton indexing kernels
System Info
I was using v2.3.1 via Docker and everything was working. After updating to later versions, including the latest, TGI no longer starts and fails with the following error:
2024-12-12T14:26:52.973549Z INFO hf_hub: Token file not found "/data/token"
2024-12-12T14:26:54.846408Z INFO text_generation_launcher: Forcing attention to 'flashdecoding' because head dim is not supported by flashinfer, also disabling prefix caching
2024-12-12T14:26:54.846426Z INFO text_generation_launcher: Using attention flashdecoding - Prefix caching 0
2024-12-12T14:26:54.846433Z INFO text_generation_launcher: Sharding model on 2 processes
2024-12-12T14:26:54.931439Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 30821
2024-12-12T14:26:54.931470Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-12-12T14:26:54.931727Z INFO download: text_generation_launcher: Starting check and download process for microsoft/Phi-3.5-mini-instruct
2024-12-12T14:26:57.914690Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-12-12T14:26:58.250499Z INFO download: text_generation_launcher: Successfully downloaded weights for microsoft/Phi-3.5-mini-instruct
2024-12-12T14:26:58.251011Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-12-12T14:26:58.251055Z INFO shard-manager: text_generation_launcher: Starting shard rank=1
2024-12-12T14:27:00.870304Z INFO text_generation_launcher: Using prefix caching = False
2024-12-12T14:27:00.870362Z INFO text_generation_launcher: Using Attention = flashdecoding
2024-12-12T14:27:06.425419Z INFO text_generation_launcher: Using prefill chunking = True
2024-12-12T14:27:06.535239Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-0
2024-12-12T14:27:06.536669Z INFO text_generation_launcher: Server started at unix:///tmp/text-generation-server-1
2024-12-12T14:27:06.572585Z INFO shard-manager: text_generation_launcher: Shard ready in 8.307980962s rank=0
2024-12-12T14:27:06.578046Z INFO shard-manager: text_generation_launcher: Shard ready in 8.308372036s rank=1
2024-12-12T14:27:06.657793Z INFO text_generation_launcher: Starting Webserver
2024-12-12T14:27:06.739409Z INFO text_generation_router_v3: backends/v3/src/lib.rs:125: Warming up model
2024-12-12T14:27:06.863722Z INFO text_generation_launcher: Using optimized Triton indexing kernels.
2024-12-12T14:27:07.034243Z ERROR text_generation_launcher: Method Warmup encountered an error.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 321, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 728, in main
return _main(
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 197, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 703, in wrapper
return callback(**use_params)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 117, in serve
server.serve(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 315, in serve
asyncio.run(
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
self._run_once()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
handle._run()
File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.11/site-packages/text_generation_server/interceptor.py", line 24, in intercept
return await response
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 132, in Warmup
batch = self.model.batch_type.from_pb(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 495, in from_pb
return cls.from_tokenized(pb, tokenizer, batch_tokenized_inputs, dtype, device)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 427, in from_tokenized
block_tables_to_padded(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/metadata_kernels.py", line 42, in block_tables_to_padded
triton_block_tables_to_padded[grid](
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/jit.py", line 345, in <lambda>
return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/jit.py", line 607, in run
device = driver.active.get_current_device()
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/driver.py", line 23, in __getattr__
self._initialize_obj()
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
self._obj = self._init_fn()
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/driver.py", line 9, in _create_driver
return actives[0]()
File "/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 371, in __init__
self.utils = CudaUtils() # TODO: make static
File "/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 80, in __init__
mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
File "/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/driver.py", line 57, in compile_module_from_src
so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
File "/opt/conda/lib/python3.11/site-packages/triton/runtime/build.py", line 48, in _build
ret = subprocess.check_call(cc_cmd)
File "/opt/conda/lib/python3.11/subprocess.py", line 413, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/bin/gcc', '/tmp/tmpx2wgfsg0/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmpx2wgfsg0/cuda_utils.cpython-311-x86_64-linux-gnu.so', '-lcuda', '-L/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/lib', '-L/usr/lib64', '-I/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpx2wgfsg0', '-I/opt/conda/include/python3.11']' returned non-zero exit status 1.
2024-12-12T14:27:07.034772Z ERROR warmup{max_input_length=None max_prefill_tokens=30821 max_total_tokens=None max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Command '['/usr/bin/gcc', '/tmp/tmpx2wgfsg0/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmpx2wgfsg0/cuda_utils.cpython-311-x86_64-linux-gnu.so', '-lcuda', '-L/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/lib', '-L/usr/lib64', '-I/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpx2wgfsg0', '-I/opt/conda/include/python3.11']' returned non-zero exit status 1.
[The same warmup error and traceback are then raised a second time by the other shard, identical except for the temporary build directory /tmp/tmp6j5j7_4h.]
Error: Backend(Warmup(Generation("Command '['/usr/bin/gcc', '/tmp/tmp6j5j7_4h/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmp6j5j7_4h/cuda_utils.cpython-311-x86_64-linux-gnu.so', '-lcuda', '-L/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/lib', '-L/usr/lib64', '-I/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/include', '-I/tmp/tmp6j5j7_4h', '-I/opt/conda/include/python3.11']' returned non-zero exit status 1.")))
2024-12-12T14:27:07.117285Z ERROR text_generation_launcher: Webserver Crashed
2024-12-12T14:27:07.117316Z INFO text_generation_launcher: Shutting down shards
2024-12-12T14:27:07.173251Z INFO shard-manager: text_generation_launcher: Terminating shard rank=0
2024-12-12T14:27:07.173312Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=0
2024-12-12T14:27:07.178761Z INFO shard-manager: text_generation_launcher: Terminating shard rank=1
2024-12-12T14:27:07.178820Z INFO shard-manager: text_generation_launcher: Waiting for shard to gracefully shutdown rank=1
2024-12-12T14:27:08.279806Z INFO shard-manager: text_generation_launcher: shard terminated rank=1
Error: WebserverFailed
2024-12-12T14:27:08.474404Z INFO shard-manager: text_generation_launcher: shard terminated rank=0
This is my nvidia-smi output:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 On | 00000000:0D:00.0 Off | 0 |
| N/A 54C P0 30W / 72W | 1557MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA L4 On | 00000000:37:00.0 Off | 0 |
| N/A 55C P0 28W / 72W | 21989MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA L4 On | 00000000:4A:00.0 Off | 0 |
| N/A 39C P0 27W / 72W | 21659MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA L4 On | 00000000:61:00.0 Off | 0 |
| N/A 37C P0 27W / 72W | 19965MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA L4 On | 00000000:A0:00.0 Off | 0 |
| N/A 46C P8 17W / 72W | 4MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA L4 On | 00000000:B5:00.0 Off | 0 |
| N/A 48C P0 22W / 72W | 193MiB / 23034MiB | 1% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA L4 On | 00000000:CA:00.0 Off | 0 |
| N/A 28C P8 12W / 72W | 1MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA L4 On | 00000000:E1:00.0 Off | 0 |
| N/A 26C P8 12W / 72W | 1MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 137174 C /app/.venv/bin/python 1548MiB |
| 1 N/A N/A 13513 C /opt/conda/bin/python3.11 21980MiB |
| 2 N/A N/A 13518 C /opt/conda/bin/python3.11 21650MiB |
| 3 N/A N/A 13523 C /opt/conda/bin/python3.11 19956MiB |
| 5 N/A N/A 2150019 C /opt/conda/bin/python3.11 184MiB |
+-----------------------------------------------------------------------------------------+
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
Here is the TGI env:
{
model_id: "microsoft/Phi-3.5-mini-instruct",
revision: None,
validation_workers: 2,
sharded: None,
num_shard: Some(
2,
),
quantize: None,
speculate: None,
dtype: None,
kv_cache_dtype: None,
trust_remote_code: false,
max_concurrent_requests: 128,
max_best_of: 2,
max_stop_sequences: 4,
max_top_n_tokens: 5,
max_input_tokens: None,
max_input_length: None,
max_total_tokens: None,
waiting_served_ratio: 0.3,
max_batch_prefill_tokens: None,
max_batch_total_tokens: None,
max_waiting_tokens: 20,
max_batch_size: None,
cuda_graphs: None,
hostname: "06ee66ffa08d",
port: 3000,
shard_uds_path: "/tmp/text-generation-server",
master_addr: "localhost",
master_port: 29500,
huggingface_hub_cache: None,
weights_cache_override: None,
disable_custom_kernels: false,
cuda_memory_fraction: 1.0,
rope_scaling: None,
rope_factor: None,
json_output: false,
otlp_endpoint: None,
otlp_service_name: "text-generation-inference.router",
cors_allow_origin: [],
api_key: None,
watermark_gamma: None,
watermark_delta: None,
ngrok: false,
ngrok_authtoken: None,
ngrok_edge: None,
tokenizer_config_path: None,
disable_grammar_support: false,
env: false,
max_client_batch_size: 4,
lora_adapters: None,
usage_stats: On,
payload_limit: 2000000,
enable_prefill_logprobs: false,
}
And here is how I'm running the container (via podman):
podman create --name=tgi_container --security-opt label=disable --label io.podman.compose.config-hash=XXXXXXXX --label io.podman.compose.project=some-deployment --label io.podman.compose.version=1.0.6 --label [email protected] --label com.docker.compose.project=some-deployment --label com.docker.compose.project.working_dir=/data/some-deployment --label com.docker.compose.project.config_files=docker-compose.yml --label com.docker.compose.container-number=1 --label com.docker.compose.service=tgi --device nvidia.com/gpu=4 --device nvidia.com/gpu=5 -e HUGGING_FACE_HUB_TOKEN=hf_XXXXXXX -e FLASH_DECODING=1 -e PREFILL_CHUNKING=1 -e NCCL_DEBUG=INFO -v /data/tgi/data:/data --net some-deployment_api --network-alias tgi --expose 3000 -p 3000:3000 --shm-size 10gb --restart on-failure ghcr.io/huggingface/text-generation-inference:3.0.1 --port 3000 --model-id microsoft/Phi-3.5-mini-instruct --num-shard 2
This command is generated on my system from a docker-compose file.
Expected behavior
The TGI server should start correctly, as it did before the Triton indexing kernels were added!
I have the same problem. I assume that you and the others who reported the issues below are using the Docker image, and that the recent releases rely on the Triton indexing kernels. Warmup now involves compiling C files that call into Python, but the Python headers are not available to the compiler, so we hit this error.
In simple terms, I assume Python.h is not available when these shared-object files are compiled. After reviewing the Dockerfile, it appears that python3.11-dev is not included in the final image, which is why Python.h would be missing.
Just guessing; my "sure" value is about 0.6 😁🤷🏼♂️
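A quick way to test that hypothesis (just a sketch; it assumes the conda Python path that appears in the traceback and the image tag from the report above) is to look for the header directly inside the image:

podman run --rm --entrypoint bash ghcr.io/huggingface/text-generation-inference:3.0.1 \
  -c 'ls -l /opt/conda/include/python3.11/Python.h && echo "Python.h present" || echo "Python.h missing"'

If the header turns out to be present, the failing gcc step is probably tripping over something else on that include/library path (e.g. cuda.h or libcuda) rather than Python.h.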
This seems to be the same problem as in the following issues:
- #2776
- ~~#2835~~ (Not related)
#2835 is not related... it's about splitting the model across 2 vs. 4 H100 GPUs, with no Python stack trace at all.
But thanks @KreshLaDoge
Update: I was able to get it working by changing the base image to the devel variant so that it matches the builder image.
This line here becomes FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS base.
I have to rebuild the image, which takes time and increases its size, but now it works!!
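For anyone who wants to try the same workaround, here is a rough sketch (the exact FROM line and stage name depend on the Dockerfile in the TGI repo at the tag you build from, and build flags may vary - check the repo docs):

git clone https://github.com/huggingface/text-generation-inference
cd text-generation-inference
git checkout v3.0.1
# Edit the Dockerfile: point the runtime stage at the devel image mentioned above, e.g.
#   FROM nvidia/cuda:12.4.1-devel-ubuntu22.04 AS base
docker build -t tgi:3.0.1-devel-base .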
I don't know how to reproduce the issue; Phi-3.5 works perfectly for me under 3.0.1.
Is everyone here using podman? I don't see why it should make any difference, though.
Can everyone also confirm they are using 3.0.1 and not latest?
I can confirm the issue for me with 3.0.1, 3.0, and 2.4.0
I also had issues with 3.0.1
I suspect that it's about the missing Python.h, which would also explain why it worked for @YaserJaradeh when he changed the Ubuntu base image to the devel variant. But it could be something else.
Currently, I'm forced to assign GPUs to the container manually rather than through the NVIDIA container toolkit, so it might be related if others experiencing the same issue are using vGPUs, for example 🤷
Can you elaborate? It might be a potential culprit.
Here is an example of my docker compose and the way we assign GPUs - don't judge me, there are reasons why I can't use the container toolkit 🤷
Anyway, I doubt that anyone else experiencing this issue has a similar configuration.
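The compose file itself isn't reproduced here, so purely as a hypothetical illustration (device and library paths are examples, not our exact setup): assigning GPUs without the container toolkit usually means passing the device nodes and the host driver libraries into the container by hand, roughly like this:

podman run --rm \
  --device /dev/nvidia0 --device /dev/nvidia1 \
  --device /dev/nvidiactl --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools \
  -v /usr/lib64/libcuda.so.1:/usr/lib64/libcuda.so.1:ro \
  -v /usr/lib64/libnvidia-ml.so.1:/usr/lib64/libnvidia-ml.so.1:ro \
  ghcr.io/huggingface/text-generation-inference:3.0.1 \
  --model-id microsoft/Phi-3.5-mini-instruct --num-shard 2

One thing worth checking with this kind of setup: the failing gcc command above links with plain -lcuda, and the linker needs an unversioned libcuda.so on its search path, not just libcuda.so.1; if only the versioned file is visible inside the container, that link step fails.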
I also assign the GPUs manually to my container
I get the same errors with the latest Docker image. So far I have tested Mixtral 8x7B and Llama 3.3, and both hit the same error.
In short, this command
podman run --net=host --device nvidia.com/gpu=0 --device nvidia.com/gpu=1 --shm-size=10g -v /data/cache/huggingface:/data --rm ghcr.io/huggingface/text-generation-inference --model-id meta-llama/Llama-3.3-70B-Instruct --port 8000 --num-shard 2
leads to the following error:
2025-01-13T09:58:46.610304Z ERROR warmup{max_input_length=None max_prefill_tokens=4226 max_total_tokens=None max_batch_size=None}:warmup: text_generation_router_v3::client: backends/v3/src/client/mod.rs:45: Server error: Command '['/usr/bin/gcc', '/tmp/tmpspnxcoyi/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmpspnxcoyi/cuda_utils.cpython-311-x86_64-linux-gnu.so', '-lcuda', '-L/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/lib', '-L/usr/lib64', '-I/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpspnxcoyi', '-I/opt/conda/include/python3.11']' returned non-zero exit status 1.
Error: Backend(Warmup(Generation("Command '['/usr/bin/gcc', '/tmp/tmpspnxcoyi/main.c', '-O3', '-shared', '-fPIC', '-o', '/tmp/tmpspnxcoyi/cuda_utils.cpython-311-x86_64-linux-gnu.so', '-lcuda', '-L/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/lib', '-L/usr/lib64', '-I/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/include', '-I/tmp/tmpspnxcoyi', '-I/opt/conda/include/python3.11']' returned non-zero exit status 1.")))
2025-01-13T09:58:46.669343Z ERROR text_generation_launcher: Webserver Crashed
nvidia-smi:
$ nvidia-smi
Mon Jan 13 11:09:28 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000000:27:00.0 Off | 0 |
| N/A 35C P0 45W / 300W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100 80GB PCIe On | 00000000:A3:00.0 Off | 0 |
| N/A 35C P0 45W / 300W | 1MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A100-PCIE-40GB On | 00000000:C3:00.0 Off | 0 |
| N/A 38C P0 37W / 250W | 3187MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
I am wondering why there aren't more comments in this thread. Is there a workaround?
@scriptator maybe you can try building the image that I have here https://github.com/huggingface/text-generation-inference/pull/2848 and see if that works for you
I can confirm that this change works for me. Thx a lot!
@scriptator it is good to have confirmation that this works! In the PR I'm still trying to figure out the best way to do it without changing the base image to devel, because that increases the size of the final image, but I couldn't get it to work so far.
Good point - the image I built with your pull request 2 hours ago is 20 GB, which is quite a lot compared to the 12.8 GB of the official image.
Minimal reproduction command: podman run --device nvidia.com/gpu=0 --rm -it --entrypoint python ghcr.io/huggingface/text-generation-inference -c "from torch.utils._triton import triton_backend; triton_backend()"
It's starting to look like a podman bug... I cannot reproduce with the minimal reproducer....
Which version of podman are you using? (5.3.1 here)
I can't reproduce any of the issues even with podman on my end...
What are the host configs? GPU, CUDA version, driver, any potentially relevant services on the nodes, etc.? It seems very odd that Triton wants to recompile something that low-level.
Getting some output for this: https://github.com/huggingface/text-generation-inference/pull/2848#issuecomment-2612822913 would be helpful in understanding the issue!
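To get the actual compiler diagnostics out in the open (they don't show up in the launcher logs above), one option is to open a shell in the same image and re-run an equivalent compile by hand. This is only a sketch that reuses the include/library paths from the failing gcc command above; adjust the --device flag and image tag to your setup:

podman run --rm -it --device nvidia.com/gpu=0 --entrypoint bash \
  ghcr.io/huggingface/text-generation-inference:3.0.1

# then, inside the container:
printf '#include <Python.h>\n#include <cuda.h>\n' > /tmp/probe.c
gcc /tmp/probe.c -O3 -shared -fPIC -o /tmp/probe.so -lcuda \
  -L/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/lib -L/usr/lib64 \
  -I/opt/conda/lib/python3.11/site-packages/triton/backends/nvidia/include \
  -I/opt/conda/include/python3.11
# gcc's own message (missing header vs. missing libcuda) should point at the real culprit.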
In our (= @scriptator's) case, the problem has disappeared. We cannot be sure of the cause, but maybe my notes are of help to someone:
The server where the problem occurred was running RHEL 9.5, but with a kernel from RHEL 9.3, which was required due to an issue with another inference framework. We could finally upgrade to a current kernel last week, and since then this issue does not occur for us anymore. However, at the same time, some Nvidia libraries were also upgraded.
I don't know whether upgrading the kernel or the Nvidia libraries (or just the subsequent reboot) fixed the issue for us.
We are running Rocky 8.7; I will confirm whether the upgrade solves the issue once we update the kernel.