It seems that the `gpt_oss` model type is not supported by the latest HuggingFace runtime image
/kind bug
What steps did you take and what happened:
I tried to run the https://huggingface.co/openai/gpt-oss-20b model using the HuggingFace runtime (kserve/huggingfaceserver:latest-gpu and kserve/huggingfaceserver:v0.15.2-gpu) and received the following exception inside the runtime container:
INFO 08-21 08:13:19 [__init__.py:244] Automatically detected platform cuda.
Traceback (most recent call last):
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/transformers/models/auto/configuration_auto.py", line 1131, in from_pretrained
config_class = CONFIG_MAPPING[config_dict["model_type"]]
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/transformers/models/auto/configuration_auto.py", line 833, in __getitem__
raise KeyError(key)
KeyError: 'gpt_oss'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/kserve-workspace/huggingfaceserver/huggingfaceserver/__main__.py", line 162, in <module>
if is_vllm_backend_enabled(initial_args, model_id_or_path):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/kserve-workspace/huggingfaceserver/huggingfaceserver/__main__.py", line 64, in is_vllm_backend_enabled
and infer_vllm_supported_from_model_architecture(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/kserve-workspace/huggingfaceserver/huggingfaceserver/vllm/utils.py", line 50, in infer_vllm_supported_from_model_architecture
model_config = AutoConfig.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/transformers/models/auto/configuration_auto.py", line 1133, in from_pretrained
raise ValueError(
ValueError: The checkpoint you are trying to load has model type `gpt_oss` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`
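For anyone reproducing this, here is a quick diagnostic sketch to confirm that the image's transformers simply does not know this model type (my own check, not part of the runtime; it queries the same CONFIG_MAPPING the traceback points at):
# run inside the runtime container
import transformers
from transformers.models.auto.configuration_auto import CONFIG_MAPPING

print(transformers.__version__)
print("gpt_oss" in CONFIG_MAPPING)  # False on the transformers version shipped in the image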
What did you expect to happen: The model running with no errors inside the isvc's pod containers, as it did with https://huggingface.co/Qwen/Qwen2-0.5B, for example
What's the InferenceService yaml:
This is the version where I tried to use latest-gpu instead of v0.15.2-gpu (as before, but with the same result):
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
finalizers:
- inferenceservice.finalizers
name: gpt-oss-20b
namespace: gpt-oss
spec:
predictor:
annotations:
serving.knative.dev/progress-deadline: 1740s
model:
modelFormat:
name: huggingface
name: ""
protocolVersion: v1
resources:
limits:
cpu: "16"
memory: "85899345920"
nvidia.com/gpu: "1"
requests:
cpu: "16"
memory: "85899345920"
nvidia.com/gpu: "1"
runtime: invu-huggingfaceserver
runtimeVersion: latest-gpu
storageUri: s3://cotype/6f3a1019-4e9e-46e3-b70e-d835e93a426f.zip
serviceAccountName: gpt-oss-svc-acc
status:
components:
predictor:
latestCreatedRevision: gpt-oss-20b-predictor-00001
conditions:
- lastTransitionTime: "2025-08-20T19:23:53Z"
reason: PredictorConfigurationReady not ready
severity: Info
status: Unknown
type: LatestDeploymentReady
- lastTransitionTime: "2025-08-20T19:23:53Z"
severity: Info
status: Unknown
type: PredictorConfigurationReady
- lastTransitionTime: "2025-08-20T19:23:53Z"
message: Configuration "gpt-oss-20b-predictor" is waiting for a Revision to become
ready.
reason: RevisionMissing
status: Unknown
type: PredictorReady
- lastTransitionTime: "2025-08-20T19:23:53Z"
message: Configuration "gpt-oss-20b-predictor" is waiting for a Revision to become
ready.
reason: RevisionMissing
severity: Info
status: Unknown
type: PredictorRouteReady
- lastTransitionTime: "2025-08-20T19:23:53Z"
message: Configuration "gpt-oss-20b-predictor" is waiting for a Revision to become
ready.
reason: RevisionMissing
status: Unknown
type: Ready
- lastTransitionTime: "2025-08-20T19:23:53Z"
reason: PredictorRouteReady not ready
severity: Info
status: Unknown
type: RoutesReady
modelStatus:
states:
activeModelState: ""
targetModelState: Pending
transitionStatus: InProgress
observedGeneration: 1
Anything else you would like to add:
You can see the custom ClusterServingRuntime invu-huggingfaceserver below; it is just a copy of kserve-huggingfaceserver with the image tag explicitly set to v0.15.2-gpu to pin a specific version instead of latest-gpu
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
name: invu-huggingfaceserver
spec:
annotations:
prometheus.kserve.io/path: /metrics
prometheus.kserve.io/port: "8080"
containers:
- args:
- --model_name={{.Name}}
- --model_dir=/mnt/models
image: kserve/huggingfaceserver:v0.15.2-gpu
name: kserve-container
resources:
limits:
cpu: "1"
memory: 2Gi
requests:
cpu: "1"
memory: 2Gi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
privileged: false
runAsNonRoot: true
protocolVersions:
- v2
- v1
supportedModelFormats:
- autoSelect: false
name: huggingface
version: "1"
Environment:
- Istio Version: 1.25.2
- Knative Version: v1.18.0
- KServe Version: v0.15.0
- Kubeflow version: -
- Cloud Environment: -
- Minikube/Kind version: -
- Kubernetes version (use `kubectl version`): Server Version: v1.32.4
- OS (e.g. from /etc/os-release): Ubuntu 22.04.5 LTS
I also tried to specify the --backend arg:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
finalizers:
- inferenceservice.finalizers
name: gpt-oss-20b
namespace: gpt-oss
spec:
predictor:
annotations:
serving.knative.dev/progress-deadline: 1740s
model:
args:
- --backend=huggingface
modelFormat:
name: huggingface
name: ""
protocolVersion: v1
resources:
limits:
cpu: "16"
memory: "85899345920"
nvidia.com/gpu: "0"
requests:
cpu: "16"
memory: "85899345920"
nvidia.com/gpu: "0"
runtime: invu-huggingfaceserver
runtimeVersion: latest-gpu
storageUri: s3://cotype/3bd25443-bb89-46e3-b7f6-d90f07043043.zip
serviceAccountName: gpt-oss-svc-acc
status:
components:
predictor:
latestCreatedRevision: gpt-oss-20b-predictor-00001
conditions:
- lastTransitionTime: "2025-08-21T08:25:02Z"
reason: PredictorConfigurationReady not ready
severity: Info
status: Unknown
type: LatestDeploymentReady
- lastTransitionTime: "2025-08-21T08:25:02Z"
severity: Info
status: Unknown
type: PredictorConfigurationReady
- lastTransitionTime: "2025-08-21T08:25:02Z"
message: Configuration "gpt-oss-20b-predictor" is waiting for a Revision to become
ready.
reason: RevisionMissing
status: Unknown
type: PredictorReady
- lastTransitionTime: "2025-08-21T08:25:02Z"
message: Configuration "gpt-oss-20b-predictor" is waiting for a Revision to become
ready.
reason: RevisionMissing
severity: Info
status: Unknown
type: PredictorRouteReady
- lastTransitionTime: "2025-08-21T08:25:02Z"
message: Configuration "gpt-oss-20b-predictor" is waiting for a Revision to become
ready.
reason: RevisionMissing
status: Unknown
type: Ready
- lastTransitionTime: "2025-08-21T08:25:02Z"
reason: PredictorRouteReady not ready
severity: Info
status: Unknown
type: RoutesReady
modelStatus:
states:
activeModelState: ""
targetModelState: Pending
transitionStatus: InProgress
observedGeneration: 1
But received the same error:
INFO 08-21 08:29:01 [__init__.py:244] Automatically detected platform cuda.
2025-08-21 08:29:03.438 1 kserve INFO [storage.py:download():64] Copying contents of /mnt/models to local
2025-08-21 08:29:03.438 1 kserve INFO [storage.py:download():110] Successfully copied /mnt/models to None
2025-08-21 08:29:03.438 1 kserve INFO [storage.py:download():111] Model downloaded in 0.00044586299918591976 seconds.
2025-08-21 08:29:03.439 1 kserve ERROR [__main__.py:<module>():324] Failed to start model server: The checkpoint you are trying to load has model type `gpt_oss` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`
Traceback (most recent call last):
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/transformers/models/auto/configuration_auto.py", line 1131, in from_pretrained
config_class = CONFIG_MAPPING[config_dict["model_type"]]
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/transformers/models/auto/configuration_auto.py", line 833, in __getitem__
raise KeyError(key)
KeyError: 'gpt_oss'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/kserve-workspace/huggingfaceserver/huggingfaceserver/__main__.py", line 314, in <module>
model = load_model()
^^^^^^^^^^^^
File "/kserve-workspace/huggingfaceserver/huggingfaceserver/__main__.py", line 234, in load_model
model_config = AutoConfig.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/transformers/models/auto/configuration_auto.py", line 1133, in from_pretrained
raise ValueError(
ValueError: The checkpoint you are trying to load has model type `gpt_oss` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`
To resolve the issue, I tried to upgrade the transformers and vllm Python packages inside the HuggingFace runtime image.
Here is my Dockerfile:
FROM kserve/huggingfaceserver:latest-gpu
RUN pip install --upgrade transformers vllm
docker build -t 78945789345654/hf-gpu-fix:new-transformers-new-vllm .
docker run -it --rm --entrypoint=pip 78945789345654/hf-gpu-fix:new-transformers-new-vllm freeze | grep transformers
> transformers==4.55.2
docker run -it --rm --entrypoint=pip 78945789345654/hf-gpu-fix:new-transformers-new-vllm freeze | grep vllm
> vllm==0.10.1
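For reproducibility it may be preferable to pin the versions that resolved above instead of a floating --upgrade; a sketch, assuming these pins stay compatible with the rest of the image's dependencies:
FROM kserve/huggingfaceserver:latest-gpu
# pin to the versions observed in the pip freeze output above
RUN pip install --no-cache-dir transformers==4.55.2 vllm==0.10.1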
And when I started my model locally, the error was gone:
docker run \
--tty \
--rm \
--volume $PWD/gpt-oss-20b:/mnt/models \
78945789345654/hf-gpu-fix:new-transformers-new-vllm \
--model_name=gpt-oss-20b \
--model_dir=/mnt/models
Compared with the original image, where the error is still present:
docker run \
--tty \
--rm \
--volume $PWD/gpt-oss-20b:/mnt/models \
kserve/huggingfaceserver:latest-gpu \
--model_name=gpt-oss-20b \
--model_dir=/mnt/models
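(Side note: on a CUDA-capable host the container also needs GPU access, e.g. via --gpus all with the NVIDIA container toolkit installed; a sketch of the same run with that flag added:)
docker run \
  --tty \
  --rm \
  --gpus all \
  --volume $PWD/gpt-oss-20b:/mnt/models \
  kserve/huggingfaceserver:latest-gpu \
  --model_name=gpt-oss-20b \
  --model_dir=/mnt/models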
So I decided to push this image and use it inside the k8s cluster:
ClusterServingRuntime:
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
name: invu-huggingfaceserver
spec:
annotations:
prometheus.kserve.io/path: /metrics
prometheus.kserve.io/port: "8080"
containers:
- args:
- --model_name={{.Name}}
- --model_dir=/mnt/models
image: 78945789345654/hf-gpu-fix:latest-gpu # <--- Here is my image
name: kserve-container
resources:
limits:
cpu: "1"
memory: 2Gi
requests:
cpu: "1"
memory: 2Gi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
privileged: false
runAsNonRoot: true
protocolVersions:
- v2
- v1
supportedModelFormats:
- autoSelect: false
name: huggingface
version: "1"
ISVC:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
finalizers:
- inferenceservice.finalizers
name: gpt-oss-20b
namespace: gpt-oss
spec:
predictor:
annotations:
serving.knative.dev/progress-deadline: 1740s
model:
modelFormat:
name: huggingface
name: ""
protocolVersion: v1
resources:
limits:
cpu: "16"
memory: "85899345920"
nvidia.com/gpu: "1"
requests:
cpu: "16"
memory: "85899345920"
nvidia.com/gpu: "1"
runtime: invu-huggingfaceserver
runtimeVersion: latest-gpu
storageUri: s3://cotype/3bd25443-bb89-46e3-b7f6-d90f07043043.zip
serviceAccountName: gpt-oss-svc-acc
And got a new exception:
INFO 08-21 09:08:23 [__init__.py:241] Automatically detected platform cuda.
WARNING 08-21 09:08:24 [__init__.py:1734] argument 'disable_log_requests' is deprecated
2025-08-21 09:08:24.321 1 kserve INFO [storage.py:download():64] Copying contents of /mnt/models to local
2025-08-21 09:08:24.321 1 kserve INFO [storage.py:download():110] Successfully copied /mnt/models to None
2025-08-21 09:08:24.321 1 kserve INFO [storage.py:download():111] Model downloaded in 0.0001523587852716446 seconds.
2025-08-21 09:08:24.352 1 kserve INFO [model_server.py:register_model():406] Registering model: gpt-oss-20b
2025-08-21 09:08:24.353 1 kserve INFO [model_server.py:setup_event_loop():286] Setting max asyncio worker threads as 32
2025-08-21 09:08:24.379 1 kserve INFO [server.py:_register_endpoints():108] OpenAI endpoints registered
2025-08-21 09:08:24.379 1 kserve INFO [server.py:start():161] Starting uvicorn with 1 workers
2025-08-21 09:08:24.429 1 uvicorn.error INFO: Started server process [1]
2025-08-21 09:08:24.429 1 uvicorn.error INFO: Waiting for application startup.
2025-08-21 09:08:24.431 1 kserve INFO [server.py:start():70] Starting gRPC server with 4 workers
2025-08-21 09:08:24.432 1 kserve INFO [server.py:start():71] Starting gRPC server on [::]:8081
2025-08-21 09:08:24.432 1 uvicorn.error INFO: Application startup complete.
2025-08-21 09:08:24.439 1 uvicorn.error ERROR: Traceback (most recent call last):
File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/base_events.py", line 691, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/kserve/model_server.py", line 301, in _serve
await asyncio.gather(*self.servers)
File "/kserve-workspace/huggingfaceserver/huggingfaceserver/vllm/vllm_model.py", line 101, in start_engine
self.args.enable_reasoning
AttributeError: 'Namespace' object has no attribute 'enable_reasoning'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/starlette/routing.py", line 699, in lifespan
await receive()
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/uvicorn/lifespan/on.py", line 137, in receive
return await self.receive_queue.get()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/queues.py", line 158, in get
await getter
asyncio.exceptions.CancelledError
2025-08-21 09:08:24.439 1 kserve ERROR [__main__.py:<module>():324] Failed to start model server: 'Namespace' object has no attribute 'enable_reasoning'
Traceback (most recent call last):
File "/kserve-workspace/huggingfaceserver/huggingfaceserver/__main__.py", line 320, in <module>
model_server.start([model])
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/kserve/model_server.py", line 344, in start
asyncio.run(self._serve())
File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/asyncio/base_events.py", line 691, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/kserve/model_server.py", line 301, in _serve
await asyncio.gather(*self.servers)
File "/kserve-workspace/huggingfaceserver/huggingfaceserver/vllm/vllm_model.py", line 101, in start_engine
self.args.enable_reasoning
AttributeError: 'Namespace' object has no attribute 'enable_reasoning'
Error in sys.excepthook:
Original exception was:
So I went deeper and tried to change the runtime source code to respect the new vllm package requirements, since --enable-reasoning became deprecated and can no longer be parsed from args, and prompt_adapters became stale, as far as I can see.
Here is my new image:
FROM kserve/huggingfaceserver:latest-gpu
RUN pip install --upgrade transformers vllm
COPY ./patched_vllm_model.py /kserve-workspace/huggingfaceserver/huggingfaceserver/vllm/vllm_model.py
You can see patched_vllm_model.py here https://github.com/kserve/kserve/pull/4659/files#diff-7df5106c0f3e1672869e494b083850402bacf0f43e48e960f9935f2d93accdfe
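To give a flavor of the change (a hypothetical sketch only; the real diff is in the PR above and also removes the stale prompt_adapters handling), the idea is to stop assuming that removed CLI attributes still exist on the parsed Namespace:
from argparse import Namespace

def resolve_reasoning_args(args: Namespace):
    # Hypothetical helper: newer vLLM releases dropped --enable-reasoning from the
    # CLI parser, so read such flags defensively instead of assuming they exist.
    enable_reasoning = getattr(args, "enable_reasoning", False)
    reasoning_parser = getattr(args, "reasoning_parser", None)
    return enable_reasoning, reasoning_parser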
And after launching this image I got a working runtime:
kserve-container INFO 08-21 11:10:51 [__init__.py:241] Automatically detected platform cuda.
kserve-container WARNING 08-21 11:10:52 [__init__.py:1734] argument 'disable_log_requests' is deprecated
kserve-container 2025-08-21 11:10:52.469 1 kserve INFO [storage.py:download():64] Copying contents of /mnt/models to local
kserve-container 2025-08-21 11:10:52.469 1 kserve INFO [storage.py:download():110] Successfully copied /mnt/models to None
kserve-container 2025-08-21 11:10:52.469 1 kserve INFO [storage.py:download():111] Model downloaded in 0.00015321210958063602 seconds.
kserve-container 2025-08-21 11:10:52.501 1 kserve INFO [model_server.py:register_model():406] Registering model: gpt-oss-20b
kserve-container 2025-08-21 11:10:52.502 1 kserve INFO [model_server.py:setup_event_loop():286] Setting max asyncio worker threads as 32
kserve-container INFO 08-21 11:10:57 [__init__.py:711] Resolved architecture: GptOssForCausalLM
kserve-container ERROR 08-21 11:10:57 [config.py:130] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/models'. Use `repo_type` argument if needed., retrying 1 of 2
kserve-container ERROR 08-21 11:10:59 [config.py:128] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/models'. Use `repo_type` argument if needed.
kserve-container INFO 08-21 11:10:59 [__init__.py:2816] Downcasting torch.float32 to torch.bfloat16.
kserve-container INFO 08-21 11:10:59 [__init__.py:1750] Using max model len 131072
kserve-container WARNING 08-21 11:11:00 [__init__.py:1171] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
kserve-container INFO 08-21 11:11:01 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
kserve-container INFO 08-21 11:11:01 [config.py:273] Overriding max cuda graph capture size to 1024 for performance.
kserve-container INFO 08-21 11:11:05 [__init__.py:241] Automatically detected platform cuda.
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:07 [core.py:636] Waiting for init message from front-end.
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:07 [core.py:74] Initializing a V1 LLM engine (v0.10.1) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend='GptOss'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/mnt/models, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[1024,1008,992,976,960,944,928,912,896,880,864,848,832,816,800,784,768,752,736,720,704,688,672,656,640,624,608,592,576,560,544,528,512,496,480,464,448,432,416,400,384,368,352,336,320,304,288,272,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":1024,"local_cache_dir":null}
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:08 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:08 [topk_topp_sampler.py:50] Using FlashInfer for top-p & top-k sampling.
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:08 [gpu_model_runner.py:1953] Starting to load model /mnt/models...
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:08 [gpu_model_runner.py:1985] Loading model from scratch...
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:08 [cuda.py:323] Using Triton backend on V1 engine.
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:08 [triton_attn.py:257] Using vllm unified attention for TritonAttentionImpl
kserve-container (EngineCore_0 pid=144) Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
kserve-container (EngineCore_0 pid=144) Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:03<00:07, 3.74s/it]
kserve-container (EngineCore_0 pid=144) Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:08<00:04, 4.50s/it]
kserve-container (EngineCore_0 pid=144) Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:13<00:00, 4.70s/it]
kserve-container (EngineCore_0 pid=144) Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:13<00:00, 4.57s/it]
kserve-container (EngineCore_0 pid=144)
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:22 [default_loader.py:262] Loading weights took 13.78 seconds
kserve-container (EngineCore_0 pid=144) WARNING 08-21 11:11:22 [marlin_utils_fp4.py:196] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leverag
ing the Marlin kernel. This may degrade performance for compute-heavy workloads.
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:23 [gpu_model_runner.py:2007] Model loading took 13.7194 GiB and 14.309837 seconds
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:28 [backends.py:548] Using cache directory: /home/kserve/.cache/vllm/torch_compile_cache/3a3580bf2a/rank_0_0/backbone for vLLM's torch.compile
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:28 [backends.py:559] Dynamo bytecode transform time: 5.36 s
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:31 [backends.py:194] Cache the graph for dynamic shape for later use
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:59 [backends.py:215] Compiling a graph for dynamic shape takes 29.77 s
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:12:00 [marlin_utils.py:353] You are running Marlin kernel with bf16 on GPUs before SM90. You can consider change to fp16 to achieve better performance if possible.
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:12:02 [monitor.py:34] torch.compile takes 35.13 s in total
kserve-container (EngineCore_0 pid=144) /kserve-workspace/prod_venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilat
ion.
kserve-container (EngineCore_0 pid=144) If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
kserve-container (EngineCore_0 pid=144) warnings.warn(
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:12:02 [gpu_worker.py:276] Available KV cache memory: 56.90 GiB
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:12:03 [kv_cache_utils.py:1013] GPU KV cache size: 1,242,912 tokens
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:12:03 [kv_cache_utils.py:1017] Maximum concurrency for 131,072 tokens per request: 18.65x
kserve-container (EngineCore_0 pid=144) Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 0% | 0/83 [00:00<?, ?it/s]
kserve-container (EngineCore_0 pid=144) Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100% | 83/83 [00:04<00:00, 20.37it/s]
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:12:07 [gpu_model_runner.py:2708] Graph capturing finished in 4 secs, took 0.79 GiB
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:12:07 [core.py:214] init engine (profile, create kv cache, warmup model) took 44.21 seconds
kserve-container INFO 08-21 11:12:09 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 155365
kserve-container 2025-08-21 11:12:11.360 1 kserve INFO [utils.py:build_async_engine_client_from_engine_args():109] V1 AsyncLLM build complete
kserve-container 2025-08-21 11:12:11.387 1 kserve INFO [server.py:_register_endpoints():108] OpenAI endpoints registered
kserve-container 2025-08-21 11:12:11.387 1 kserve INFO [server.py:start():161] Starting uvicorn with 1 workers
kserve-container 2025-08-21 11:12:11.438 1 uvicorn.error INFO: Started server process [1]
kserve-container 2025-08-21 11:12:11.438 1 uvicorn.error INFO: Waiting for application startup.
kserve-container 2025-08-21 11:12:11.439 1 kserve INFO [server.py:start():70] Starting gRPC server with 4 workers
kserve-container 2025-08-21 11:12:11.439 1 kserve INFO [server.py:start():71] Starting gRPC server on [::]:8081
kserve-container 2025-08-21 11:12:11.439 1 uvicorn.error INFO: Application startup complete.
kserve-container 2025-08-21 11:12:11.440 1 uvicorn.error INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
kserve-container 2025-08-21 11:12:22.719 uvicorn.access INFO: 192.168.105.3:0 1 - "GET / HTTP/1.1" 200 OK
Here is the final ISVC:
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
finalizers:
- inferenceservice.finalizers
name: gpt-oss-20b
namespace: gpt-oss
spec:
predictor:
annotations:
serving.knative.dev/progress-deadline: 1740s
model:
modelFormat:
name: huggingface
name: ""
protocolVersion: v1
resources:
limits:
cpu: "16"
memory: "85899345920"
nvidia.com/gpu: "1"
requests:
cpu: "16"
memory: "85899345920"
nvidia.com/gpu: "1"
runtime: invu-huggingfaceserver
runtimeVersion: latest-gpu
storageUri: s3://cotype/3bd25443-bb89-46e3-b7f6-d90f07043043.zip
serviceAccountName: gpt-oss-svc-acc
status:
address:
url: http://gpt-oss-20b.gpt-oss.svc.cluster.local
components:
predictor:
address:
url: http://gpt-oss-20b-predictor.gpt-oss.svc.cluster.local
latestCreatedRevision: gpt-oss-20b-predictor-00001
latestReadyRevision: gpt-oss-20b-predictor-00001
latestRolledoutRevision: gpt-oss-20b-predictor-00001
traffic:
- latestRevision: true
percent: 100
revisionName: gpt-oss-20b-predictor-00001
url: <SECRET>
And runtime spec:
apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
annotations:
meta.helm.sh/release-name: inferencevalve-stack
meta.helm.sh/release-namespace: inferencevalve-stack
creationTimestamp: "2025-08-20T21:26:52Z"
generation: 2
labels:
app.kubernetes.io/instance: inferencevalve-stack
app.kubernetes.io/managed-by: Helm
app.kubernetes.io/name: inferencevalve-stack
app.kubernetes.io/version: ""
helm.sh/chart: inferencevalve-stack-0.3.1
name: invu-huggingfaceserver
resourceVersion: "37047071"
uid: e767e576-8702-4724-b9d5-15b17fbf7cd1
spec:
annotations:
prometheus.kserve.io/path: /metrics
prometheus.kserve.io/port: "8080"
containers:
- args:
- --model_name={{.Name}}
- --model_dir=/mnt/models
image: 78945789345654/hf-gpu-fix:latest-gpu
name: kserve-container
resources:
limits:
cpu: "1"
memory: 2Gi
requests:
cpu: "1"
memory: 2Gi
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
privileged: false
runAsNonRoot: true
protocolVersions:
- v2
- v1
supportedModelFormats:
- autoSelect: false
name: huggingface
version: "1"
Where 78945789345654/hf-gpu-fix:latest-gpu is https://hub.docker.com/repository/docker/78945789345654/hf-gpu-fix/tags/latest-gpu/sha256:b8c296f0e220a8d01a143c5d14ba33d0fc80375080673efea9657ee0bdc3a280
@kittywaresz I ran into this issue when trying to run gpt-oss-20b on kserve, so I tried to build the custom image the way you did but I see this in the pod logs:
kserve-container INFO 09-02 06:44:03 [__init__.py:245] No platform detected, vLLM is running on UnspecifiedPlatform
kserve-container WARNING 09-02 06:44:03 [_custom_ops.py:20] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
kserve-container Traceback (most recent call last):
kserve-container File "<frozen runpy>", line 198, in _run_module_as_main
kserve-container File "<frozen runpy>", line 88, in _run_code
kserve-container File "/kserve-workspace/huggingfaceserver/huggingfaceserver/__main__.py", line 164, in <module>
kserve-container parser = maybe_add_vllm_cli_parser(parser)
kserve-container ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
kserve-container File "/kserve-workspace/huggingfaceserver/huggingfaceserver/vllm/utils.py", line 63, in maybe_add_vllm_cli_parser
kserve-container return make_arg_parser(parser)
kserve-container ^^^^^^^^^^^^^^^^^^^^^^^
kserve-container File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/entrypoints/openai/cli_args.py", line 258, in make_arg_parser
kserve-container parser = AsyncEngineArgs.add_cli_args(parser)
kserve-container ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
kserve-container File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1747, in add_cli_args
kserve-container parser = EngineArgs.add_cli_args(parser)
kserve-container ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
kserve-container File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 835, in add_cli_args
kserve-container vllm_kwargs = get_kwargs(VllmConfig)
kserve-container ^^^^^^^^^^^^^^^^^^^^^^
kserve-container File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 261, in get_kwargs
kserve-container return copy.deepcopy(_compute_kwargs(cls))
kserve-container ^^^^^^^^^^^^^^^^^^^^
kserve-container File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 171, in _compute_kwargs
kserve-container default = field.default_factory()
kserve-container ^^^^^^^^^^^^^^^^^^^^^^^
kserve-container File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/pydantic/_internal/_dataclasses.py", line 123, in __init__
kserve-container s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
kserve-container File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/config/__init__.py", line 1882, in __post_init__
kserve-container raise RuntimeError(
kserve-container RuntimeError: Failed to infer device type, please set the environment variable `VLLM_LOGGING_LEVEL=DEBUG` to turn on verbose logging to help debug the issue.
Wondering if you also encountered this at one point. I'm running this on nodes with NVIDIA L4 GPUs and I've successfully served other models using the default image.
@marcelovilla it seems that your pod started on a node with no drivers available:
kserve-container INFO 09-02 06:44:03 [__init__.py:245] No platform detected, vLLM is running on UnspecifiedPlatform
kserve-container WARNING 09-02 06:44:03 [_custom_ops.py:20] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
I've received such a message only when I tried to run the huggingface runtime on my local machine with no CUDA drivers.
As you can see from my logs, in the k8s pod it was always:
INFO 08-21 08:29:01 [__init__.py:244] Automatically detected platform cuda.
When you serve the default huggingface runtime image, do you receive the 'No platform detected' warning? I'm not sure whether upgrading the transformers and vllm packages in my image changed the GPU inspection behavior.
I would be happy to assist if you could provide more details about your GPU node setup (NVIDIA Device Plugin or NVIDIA GPU Operator versions) and the isvc manifest.
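For example, something like the following would show whether the node your pod landed on actually advertises GPUs (a sketch; substitute your node name):
kubectl describe node <node-name> | grep -i 'nvidia.com/gpu'
kubectl get pods -n gpt-oss -o wide   # which node did the predictor pod land on?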
@kittywaresz thanks for the help! It seemed to be a one-time thing, as redeploying and having a new node spawn got me past that error. It failed with the following, though:
kserve-container INFO 09-02 15:20:11 [__init__.py:241] Automatically detected platform cuda.
kserve-container WARNING 09-02 15:20:13 [__init__.py:1734] argument 'disable_log_requests' is deprecated
kserve-container 2025-09-02 15:20:13.246 1 kserve INFO [model_server.py:register_model():402] Registering model: gpt-oss-20b
kserve-container 2025-09-02 15:20:13.247 1 kserve INFO [model_server.py:setup_event_loop():282] Setting max asyncio worker threads as 12
kserve-container INFO 09-02 15:20:21 [__init__.py:711] Resolved architecture: GptOssForCausalLM
kserve-container torch_dtype is deprecated! Use dtype instead!
Parse safetensors files: 0% | 0/3 [00:00<?, ?it/s]
Parse safetensors files: 33% | 1/3 [00:00<00:00, 3.71it/s]
Parse safetensors files: 100% | 3/3 [00:00<00:00, 11.12it/s]
kserve-container INFO 09-02 15:20:22 [__init__.py:1750] Using max model len 131072
kserve-container WARNING 09-02 15:20:23 [__init__.py:1171] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
kserve-container INFO 09-02 15:20:24 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
kserve-container INFO 09-02 15:20:24 [config.py:273] Overriding max cuda graph capture size to 1024 for performance.
kserve-container INFO 09-02 15:20:32 [__init__.py:241] Automatically detected platform cuda.
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:34 [core.py:636] Waiting for init message from front-end.
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:34 [core.py:74] Initializing a V1 LLM engine (v0.10.1.1) with config:
model='openai/gpt-oss-20b',
speculative_config=None,
tokenizer='openai/gpt-oss-20b',
skip_tokenizer_init=False,
tokenizer_mode=auto,
revision=None,
override_neuron_config={},
tokenizer_revision=None,
trust_remote_code=False,
dtype=torch.bfloat16,
max_seq_len=131072,
download_dir='/mnt/models',
load_format=auto,
tensor_parallel_size=1,
pipeline_parallel_size=1,
disable_custom_all_reduce=False,
quantization=mxfp4,
enforce_eager=False,
kv_cache_dtype=auto,
device_config=cuda,
decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend='GptOss'),
observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None),
seed=0,
served_model_name=openai/gpt-oss-20b,
enable_prefix_caching=True,
chunked_prefill_enabled=True,
use_async_output_proc=True,
pooler_config=None,
compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[1024,1008,992,976,960,944,928,912,896,880,864,848,832,816,800,784,768,752,736,720,704,688,672,656,640,624,608,592,576,560,544,528,512,496,480,464,448,432,416,400,384,368,352,336,320,304,288,272,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":1024,"local_cache_dir":null}
kserve-container (EngineCore_0 pid=58) 2025-09-02 15:20:35,542 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:36 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:36 [topk_topp_sampler.py:50] Using FlashInfer for top-p & top-k sampling.
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:36 [gpu_model_runner.py:1953] Starting to load model openai/gpt-oss-20b...
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:37 [gpu_model_runner.py:1985] Loading model from scratch...
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:37 [cuda.py:323] Using Triton backend on V1 engine.
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:37 [triton_attn.py:257] Using vllm unified attention for TritonAttentionImpl
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:37 [weight_utils.py:296] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:
0% Completed | 0/3 [00:00<?, ?it/s]
33% Completed | 1/3 [00:01<00:02, 1.39s/it]
67% Completed | 2/3 [00:03<00:01, 1.58s/it]
100% Completed | 3/3 [00:04<00:00, 1.53s/it]
100% Completed | 3/3 [00:04<00:00, 1.53s/it]
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:42 [default_loader.py:262] Loading weights took 4.85 seconds
kserve-container (EngineCore_0 pid=58) WARNING 09-02 15:20:42 [marlin_utils_fp4.py:196] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:43 [gpu_model_runner.py:2007] Model loading took 13.7164 GiB and 6.253092 seconds
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:51 [backends.py:548] Using cache directory: /home/kserve/.cache/vllm/torch_compile_cache/bb6a9d19b5/rank_0_0/backbone for vLLM's torch.compile
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:51 [backends.py:559] Dynamo bytecode transform time: 7.64 s
kserve-container (EngineCore_0 pid=58) [rank0]:W0902 15:20:53.435000 58 prod_venv/lib/python3.12/site-packages/torch/_inductor/utils.py:1250] [0/0] Not enough SMs to use max_autotune_gemm mode
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:56 [backends.py:194] Cache the graph for dynamic shape for later use
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:21:37 [backends.py:215] Compiling a graph for dynamic shape takes 45.51 s
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:21:39 [marlin_utils.py:353] You are running Marlin kernel with bf16 on GPUs before SM90. You can consider change to fp16 to achieve better performance if possible.
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:21:50 [monitor.py:34] torch.compile takes 53.14 s in total
kserve-container (EngineCore_0 pid=58) Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 704, in run_engine_core
raise e
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 691, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 492, in __init__
super().__init__(vllm_config, executor_class, log_stats,
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 89, in __init__
self._initialize_kv_caches(vllm_config)
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 179, in _initialize_kv_caches
self.model_executor.determine_available_memory())
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
output = self.collective_rpc("determine_available_memory")
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
answer = run_method(self.driver_worker, method, args, kwargs)
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/utils/__init__.py", line 3007, in run_method
return func(*args, **kwargs)
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 244, in determine_available_memory
self.model_runner.profile_run()
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2627, in profile_run
output = self._dummy_sampler_run(last_hidden_states)
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2470, in _dummy_sampler_run
raise e
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2460, in _dummy_sampler_run
sampler_output = self.sampler(logits=logits,
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/sample/sampler.py", line 68, in forward
sampled = self.sample(logits, sampling_metadata)
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/sample/sampler.py", line 135, in sample
random_sampled = self.topk_topp_sampler(
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
return forward_call(*args, **kwargs)
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/sample/ops/topk_topp_sampler.py", line 109, in forward_cuda
return flashinfer_sample(logits.contiguous(), k, p, generators)
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/sample/ops/topk_topp_sampler.py", line 295, in flashinfer_sample
next_token_ids = flashinfer.sampling.top_k_top_p_sampling_from_logits(
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/flashinfer/sampling.py", line 806, in top_k_top_p_sampling_from_logits
masked_logits = top_k_mask_logits(logits, top_k)
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/flashinfer/sampling.py", line 1126, in top_k_mask_logits
return get_sampling_module().top_k_mask_logits(
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/flashinfer/sampling.py", line 36, in get_sampling_module
module = load_cuda_ops(
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/flashinfer/jit/core.py", line 135, in load_cuda_ops
torch_cpp_ext.load(
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1623, in load
return _jit_compile(
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2076, in _jit_compile
_write_ninja_file_and_build_library(
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2222, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2522, in _run_ninja_build
raise RuntimeError(message) from e
kserve-container (EngineCore_0 pid=58) RuntimeError: Error building extension 'sampling':
[1/4] /usr/local/cuda/bin/nvcc ... -c /.../sampling.cu -o sampling.cuda.o
FAILED: sampling.cuda.o
/bin/sh: 1: /usr/local/cuda/bin/nvcc: not found
[2/4] /usr/local/cuda/bin/nvcc ... -c /.../renorm.cu -o renorm.cuda.o
FAILED: renorm.cuda.o
/bin/sh: 1: /usr/local/cuda/bin/nvcc: not found
[3/4] /usr/local/cuda/bin/nvcc ... -c /.../flashinfer_sampling_ops.cu -o flashinfer_sampling_ops.cuda.o
FAILED: flashinfer_sampling_ops.cuda.o
/bin/sh: 1: /usr/local/cuda/bin/nvcc: not found
ninja: build stopped: subcommand failed.
kserve-container [rank0]:[W902 15:21:52.625339019 ProcessGroupNCCL.cpp:1479] Warning: destroy_process_group() was not called before program exit...
kserve-container 2025-09-02 15:21:53.447 1 kserve INFO [utils.py:build_async_engine_client_from_engine_args():109] V1 AsyncLLM build complete
kserve-container 2025-09-02 15:21:53.490 1 kserve INFO [server.py:_register_endpoints():108] OpenAI endpoints registered
kserve-container 2025-09-02 15:21:53.490 1 kserve INFO [server.py:start():161] Starting uvicorn with 1 workers
kserve-container 2025-09-02 15:21:53.575 1 kserve ERROR [__main__.py:<module>():322] Failed to start model server: Engine core initialization failed. See root cause above. Failed core proc(s): {}
I set this env var on the InferenceService resource and got past that error and got it up and running.
env:
- name: VLLM_USE_FLASHINFER_SAMPLER
value: "0"
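A minimal fragment showing one place such an env var can live in the isvc spec (under the predictor model's env):
spec:
  predictor:
    model:
      env:
        - name: VLLM_USE_FLASHINFER_SAMPLER
          value: "0"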
However, now I see this when hitting the OpenAI endpoints:
{"error":"TypeError : RequestLogger.log_inputs() got multiple values for argument 'params'"}
which seems to be a server-side error. I wonder if another file might need patching, or maybe just disabling logging could make things work.
@kittywaresz love what you did. Is there a way to update vLLM to the latest v0.10.2 release? Can you teach me what needs to be done?
@WinsonSou the main idea here is that the huggingface runtime implementation is fragile in the case of a vLLM upgrade because of tight coupling.
You can see it here, for example. Any non-backward-compatible changes to the OpenAIServingModels class signature will break the logic of the huggingface runtime. This applies to any class or function imported from the vLLM package, directly or indirectly.
Therefore, if you want to upgrade vLLM, you must ensure that every piece of logic in the huggingface runtime still works as expected.
In my case, I did this iteratively. The recipe for success is simple (but very time-consuming); see the rough sketch of the local loop after this list:
1. Upgrade the vLLM package inside https://github.com/kserve/kserve/blob/master/python/huggingfaceserver/pyproject.toml and https://github.com/kserve/kserve/blob/master/python/kserve/pyproject.toml, since HuggingFaceServer depends on KServe.
2. Make uv resolve the dependencies.
3. Try to run HuggingFaceServer locally (you may need to tweak the source code to simplify environment requirements; for example, I commented out the piece that tries to determine a CUDA-capable platform).
4. If you see any syntax or import errors in HuggingFaceServer, try to fix them and go to step 3.
5. If there are no errors and the server has loaded your model and started locally, it's time to test it in the desired environment.
6. Build a Docker image from the patched source code, upload it to the desired repo, and patch the ClusterServingRuntime resource named kserve-huggingfaceserver with the new image.
7. Create a testing InferenceService (isvc) that will use the kserve-huggingfaceserver runtime.
8. Check the kserve-container logs. If you see any errors, try to fix them in the source code and go to step 3.
9. If there are no errors and the server has loaded your model, try to perform inference.
10. Check the kserve-container logs again. If you see any errors, try to fix them in the source code and go to step 3.
11. If there are no errors, congratulations, you have finally done it.
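A rough sketch of the local loop from steps 1-5 (assumes a fresh kserve checkout and uv installed; adjust paths and the model directory to your setup):
git clone https://github.com/kserve/kserve.git
cd kserve/python/huggingfaceserver
# bump the vllm pin here and in ../kserve/pyproject.toml, then resolve:
uv sync
# run the server locally against a downloaded model to surface import/startup errors early
uv run python -m huggingfaceserver --model_name=gpt-oss-20b --model_dir=/path/to/gpt-oss-20b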
This is the general idea. Your local environment or model requirements may differ from mine, and you might be able to skip certain steps.
Also, please note that this is an ad hoc solution: if it works with one model, there are no guarantees it will work with others, or on other platforms. As you can see, @marcelovilla did the same patching as I did but still faced some problems.
@kittywaresz this is golden! Thank you so much!! I will try it out and if it works I will comment here.