
It seems that `gpt_oss` model type is not supported by the latest HuggingFace runtime image

Open · kittywaresz opened this issue 4 months ago · 11 comments

/kind bug

What steps did you take and what happened: I tried to run the https://huggingface.co/openai/gpt-oss-20b model using the HuggingFace runtime (kserve/huggingfaceserver:latest-gpu and kserve/huggingfaceserver:v0.15.2-gpu) and received the following exception inside the runtime container:

INFO 08-21 08:13:19 [__init__.py:244] Automatically detected platform cuda.                                                                                                                                                            
Traceback (most recent call last):                                                                                                                                                                                                     
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/transformers/models/auto/configuration_auto.py", line 1131, in from_pretrained                                                                                        
    config_class = CONFIG_MAPPING[config_dict["model_type"]]                                                                                                                                                                           
                   ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                           
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/transformers/models/auto/configuration_auto.py", line 833, in __getitem__                                                                                             
    raise KeyError(key)                                                                                                                                                                                                                
KeyError: 'gpt_oss'                                                                                                                                                                                                                    
                                                                                                                                                                                                                                       
During handling of the above exception, another exception occurred:                                                                                                                                                                    
                                                                                                                                                                                                                                       
Traceback (most recent call last):                                                                                                                                                                                                     
  File "<frozen runpy>", line 198, in _run_module_as_main                                                                                                                                                                              
  File "<frozen runpy>", line 88, in _run_code                                                                                                                                                                                         
  File "/kserve-workspace/huggingfaceserver/huggingfaceserver/__main__.py", line 162, in <module>                                                                                                                                      
    if is_vllm_backend_enabled(initial_args, model_id_or_path):                                                                                                                                                                        
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                         
  File "/kserve-workspace/huggingfaceserver/huggingfaceserver/__main__.py", line 64, in is_vllm_backend_enabled                                                                                                                        
    and infer_vllm_supported_from_model_architecture(                                                                                                                                                                                  
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                  
  File "/kserve-workspace/huggingfaceserver/huggingfaceserver/vllm/utils.py", line 50, in infer_vllm_supported_from_model_architecture                                                                                                 
    model_config = AutoConfig.from_pretrained(                                                                                                                                                                                         
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                         
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/transformers/models/auto/configuration_auto.py", line 1133, in from_pretrained                                                                                        
    raise ValueError(                                                                                                                                                                                                                  
ValueError: The checkpoint you are trying to load has model type `gpt_oss` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`
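
To confirm which transformers version the stock runtime image actually ships, you can inspect it directly (a quick diagnostic sketch, assuming pip is on the image's PATH):

# Check which transformers / vllm versions are pinned in the stock runtime image
docker run -it --rm --entrypoint=pip kserve/huggingfaceserver:v0.15.2-gpu freeze | grep -E 'transformers|vllm'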

What did you expect to happen: Running the model with no errors inside the isvc's pod containers, as it was with https://huggingface.co/Qwen/Qwen2-0.5B, for example

What's the InferenceService yaml: This is the version where I used latest-gpu instead of v0.15.2-gpu (which I had tried before, with the same result)

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  finalizers:
  - inferenceservice.finalizers
  name: gpt-oss-20b
  namespace: gpt-oss
spec:
  predictor:
    annotations:
      serving.knative.dev/progress-deadline: 1740s
    model:
      modelFormat:
        name: huggingface
      name: ""
      protocolVersion: v1
      resources:
        limits:
          cpu: "16"
          memory: "85899345920"
          nvidia.com/gpu: "1"
        requests:
          cpu: "16"
          memory: "85899345920"
          nvidia.com/gpu: "1"
      runtime: invu-huggingfaceserver
      runtimeVersion: latest-gpu
      storageUri: s3://cotype/6f3a1019-4e9e-46e3-b70e-d835e93a426f.zip
    serviceAccountName: gpt-oss-svc-acc
status:
  components:
    predictor:
      latestCreatedRevision: gpt-oss-20b-predictor-00001
  conditions:
  - lastTransitionTime: "2025-08-20T19:23:53Z"
    reason: PredictorConfigurationReady not ready
    severity: Info
    status: Unknown
    type: LatestDeploymentReady
  - lastTransitionTime: "2025-08-20T19:23:53Z"
    severity: Info
    status: Unknown
    type: PredictorConfigurationReady
  - lastTransitionTime: "2025-08-20T19:23:53Z"
    message: Configuration "gpt-oss-20b-predictor" is waiting for a Revision to become
      ready.
    reason: RevisionMissing
    status: Unknown
    type: PredictorReady
  - lastTransitionTime: "2025-08-20T19:23:53Z"
    message: Configuration "gpt-oss-20b-predictor" is waiting for a Revision to become
      ready.
    reason: RevisionMissing
    severity: Info
    status: Unknown
    type: PredictorRouteReady
  - lastTransitionTime: "2025-08-20T19:23:53Z"
    message: Configuration "gpt-oss-20b-predictor" is waiting for a Revision to become
      ready.
    reason: RevisionMissing
    status: Unknown
    type: Ready
  - lastTransitionTime: "2025-08-20T19:23:53Z"
    reason: PredictorRouteReady not ready
    severity: Info
    status: Unknown
    type: RoutesReady
  modelStatus:
    states:
      activeModelState: ""
      targetModelState: Pending
    transitionStatus: InProgress
  observedGeneration: 1

Anything else you would like to add: You can see the custom ClusterServingRuntime invu-huggingfaceserver below; it is just a copy of kserve-huggingfaceserver with the image tag explicitly set to v0.15.2-gpu, to pin a specific version instead of latest-gpu

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: invu-huggingfaceserver
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8080"
  containers:
  - args:
    - --model_name={{.Name}}
    - --model_dir=/mnt/models
    image: kserve/huggingfaceserver:v0.15.2-gpu
    name: kserve-container
    resources:
      limits:
        cpu: "1"
        memory: 2Gi
      requests:
        cpu: "1"
        memory: 2Gi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      privileged: false
      runAsNonRoot: true
  protocolVersions:
  - v2
  - v1
  supportedModelFormats:
  - autoSelect: false
    name: huggingface
    version: "1"

Environment:

  • Istio Version: 1.25.2
  • Knative Version: v1.18.0
  • KServe Version: v0.15.0
  • Kubeflow version: -
  • Cloud Environment: -
  • Minikube/Kind version: -
  • Kubernetes version: (use kubectl version): Server Version: v1.32.4
  • OS (e.g. from /etc/os-release): Ubuntu 22.04.5 LTS

kittywaresz avatar Aug 20 '25 19:08 kittywaresz

This matches Problem No. 7: provider / runtime model compatibility (unrecognized model type at the HF runtime). It's a semantic firewall issue, so you usually do not need wide infra changes; align the model type to a supported runtime image and use WFGY 3.0 to stabilize reasoning when mapping providers and model configs. If you want the checklist and sample env fixes, tell me and I will share the ProblemMap link.

onestardao avatar Aug 21 '25 03:08 onestardao

I also tried to specify the --backend arg:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  finalizers:
  - inferenceservice.finalizers
  name: gpt-oss-20b
  namespace: gpt-oss
spec:
  predictor:
    annotations:
      serving.knative.dev/progress-deadline: 1740s
    model:
      args:
      - --backend=huggingface
      modelFormat:
        name: huggingface
      name: ""
      protocolVersion: v1
      resources:
        limits:
          cpu: "16"
          memory: "85899345920"
          nvidia.com/gpu: "0"
        requests:
          cpu: "16"
          memory: "85899345920"
          nvidia.com/gpu: "0"
      runtime: invu-huggingfaceserver
      runtimeVersion: latest-gpu
      storageUri: s3://cotype/3bd25443-bb89-46e3-b7f6-d90f07043043.zip
    serviceAccountName: gpt-oss-svc-acc
status:
  components:
    predictor:
      latestCreatedRevision: gpt-oss-20b-predictor-00001
  conditions:
  - lastTransitionTime: "2025-08-21T08:25:02Z"
    reason: PredictorConfigurationReady not ready
    severity: Info
    status: Unknown
    type: LatestDeploymentReady
  - lastTransitionTime: "2025-08-21T08:25:02Z"
    severity: Info
    status: Unknown
    type: PredictorConfigurationReady
  - lastTransitionTime: "2025-08-21T08:25:02Z"
    message: Configuration "gpt-oss-20b-predictor" is waiting for a Revision to become
      ready.
    reason: RevisionMissing
    status: Unknown
    type: PredictorReady
  - lastTransitionTime: "2025-08-21T08:25:02Z"
    message: Configuration "gpt-oss-20b-predictor" is waiting for a Revision to become
      ready.
    reason: RevisionMissing
    severity: Info
    status: Unknown
    type: PredictorRouteReady
  - lastTransitionTime: "2025-08-21T08:25:02Z"
    message: Configuration "gpt-oss-20b-predictor" is waiting for a Revision to become
      ready.
    reason: RevisionMissing
    status: Unknown
    type: Ready
  - lastTransitionTime: "2025-08-21T08:25:02Z"
    reason: PredictorRouteReady not ready
    severity: Info
    status: Unknown
    type: RoutesReady
  modelStatus:
    states:
      activeModelState: ""
      targetModelState: Pending
    transitionStatus: InProgress
  observedGeneration: 1

But I received the same error:

INFO 08-21 08:29:01 [__init__.py:244] Automatically detected platform cuda.
2025-08-21 08:29:03.438 1 kserve INFO [storage.py:download():64] Copying contents of /mnt/models to local
2025-08-21 08:29:03.438 1 kserve INFO [storage.py:download():110] Successfully copied /mnt/models to None
2025-08-21 08:29:03.438 1 kserve INFO [storage.py:download():111] Model downloaded in 0.00044586299918591976 seconds.
2025-08-21 08:29:03.439 1 kserve ERROR [__main__.py:<module>():324] Failed to start model server: The checkpoint you are trying to load has model type `gpt_oss` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`
Traceback (most recent call last):
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/transformers/models/auto/configuration_auto.py", line 1131, in from_pretrained
    config_class = CONFIG_MAPPING[config_dict["model_type"]]
                   ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/transformers/models/auto/configuration_auto.py", line 833, in __getitem__
    raise KeyError(key)
KeyError: 'gpt_oss'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/kserve-workspace/huggingfaceserver/huggingfaceserver/__main__.py", line 314, in <module>
    model = load_model()
            ^^^^^^^^^^^^
  File "/kserve-workspace/huggingfaceserver/huggingfaceserver/__main__.py", line 234, in load_model
    model_config = AutoConfig.from_pretrained(
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/transformers/models/auto/configuration_auto.py", line 1133, in from_pretrained
    raise ValueError(
ValueError: The checkpoint you are trying to load has model type `gpt_oss` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.

You can update Transformers with the command `pip install --upgrade transformers`. If this does not work, and the checkpoint is very new, then there may not be a release version that supports this model yet. In this case, you can get the most up-to-date code by installing Transformers from source with the command `pip install git+https://github.com/huggingface/transformers.git`

kittywaresz avatar Aug 21 '25 08:08 kittywaresz

To resolve the issue, I tried to upgrade the transformers and vllm Python packages inside the HuggingFace runtime image.

Here is my Dockerfile 😄

FROM kserve/huggingfaceserver:latest-gpu

RUN pip install --upgrade transformers vllm

docker build -t 78945789345654/hf-gpu-fix:new-transformers-new-vllm .

docker run -it --rm --entrypoint=pip 78945789345654/hf-gpu-fix:new-transformers-new-vllm freeze | grep transformers
> transformers==4.55.2

docker run -it --rm --entrypoint=pip 78945789345654/hf-gpu-fix:new-transformers-new-vllm freeze | grep vllm
> vllm==0.10.1

And when I started my model locally, the error was gone:

docker run \
  --tty \
  --rm \
  --volume $PWD/gpt-oss-20b:/mnt/models \
  78945789345654/hf-gpu-fix:new-transformers-new-vllm \
  --model_name=gpt-oss-20b \
  --model_dir=/mnt/models

In comparison with the original image, where the error is still present:

docker run \
  --tty \
  --rm \
  --volume $PWD/gpt-oss-20b:/mnt/models \
  kserve/huggingfaceserver:latest-gpu \
  --model_name=gpt-oss-20b \
  --model_dir=/mnt/models

kittywaresz avatar Aug 21 '25 08:08 kittywaresz

So I decided to upload this image and use it inside the k8s cluster:

ClusterServingRuntime:

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  name: invu-huggingfaceserver
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8080"
  containers:
  - args:
    - --model_name={{.Name}}
    - --model_dir=/mnt/models
    image: 78945789345654/hf-gpu-fix:latest-gpu  # <--- Here is my image
    name: kserve-container
    resources:
      limits:
        cpu: "1"
        memory: 2Gi
      requests:
        cpu: "1"
        memory: 2Gi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      privileged: false
      runAsNonRoot: true
  protocolVersions:
  - v2
  - v1
  supportedModelFormats:
  - autoSelect: false
    name: huggingface
    version: "1"

ISVC:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  finalizers:
  - inferenceservice.finalizers
  name: gpt-oss-20b
  namespace: gpt-oss
spec:
  predictor:
    annotations:
      serving.knative.dev/progress-deadline: 1740s
    model:
      modelFormat:
        name: huggingface
      name: ""
      protocolVersion: v1
      resources:
        limits:
          cpu: "16"
          memory: "85899345920"
          nvidia.com/gpu: "1"
        requests:
          cpu: "16"
          memory: "85899345920"
          nvidia.com/gpu: "1"
      runtime: invu-huggingfaceserver
      runtimeVersion: latest-gpu
      storageUri: s3://cotype/3bd25443-bb89-46e3-b7f6-d90f07043043.zip
    serviceAccountName: gpt-oss-svc-acc

And got a new exception:

INFO 08-21 09:08:23 [__init__.py:241] Automatically detected platform cuda.
WARNING 08-21 09:08:24 [__init__.py:1734] argument 'disable_log_requests' is deprecated
2025-08-21 09:08:24.321 1 kserve INFO [storage.py:download():64] Copying contents of /mnt/models to local
2025-08-21 09:08:24.321 1 kserve INFO [storage.py:download():110] Successfully copied /mnt/models to None
2025-08-21 09:08:24.321 1 kserve INFO [storage.py:download():111] Model downloaded in 0.0001523587852716446 seconds.
2025-08-21 09:08:24.352 1 kserve INFO [model_server.py:register_model():406] Registering model: gpt-oss-20b
2025-08-21 09:08:24.353 1 kserve INFO [model_server.py:setup_event_loop():286] Setting max asyncio worker threads as 32
2025-08-21 09:08:24.379 1 kserve INFO [server.py:_register_endpoints():108] OpenAI endpoints registered
2025-08-21 09:08:24.379 1 kserve INFO [server.py:start():161] Starting uvicorn with 1 workers
2025-08-21 09:08:24.429 1 uvicorn.error INFO:     Started server process [1]
2025-08-21 09:08:24.429 1 uvicorn.error INFO:     Waiting for application startup.
2025-08-21 09:08:24.431 1 kserve INFO [server.py:start():70] Starting gRPC server with 4 workers
2025-08-21 09:08:24.432 1 kserve INFO [server.py:start():71] Starting gRPC server on [::]:8081
2025-08-21 09:08:24.432 1 uvicorn.error INFO:     Application startup complete.
2025-08-21 09:08:24.439 1 uvicorn.error ERROR:    Traceback (most recent call last):
  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 691, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/kserve/model_server.py", line 301, in _serve
    await asyncio.gather(*self.servers)
  File "/kserve-workspace/huggingfaceserver/huggingfaceserver/vllm/vllm_model.py", line 101, in start_engine
    self.args.enable_reasoning
AttributeError: 'Namespace' object has no attribute 'enable_reasoning'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/starlette/routing.py", line 699, in lifespan
    await receive()
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/uvicorn/lifespan/on.py", line 137, in receive
    return await self.receive_queue.get()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/queues.py", line 158, in get
    await getter
asyncio.exceptions.CancelledError
2025-08-21 09:08:24.439 1 kserve ERROR [__main__.py:<module>():324] Failed to start model server: 'Namespace' object has no attribute 'enable_reasoning'
Traceback (most recent call last):
  File "/kserve-workspace/huggingfaceserver/huggingfaceserver/__main__.py", line 320, in <module>
    model_server.start([model])
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/kserve/model_server.py", line 344, in start
    asyncio.run(self._serve())
  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 691, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/kserve/model_server.py", line 301, in _serve
    await asyncio.gather(*self.servers)
  File "/kserve-workspace/huggingfaceserver/huggingfaceserver/vllm/vllm_model.py", line 101, in start_engine
    self.args.enable_reasoning
AttributeError: 'Namespace' object has no attribute 'enable_reasoning'
Error in sys.excepthook:

Original exception was:

kittywaresz avatar Aug 21 '25 09:08 kittywaresz

So I went deeper and tried to change the runtime source code to respect the new vllm package requirements: --enable-reasoning became deprecated and can no longer be parsed from args, and prompt_adapters became stale, as far as I can see.

Here is my new image:

FROM kserve/huggingfaceserver:latest-gpu

RUN pip install --upgrade transformers vllm

COPY ./patched_vllm_model.py /kserve-workspace/huggingfaceserver/huggingfaceserver/vllm/vllm_model.py

You can see patched_vllm_model.py here https://github.com/kserve/kserve/pull/4659/files#diff-7df5106c0f3e1672869e494b083850402bacf0f43e48e960f9935f2d93accdfe
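
If you want to see exactly what you are replacing, you can dump the file shipped inside the image and diff it against the patched version (a sketch, reusing the --entrypoint trick from the pip checks above; the path comes from the tracebacks, and cat is assumed to be available in the image):

# Extract the vllm_model.py that ships in the image, then compare with the patch
docker run --rm --entrypoint=cat kserve/huggingfaceserver:latest-gpu \
  /kserve-workspace/huggingfaceserver/huggingfaceserver/vllm/vllm_model.py > original_vllm_model.py
diff original_vllm_model.py patched_vllm_model.py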

And after launching this image I got a working runtime:

kserve-container INFO 08-21 11:10:51 [__init__.py:241] Automatically detected platform cuda.
kserve-container WARNING 08-21 11:10:52 [__init__.py:1734] argument 'disable_log_requests' is deprecated
kserve-container 2025-08-21 11:10:52.469 1 kserve INFO [storage.py:download():64] Copying contents of /mnt/models to local
kserve-container 2025-08-21 11:10:52.469 1 kserve INFO [storage.py:download():110] Successfully copied /mnt/models to None
kserve-container 2025-08-21 11:10:52.469 1 kserve INFO [storage.py:download():111] Model downloaded in 0.00015321210958063602 seconds.
kserve-container 2025-08-21 11:10:52.501 1 kserve INFO [model_server.py:register_model():406] Registering model: gpt-oss-20b
kserve-container 2025-08-21 11:10:52.502 1 kserve INFO [model_server.py:setup_event_loop():286] Setting max asyncio worker threads as 32
kserve-container INFO 08-21 11:10:57 [__init__.py:711] Resolved architecture: GptOssForCausalLM
kserve-container ERROR 08-21 11:10:57 [config.py:130] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/models'. Use `repo_type` argument if needed., retrying 1 of 2
kserve-container ERROR 08-21 11:10:59 [config.py:128] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/mnt/models'. Use `repo_type` argument if needed.
kserve-container INFO 08-21 11:10:59 [__init__.py:2816] Downcasting torch.float32 to torch.bfloat16.
kserve-container INFO 08-21 11:10:59 [__init__.py:1750] Using max model len 131072
kserve-container WARNING 08-21 11:11:00 [__init__.py:1171] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
kserve-container INFO 08-21 11:11:01 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
kserve-container INFO 08-21 11:11:01 [config.py:273] Overriding max cuda graph capture size to 1024 for performance.
kserve-container INFO 08-21 11:11:05 [__init__.py:241] Automatically detected platform cuda.
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:07 [core.py:636] Waiting for init message from front-end.
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:07 [core.py:74] Initializing a V1 LLM engine (v0.10.1) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend='GptOss'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/mnt/models, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[1024,1008,992,976,960,944,928,912,896,880,864,848,832,816,800,784,768,752,736,720,704,688,672,656,640,624,608,592,576,560,544,528,512,496,480,464,448,432,416,400,384,368,352,336,320,304,288,272,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":1024,"local_cache_dir":null}
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:08 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:08 [topk_topp_sampler.py:50] Using FlashInfer for top-p & top-k sampling.
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:08 [gpu_model_runner.py:1953] Starting to load model /mnt/models...
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:08 [gpu_model_runner.py:1985] Loading model from scratch...
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:08 [cuda.py:323] Using Triton backend on V1 engine.
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:08 [triton_attn.py:257] Using vllm unified attention for TritonAttentionImpl
kserve-container (EngineCore_0 pid=144) Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
kserve-container (EngineCore_0 pid=144) Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:03<00:07,  3.74s/it]
kserve-container (EngineCore_0 pid=144) Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:08<00:04,  4.50s/it]
kserve-container (EngineCore_0 pid=144) Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:13<00:00,  4.70s/it]
kserve-container (EngineCore_0 pid=144) Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:13<00:00,  4.57s/it]
kserve-container (EngineCore_0 pid=144)
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:22 [default_loader.py:262] Loading weights took 13.78 seconds
kserve-container (EngineCore_0 pid=144) WARNING 08-21 11:11:22 [marlin_utils_fp4.py:196] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:23 [gpu_model_runner.py:2007] Model loading took 13.7194 GiB and 14.309837 seconds
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:28 [backends.py:548] Using cache directory: /home/kserve/.cache/vllm/torch_compile_cache/3a3580bf2a/rank_0_0/backbone for vLLM's torch.compile
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:28 [backends.py:559] Dynamo bytecode transform time: 5.36 s
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:31 [backends.py:194] Cache the graph for dynamic shape for later use
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:11:59 [backends.py:215] Compiling a graph for dynamic shape takes 29.77 s
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:12:00 [marlin_utils.py:353] You are running Marlin kernel with bf16 on GPUs before SM90. You can consider change to fp16 to achieve better performance if possible.
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:12:02 [monitor.py:34] torch.compile takes 35.13 s in total
kserve-container (EngineCore_0 pid=144) /kserve-workspace/prod_venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py:2356: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
kserve-container (EngineCore_0 pid=144) If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
kserve-container (EngineCore_0 pid=144)   warnings.warn(
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:12:02 [gpu_worker.py:276] Available KV cache memory: 56.90 GiB
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:12:03 [kv_cache_utils.py:1013] GPU KV cache size: 1,242,912 tokens
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:12:03 [kv_cache_utils.py:1017] Maximum concurrency for 131,072 tokens per request: 18.65x
kserve-container (EngineCore_0 pid=144) Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   0%|          | 0/83 [00:00<?, ?it/s]
kserve-container (EngineCore_0 pid=144) Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 83/83 [00:04<00:00, 20.37it/s]
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:12:07 [gpu_model_runner.py:2708] Graph capturing finished in 4 secs, took 0.79 GiB
kserve-container (EngineCore_0 pid=144) INFO 08-21 11:12:07 [core.py:214] init engine (profile, create kv cache, warmup model) took 44.21 seconds
kserve-container INFO 08-21 11:12:09 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 155365
kserve-container 2025-08-21 11:12:11.360 1 kserve INFO [utils.py:build_async_engine_client_from_engine_args():109] V1 AsyncLLM build complete
kserve-container 2025-08-21 11:12:11.387 1 kserve INFO [server.py:_register_endpoints():108] OpenAI endpoints registered
kserve-container 2025-08-21 11:12:11.387 1 kserve INFO [server.py:start():161] Starting uvicorn with 1 workers
kserve-container 2025-08-21 11:12:11.438 1 uvicorn.error INFO:     Started server process [1]
kserve-container 2025-08-21 11:12:11.438 1 uvicorn.error INFO:     Waiting for application startup.
kserve-container 2025-08-21 11:12:11.439 1 kserve INFO [server.py:start():70] Starting gRPC server with 4 workers
kserve-container 2025-08-21 11:12:11.439 1 kserve INFO [server.py:start():71] Starting gRPC server on [::]:8081
kserve-container 2025-08-21 11:12:11.439 1 uvicorn.error INFO:     Application startup complete.
kserve-container 2025-08-21 11:12:11.440 1 uvicorn.error INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
kserve-container 2025-08-21 11:12:22.719 uvicorn.access INFO:     192.168.105.3:0 1 - "GET / HTTP/1.1" 200 OK

Here is the final ISVC:

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  finalizers:
  - inferenceservice.finalizers
  name: gpt-oss-20b
  namespace: gpt-oss
spec:
  predictor:
    annotations:
      serving.knative.dev/progress-deadline: 1740s
    model:
      modelFormat:
        name: huggingface
      name: ""
      protocolVersion: v1
      resources:
        limits:
          cpu: "16"
          memory: "85899345920"
          nvidia.com/gpu: "1"
        requests:
          cpu: "16"
          memory: "85899345920"
          nvidia.com/gpu: "1"
      runtime: invu-huggingfaceserver
      runtimeVersion: latest-gpu
      storageUri: s3://cotype/3bd25443-bb89-46e3-b7f6-d90f07043043.zip
    serviceAccountName: gpt-oss-svc-acc
status:
  address:
    url: http://gpt-oss-20b.gpt-oss.svc.cluster.local
  components:
    predictor:
      address:
        url: http://gpt-oss-20b-predictor.gpt-oss.svc.cluster.local
      latestCreatedRevision: gpt-oss-20b-predictor-00001
      latestReadyRevision: gpt-oss-20b-predictor-00001
      latestRolledoutRevision: gpt-oss-20b-predictor-00001
      traffic:
      - latestRevision: true
        percent: 100
        revisionName: gpt-oss-20b-predictor-00001
      url: <SECRET>

And the runtime spec:

apiVersion: serving.kserve.io/v1alpha1
kind: ClusterServingRuntime
metadata:
  annotations:
    meta.helm.sh/release-name: inferencevalve-stack
    meta.helm.sh/release-namespace: inferencevalve-stack
  creationTimestamp: "2025-08-20T21:26:52Z"
  generation: 2
  labels:
    app.kubernetes.io/instance: inferencevalve-stack
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: inferencevalve-stack
    app.kubernetes.io/version: ""
    helm.sh/chart: inferencevalve-stack-0.3.1
  name: invu-huggingfaceserver
  resourceVersion: "37047071"
  uid: e767e576-8702-4724-b9d5-15b17fbf7cd1
spec:
  annotations:
    prometheus.kserve.io/path: /metrics
    prometheus.kserve.io/port: "8080"
  containers:
  - args:
    - --model_name={{.Name}}
    - --model_dir=/mnt/models
    image: 78945789345654/hf-gpu-fix:latest-gpu
    name: kserve-container
    resources:
      limits:
        cpu: "1"
        memory: 2Gi
      requests:
        cpu: "1"
        memory: 2Gi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      privileged: false
      runAsNonRoot: true
  protocolVersions:
  - v2
  - v1
  supportedModelFormats:
  - autoSelect: false
    name: huggingface
    version: "1"

Where 78945789345654/hf-gpu-fix:latest-gpu is https://hub.docker.com/repository/docker/78945789345654/hf-gpu-fix/tags/latest-gpu/sha256:b8c296f0e220a8d01a143c5d14ba33d0fc80375080673efea9657ee0bdc3a280

kittywaresz avatar Aug 21 '25 11:08 kittywaresz

@kittywaresz I ran into this issue when trying to run gpt-oss-20b on kserve, so I tried to build the custom image the way you did but I see this in the pod logs:

kserve-container INFO 09-02 06:44:03 [__init__.py:245] No platform detected, vLLM is running on UnspecifiedPlatform
kserve-container WARNING 09-02 06:44:03 [_custom_ops.py:20] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')
kserve-container Traceback (most recent call last):
kserve-container   File "<frozen runpy>", line 198, in _run_module_as_main
kserve-container   File "<frozen runpy>", line 88, in _run_code
kserve-container   File "/kserve-workspace/huggingfaceserver/huggingfaceserver/__main__.py", line 164, in <module>
kserve-container     parser = maybe_add_vllm_cli_parser(parser)
kserve-container              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
kserve-container   File "/kserve-workspace/huggingfaceserver/huggingfaceserver/vllm/utils.py", line 63, in maybe_add_vllm_cli_parser
kserve-container     return make_arg_parser(parser)
kserve-container            ^^^^^^^^^^^^^^^^^^^^^^^
kserve-container   File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/entrypoints/openai/cli_args.py", line 258, in make_arg_parser
kserve-container     parser = AsyncEngineArgs.add_cli_args(parser)
kserve-container              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
kserve-container   File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1747, in add_cli_args
kserve-container     parser = EngineArgs.add_cli_args(parser)
kserve-container              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
kserve-container   File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 835, in add_cli_args
kserve-container     vllm_kwargs = get_kwargs(VllmConfig)
kserve-container                   ^^^^^^^^^^^^^^^^^^^^^^
kserve-container   File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 261, in get_kwargs
kserve-container     return copy.deepcopy(_compute_kwargs(cls))
kserve-container                          ^^^^^^^^^^^^^^^^^^^^
kserve-container   File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 171, in _compute_kwargs
kserve-container     default = field.default_factory()
kserve-container               ^^^^^^^^^^^^^^^^^^^^^^^
kserve-container   File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/pydantic/_internal/_dataclasses.py", line 123, in __init__
kserve-container     s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
kserve-container   File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/config/__init__.py", line 1882, in __post_init__
kserve-container     raise RuntimeError(
kserve-container RuntimeError: Failed to infer device type, please set the environment variable `VLLM_LOGGING_LEVEL=DEBUG` to turn on verbose logging to help debug the issue.

Wondering if you also encountered this at one point. I'm running this on nodes with NVIDIA L4 GPUs and I've successfully served other models using the default image.

marcelovilla avatar Sep 02 '25 14:09 marcelovilla

@marcelovilla it seems that your pod started on a node with no GPU drivers available:

kserve-container INFO 09-02 06:44:03 [__init__.py:245] No platform detected, vLLM is running on UnspecifiedPlatform
kserve-container WARNING 09-02 06:44:03 [_custom_ops.py:20] Failed to import from vllm._C with ImportError('libcuda.so.1: cannot open shared object file: No such file or directory')

I've received such a message only when I tried to run the huggingface runtime on my local machine with no CUDA drivers.

As you can see from my logs, in the k8s pod it was always:

INFO 08-21 08:29:01 [__init__.py:244] Automatically detected platform cuda.

When you serve the default huggingface-runtime image, do you receive a 'No platform detected' warning? I'm not sure if upgrading the transformers and vllm packages in my image changed the GPU inspection behavior.

I would be happy to assist if you could provide more details about your GPU node setup (NVIDIA Device Plugin or NVIDIA GPU Operator versions) and the isvc manifest.
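
For a quick check on your side (a sketch; the node, namespace, and pod names are placeholders), you can verify whether the node advertises GPUs and whether the container can actually see the driver:

# Does the node advertise nvidia.com/gpu to the scheduler?
kubectl describe node <your-gpu-node> | grep -i 'nvidia.com/gpu'

# Can the running kserve-container see the driver?
kubectl exec -n <your-namespace> <isvc-predictor-pod> -c kserve-container -- nvidia-smi

nvidia-smi is usually injected into GPU-requesting containers by the NVIDIA container toolkit, so if it is missing there, that is a hint by itself.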

kittywaresz avatar Sep 02 '25 15:09 kittywaresz

@kittywaresz thanks for the help! It seemed to be a one-time thing, as redeploying and having a new node spawn got me past that error. It failed with the following, though:

kserve-container INFO 09-02 15:20:11 [__init__.py:241] Automatically detected platform cuda.
kserve-container WARNING 09-02 15:20:13 [__init__.py:1734] argument 'disable_log_requests' is deprecated
kserve-container 2025-09-02 15:20:13.246 1 kserve INFO [model_server.py:register_model():402] Registering model: gpt-oss-20b
kserve-container 2025-09-02 15:20:13.247 1 kserve INFO [model_server.py:setup_event_loop():282] Setting max asyncio worker threads as 12
kserve-container INFO 09-02 15:20:21 [__init__.py:711] Resolved architecture: GptOssForCausalLM
kserve-container torch_dtype is deprecated! Use dtype instead!

Parse safetensors files:   0%|          | 0/3 [00:00<?, ?it/s]
Parse safetensors files:  33%|███▎      | 1/3 [00:00<00:00,  3.71it/s]
Parse safetensors files: 100%|██████████| 3/3 [00:00<00:00, 11.12it/s]

kserve-container INFO 09-02 15:20:22 [__init__.py:1750] Using max model len 131072
kserve-container WARNING 09-02 15:20:23 [__init__.py:1171] mxfp4 quantization is not fully optimized yet. The speed can be slower than non-quantized models.
kserve-container INFO 09-02 15:20:24 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
kserve-container INFO 09-02 15:20:24 [config.py:273] Overriding max cuda graph capture size to 1024 for performance.
kserve-container INFO 09-02 15:20:32 [__init__.py:241] Automatically detected platform cuda.

kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:34 [core.py:636] Waiting for init message from front-end.
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:34 [core.py:74] Initializing a V1 LLM engine (v0.10.1.1) with config:
  model='openai/gpt-oss-20b',
  speculative_config=None,
  tokenizer='openai/gpt-oss-20b',
  skip_tokenizer_init=False,
  tokenizer_mode=auto,
  revision=None,
  override_neuron_config={},
  tokenizer_revision=None,
  trust_remote_code=False,
  dtype=torch.bfloat16,
  max_seq_len=131072,
  download_dir='/mnt/models',
  load_format=auto,
  tensor_parallel_size=1,
  pipeline_parallel_size=1,
  disable_custom_all_reduce=False,
  quantization=mxfp4,
  enforce_eager=False,
  kv_cache_dtype=auto,
  device_config=cuda,
  decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend='GptOss'),
  observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None),
  seed=0,
  served_model_name=openai/gpt-oss-20b,
  enable_prefix_caching=True,
  chunked_prefill_enabled=True,
  use_async_output_proc=True,
  pooler_config=None,
  compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[1024,1008,992,976,960,944,928,912,896,880,864,848,832,816,800,784,768,752,736,720,704,688,672,656,640,624,608,592,576,560,544,528,512,496,480,464,448,432,416,400,384,368,352,336,320,304,288,272,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],
  cudagraph_copy_inputs=false,
  full_cuda_graph=false,
  pass_config={},
  max_capture_size=1024,
  local_cache_dir=null

kserve-container (EngineCore_0 pid=58) 2025-09-02 15:20:35,542 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:36 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:36 [topk_topp_sampler.py:50] Using FlashInfer for top-p & top-k sampling.
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:36 [gpu_model_runner.py:1953] Starting to load model openai/gpt-oss-20b...
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:37 [gpu_model_runner.py:1985] Loading model from scratch...
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:37 [cuda.py:323] Using Triton backend on V1 engine.
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:37 [triton_attn.py:257] Using vllm unified attention for TritonAttentionImpl
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:37 [weight_utils.py:296] Using model weights format ['*.safetensors']

Loading safetensors checkpoint shards:
  0% Completed | 0/3 [00:00<?, ?it/s]
 33% Completed | 1/3 [00:01<00:02,  1.39s/it]
 67% Completed | 2/3 [00:03<00:01,  1.58s/it]
100% Completed | 3/3 [00:04<00:00,  1.53s/it]
100% Completed | 3/3 [00:04<00:00,  1.53s/it]

kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:42 [default_loader.py:262] Loading weights took 4.85 seconds
kserve-container (EngineCore_0 pid=58) WARNING 09-02 15:20:42 [marlin_utils_fp4.py:196] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:43 [gpu_model_runner.py:2007] Model loading took 13.7164 GiB and 6.253092 seconds
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:51 [backends.py:548] Using cache directory: /home/kserve/.cache/vllm/torch_compile_cache/bb6a9d19b5/rank_0_0/backbone for vLLM's torch.compile
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:51 [backends.py:559] Dynamo bytecode transform time: 7.64 s
kserve-container (EngineCore_0 pid=58) [rank0]:W0902 15:20:53.435000 58 prod_venv/lib/python3.12/site-packages/torch/_inductor/utils.py:1250] [0/0] Not enough SMs to use max_autotune_gemm mode
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:20:56 [backends.py:194] Cache the graph for dynamic shape for later use
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:21:37 [backends.py:215] Compiling a graph for dynamic shape takes 45.51 s
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:21:39 [marlin_utils.py:353] You are running Marlin kernel with bf16 on GPUs before SM90. You can consider change to fp16 to achieve better performance if possible.
kserve-container (EngineCore_0 pid=58) INFO 09-02 15:21:50 [monitor.py:34] torch.compile takes 53.14 s in total

kserve-container (EngineCore_0 pid=58) Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 704, in run_engine_core
    raise e
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 691, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 492, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 89, in __init__
    self._initialize_kv_caches(vllm_config)
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 179, in _initialize_kv_caches
    self.model_executor.determine_available_memory())
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 76, in determine_available_memory
    output = self.collective_rpc("determine_available_memory")
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/utils/__init__.py", line 3007, in run_method
    return func(*args, **kwargs)
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 244, in determine_available_memory
    self.model_runner.profile_run()
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2627, in profile_run
    output = self._dummy_sampler_run(last_hidden_states)
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2470, in _dummy_sampler_run
    raise e
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 2460, in _dummy_sampler_run
    sampler_output = self.sampler(logits=logits,
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/sample/sampler.py", line 68, in forward
    sampled = self.sample(logits, sampling_metadata)
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/sample/sampler.py", line 135, in sample
    random_sampled = self.topk_topp_sampler(
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1762, in _call_impl
    return forward_call(*args, **kwargs)
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/sample/ops/topk_topp_sampler.py", line 109, in forward_cuda
    return flashinfer_sample(logits.contiguous(), k, p, generators)
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/vllm/v1/sample/ops/topk_topp_sampler.py", line 295, in flashinfer_sample
    next_token_ids = flashinfer.sampling.top_k_top_p_sampling_from_logits(
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/flashinfer/sampling.py", line 806, in top_k_top_p_sampling_from_logits
    masked_logits = top_k_mask_logits(logits, top_k)
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/flashinfer/sampling.py", line 1126, in top_k_mask_logits
    return get_sampling_module().top_k_mask_logits(
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/flashinfer/sampling.py", line 36, in get_sampling_module
    module = load_cuda_ops(
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/flashinfer/jit/core.py", line 135, in load_cuda_ops
    torch_cpp_ext.load(
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1623, in load
    return _jit_compile(
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2076, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2222, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/kserve-workspace/prod_venv/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 2522, in _run_ninja_build
    raise RuntimeError(message) from e

kserve-container (EngineCore_0 pid=58) RuntimeError: Error building extension 'sampling':
  [1/4] /usr/local/cuda/bin/nvcc ... -c /.../sampling.cu -o sampling.cuda.o
  FAILED: sampling.cuda.o
  /bin/sh: 1: /usr/local/cuda/bin/nvcc: not found

  [2/4] /usr/local/cuda/bin/nvcc ... -c /.../renorm.cu -o renorm.cuda.o
  FAILED: renorm.cuda.o
  /bin/sh: 1: /usr/local/cuda/bin/nvcc: not found

  [3/4] /usr/local/cuda/bin/nvcc ... -c /.../flashinfer_sampling_ops.cu -o flashinfer_sampling_ops.cuda.o
  FAILED: flashinfer_sampling_ops.cuda.o
  /bin/sh: 1: /usr/local/cuda/bin/nvcc: not found

  ninja: build stopped: subcommand failed.

kserve-container [rank0]:[W902 15:21:52.625339019 ProcessGroupNCCL.cpp:1479] Warning: destroy_process_group() was not called before program exit...
kserve-container 2025-09-02 15:21:53.447 1 kserve INFO [utils.py:build_async_engine_client_from_engine_args():109] V1 AsyncLLM build complete
kserve-container 2025-09-02 15:21:53.490 1 kserve INFO [server.py:_register_endpoints():108] OpenAI endpoints registered
kserve-container 2025-09-02 15:21:53.490 1 kserve INFO [server.py:start():161] Starting uvicorn with 1 workers

kserve-container 2025-09-02 15:21:53.575 1 kserve ERROR [__main__.py:<module>():322]
  Failed to start model server: Engine core initialization failed. See root cause above. Failed core proc(s): {}

I set this env var on the InferenceService resource, which got me past that error and got it up and running:

      env:
        - name: VLLM_USE_FLASHINFER_SAMPLER
          value: "0"

However, now I see this when hitting the OpenAI endpoints:

{"error":"TypeError : RequestLogger.log_inputs() got multiple values for argument 'params'"}

which seems to be a server-side error. I wonder if another file might need patching, or maybe just disabling request logging could make things work.

marcelovilla avatar Sep 02 '25 16:09 marcelovilla

@kittywaresz love what you did. Is there a way to update vLLM to the latest v0.10.2 release? Can you teach me what needs to be done?

WinsonSou avatar Sep 20 '25 01:09 WinsonSou

@WinsonSou the main idea here is that the huggingface runtime implementation is fragile in the case of a vLLM upgrade because of tight coupling.

You can see it here, for example. Any non-backward-compatible changes to the OpenAIServingModels class signature will break the logic of the huggingface runtime. This applies to any class or function imported from the vLLM package, directly or indirectly.

Therefore, if you want to upgrade vLLM, you must ensure that every piece of logic in the huggingface runtime still works as expected.

In my case, I did this iteratively. The recipe for success is simple (but very time-consuming):

  1. Upgrade the vLLM package inside https://github.com/kserve/kserve/blob/master/python/huggingfaceserver/pyproject.toml and https://github.com/kserve/kserve/blob/master/python/kserve/pyproject.toml, since HuggingFaceServer depends on KServe.
  2. Make uv resolve the dependencies.
  3. Try to run HuggingFaceServer locally (you may need to tweak the source code to simplify environment requirements; for example, I commented out the piece that tries to determine a CUDA-capable platform).
  4. If you see any syntax or import errors in HuggingFaceServer, try to fix them and go to step β„–3.
  5. If there are no errors and the server has loaded your model and started locally, it's time to test it in the desired environment.
  6. Build a Docker image from the patched source code, upload it to the desired repo, and patch the ClusterServingRuntime resource named kserve-huggingfaceserver with the new image (see the patch sketch after this list).
  7. Create a testing InferenceService (isvc) that will use the kserve-huggingfaceserver runtime.
  8. Check the kserve-container logs. If you see any errors, try to fix them in the source code and go to step β„–3.
  9. If there are no errors and the server has loaded your model, try to perform inference.
  10. Check the kserve-container logs again. If you see any errors, try to fix them in the source code and go to step β„–3.
  11. If there are no errors, congratulations, you have finally done it.
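
For step 6, a possible way to swap the image in place (a sketch; the container index matches the runtime specs shown above, and the image name is a placeholder):

kubectl patch clusterservingruntime kserve-huggingfaceserver --type=json \
  -p='[{"op":"replace","path":"/spec/containers/0/image","value":"<your-registry>/huggingfaceserver:patched-gpu"}]'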

This is the general idea. Your local environment or model requirements may differ from mine, and you might be able to skip certain steps.

Also, please note that this is an ad hoc solution. If it works with one model, there are no guarantees it will work with others, or on other platforms. As you can see, @marcelovilla did the same patching as I did but still faced some problems.

kittywaresz avatar Sep 20 '25 19:09 kittywaresz

@kittywaresz this is golden! Thank you so much!! I will try it out and if it works I will comment here.

WinsonSou avatar Sep 21 '25 00:09 WinsonSou