[Bug]: OpenVINO Model Server 2025.3 BAREMETAL crashes running on GPU Intel UHD 620

Open waltersh opened this issue 3 weeks ago • 1 comments

OpenVINO Version

2025.3

Operating System

Windows System

Device used for inference

GPU

Framework

None

Model used

OpenVINO/Phi-3.5-mini-instruct-int4-ov

Issue description

After following the step-by-step installation process, I ran into this problem while following the "QuickStart - LLM Models" guide. When I start the server with OpenVINO/Phi-3.5-mini-instruct-int4-ov running on CPU, everything goes well and runs smoothly. But when I start the server with the same model and "target_device" set to GPU, it does not go well. After the model is loaded, I can make the REST request on the v1 endpoint to get the configuration (with the request provided in the guide), and it shows the expected JSON. But when I call the v3 endpoint with the example request provided, the server crashes. Nothing is shown in the terminal, but the server stops without a response. No matter whether I use CMD or PowerShell, the behavior remains the same: on CPU it works fine, on GPU it crashes.

Step-by-step reproduction

Run the commands to download and unpack the binary:
curl -L https://github.com/openvinotoolkit/model_server/releases/download/v2025.3/ovms_windows_python_on.zip -o ovms.zip
tar -xf ovms.zip

Then run the setup script before starting the server. In my case I used the command-line (CMD) option:
.\ovms\setupvars.bat

Then run the command to start the server:
ovms.exe --source_model OpenVINO/Phi-3.5-mini-instruct-int4-ov --model_repository_path models --rest_port 8000 --task text_generation --target_device GPU --cache_size 2

In another command line, run the config test:
curl http://localhost:8000/v1/config

Result (as expected):
Microsoft Windows [Versión 10.0.26200.7462]
(c) Microsoft Corporation. Todos los derechos reservados.

C:\Users\walte>curl http://localhost:8000/v1/config
{
  "OpenVINO/Phi-3.5-mini-instruct-int4-ov" : {
    "model_version_status": [
      { "version": "1", "state": "AVAILABLE", "status": { "error_code": "OK", "error_message": "OK" } }
    ]
  }
}
C:\Users\walte>

Run the request to get an inference:
curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{"model": "OpenVINO/Phi-3.5-mini-instruct-int4-ov", "max_tokens": 30, "temperature": 0, "stream": false, "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What are the 3 main tourist attractions in Paris?"}]}"

Response:
C:\Users\walte>curl -s http://localhost:8000/v3/chat/completions -H "Content-Type: application/json" -d "{"model": "OpenVINO/Phi-3.5-mini-instruct-int4-ov", "max_tokens": 30, "temperature": 0, "stream": false, "messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What are the 3 main tourist attractions in Paris?"}]}"

C:\Users\walte>

No output at all, and no further information in the running server window:
[2025-12-11 15:18:36.175][6456][modelmanager][info][mediapipegraphdefinition.cpp:184] Mediapipe: OpenVINO/Phi-3.5-mini-instruct-int4-ov kfs pass through: false
[2025-12-11 15:18:36.175][6456][modelmanager][info][pipelinedefinitionstatus.hpp:60] Mediapipe: OpenVINO/Phi-3.5-mini-instruct-int4-ov state changed to: AVAILABLE after handling: ValidationPassedEvent:
[2025-12-11 15:18:36.176][6456][serving][info][servablemanagermodule.cpp:55] ServableManagerModule started
[2025-12-11 15:18:36.176][8460][modelmanager][info][modelmanager.cpp:1201] Started model manager thread
[2025-12-11 15:18:36.177][16812][modelmanager][info][modelmanager.cpp:1220] Started cleaner thread

C:\Users\walte\modelserver>
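
For anyone trying to reproduce this without worrying about shell quoting, the same v3 request can be sent from Python. This is only a sketch using the standard library: the endpoint, model name, and payload fields are taken from the curl command above, while the helper names `build_chat_request` and `post_chat` and the 300-second timeout are assumptions made for illustration.

```python
import http.client
import json
import urllib.error
import urllib.request


def build_chat_request(model, user_prompt, max_tokens=30):
    """Build the same JSON payload as the curl example in this report."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "temperature": 0,
        "stream": False,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_prompt},
        ],
    }


def post_chat(payload, url="http://localhost:8000/v3/chat/completions", timeout=300):
    """POST the payload and return (status, body); status is None if no reply arrived."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status.
        return err.code, err.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException) as err:
        # Connection refused, reset, or timed out: consistent with the server exiting.
        return None, f"no response from server: {err}"


if __name__ == "__main__":
    payload = build_chat_request(
        "OpenVINO/Phi-3.5-mini-instruct-int4-ov",
        "What are the 3 main tourist attractions in Paris?",
    )
    print(post_chat(payload))
```

A `None` status here would distinguish "the server went away" from "the server rejected the request", which curl -s hides by printing nothing in both cases.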

Printscreen.docx

Relevant log output


Issue submission checklist

  • [x] I'm reporting an issue. It's not a question.
  • [x] I checked the problem with the documentation, FAQ, open issues, Stack Overflow, etc., and have not found a solution.
  • [x] There is reproducer code and related data files such as images, videos, models, etc.

waltersh avatar Dec 11 '25 14:12 waltersh

Hi @waltersh,

Since the model runs fine on CPU but fails silently on the GPU, and considering you are using an Intel UHD 620 (Gen9.5 architecture) on Windows, this is highly likely a Windows TDR (Timeout Detection and Recovery) event or a shared memory limit.

Running a Transformer model like Phi-3.5 on Gen9.5 integrated graphics is computationally heavy. If a single inference layer takes longer than 2 seconds (the Windows default), the OS assumes the GPU has hung and forcibly resets the driver, causing the application to vanish without a standard error log.

Could you please try the following steps to confirm and fix this?

1. We need to tell Windows to wait longer before killing the GPU task.

  1. Open the Registry Editor (regedit) as Administrator.
  2. Navigate to: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers
  3. Right-click in the right pane and select New > DWORD (32-bit) Value.
  4. Name it TdrDelay.
  5. Double-click it, select Decimal, and set the value to 10 (or 20 to be safe).
  6. Restart your computer (required for this to take effect).

After restarting, try running the server again.
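
If you prefer not to click through regedit, the same value can be written with the built-in `reg add` command. The sketch below only builds that command in Python and prints it; the helper name `tdr_delay_command` and the `--apply` switch are made up for this example, and actually applying the change still needs an elevated (Administrator) prompt and a reboot.

```python
import subprocess
import sys

# Registry key from the steps above, in the short form that reg.exe accepts.
TDR_KEY = r"HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"


def tdr_delay_command(seconds):
    """Return the reg.exe argument list that sets TdrDelay to `seconds` (decimal)."""
    if not isinstance(seconds, int) or seconds <= 0:
        raise ValueError("seconds must be a positive integer")
    return [
        "reg", "add", TDR_KEY,
        "/v", "TdrDelay",
        "/t", "REG_DWORD",
        "/d", str(seconds),
        "/f",  # overwrite an existing value without prompting
    ]


if __name__ == "__main__":
    command = tdr_delay_command(10)
    print(" ".join(command))
    # Only touch the registry when explicitly asked to, and only on Windows.
    if sys.platform == "win32" and "--apply" in sys.argv:
        subprocess.run(command, check=True)
```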

2. If it still crashes, we need to see the last operation before the exit. Please run the server with the DEBUG log level:

ovms.exe --source_model OpenVINO/Phi-3.5-mini-instruct-int4-ov --model_repository_path models --rest_port 8000 --task text_generation --target_device GPU --cache_size 2 --log_level DEBUG
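
Since the process disappears without a message, it may also help to capture its exit code together with the last log lines. A rough sketch of one way to do that: `run_and_tail` is a hypothetical helper, not part of OVMS, and it simply wraps whatever command you pass in.

```python
import subprocess
import sys
from collections import deque


def run_and_tail(command, tail_lines=20):
    """Run `command`, keep only its last `tail_lines` output lines, return (exit_code, lines)."""
    tail = deque(maxlen=tail_lines)
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so crash messages are not lost
        text=True,
        errors="replace",
    )
    for line in process.stdout:
        tail.append(line.rstrip("\n"))
    return process.wait(), list(tail)


if __name__ == "__main__":
    # Pass the server command after the script name, for example the
    # ovms.exe ... --log_level DEBUG command shown above.
    exit_code, last_lines = run_and_tail(sys.argv[1:] or [sys.executable, "--version"])
    print("\n".join(last_lines))
    print(f"exit code: {exit_code}")
```

A non-zero or otherwise unusual exit code, plus the final DEBUG lines, would be useful to attach to this issue.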

3. The UHD 620 uses your system RAM as video memory.

  • How much total RAM does your laptop have? (e.g., 8GB, 16GB)
  • If you have 8GB or less, the system might be terminating the process to protect the OS kernel, as the model + KV cache + Windows overhead might exceed available physical memory.
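
To make the memory question concrete, here is the back-of-the-envelope arithmetic as a tiny sketch. All numbers are placeholders you would replace with your own: the function name `memory_headroom_gb` is invented for this example, the OS overhead figure is a guess, and reading `--cache_size 2` as 2 GB of KV cache is an assumption.

```python
def memory_headroom_gb(total_ram_gb, model_gb, kv_cache_gb, os_overhead_gb):
    """Return the RAM left over after model, KV cache, and OS overhead (negative means over budget)."""
    return total_ram_gb - (model_gb + kv_cache_gb + os_overhead_gb)


if __name__ == "__main__":
    # Hypothetical 8 GB laptop: 4 GB model, 2 GB KV cache, 4 GB for Windows and other apps.
    print(memory_headroom_gb(8, 4, 2, 4))
    # Hypothetical 16 GB laptop with the same workload.
    print(memory_headroom_gb(16, 4, 2, 4))
```

A negative result would point toward memory pressure; a comfortably positive one makes that explanation less likely.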

Please let me know if the TdrDelay fix resolves it!

Jiya873 avatar Dec 11 '25 19:12 Jiya873

Hi @Jiya

Thanks so much for your response.

No, unfortunately. After changing TdrDelay the behavior is the same. [image: {C0473A53-5C7C-4CE7-A4C0-0EA30993848B}.png] My laptop has 32 GB of RAM. The model on OVMS is loaded to the GPU and takes less than 4 GB of the 16 GB of available shared memory.

Is there a way to set TdrDelay to wait forever (or something like that)?

waltersh avatar Dec 16 '25 10:12 waltersh

I tried on Linux and got the same behavior, but at least more debug information. [image: image.png]

waltersh avatar Dec 17 '25 07:12 waltersh