
Phi-4-mini-gpu-int4-rtn-block-32 (DirectML/CUDA - Small, Standard) fails after several messages

Open EvgeniiVovchok opened this issue 7 months ago • 4 comments

2025-05-27 11:43:50.848 [info] Information: Microsoft.Neutron.OpenAI.Delegates.OpenAIApi [0]
    2025-05-27T11:43:50.8481637+02:00 HandleChatCompletionAsStreamRequest -> model:Phi-4-mini-gpu-int4-rtn-block-32 MaxCompletionTokens:(null) maxTokens:256 temperature:(null) topP:(null)
2025-05-27 11:43:50.853 [info] Information: Microsoft.Neutron.OpenAI.Provider.OpenAIServiceProviderOnnx [0]
    2025-05-27T11:43:50.8525922+02:00 HandleChatCompletionAsStreamRequest -> model:Phi-4-mini-gpu-int4-rtn-block-32 MaxCompletionTokens:(null) maxTokens:256 temperature:(null) topP:(null)
2025-05-27 11:43:50.856 [info] Error: Microsoft.AspNetCore.Server.Kestrel [13]
    2025-05-27T11:43:50.8556186+02:00 Connection id "0HNCT064QJ14U", Request id "0HNCT064QJ14U:00000001": An unhandled exception was thrown by the application.
    error: [D:\a_work\1\s\onnxruntime\core\providers\dml\DmlExecutionProvider\src\DmlCommittedResourceAllocator.cpp(22)\onnxruntime.dll!00007FFEA098C431: (caller: 00007FFEA096D7BC) Exception(1) tid(1538) 887A0005 The GPU device instance has been suspended. Use GetDeviceRemovedReason to determine the appropriate action.
    at Microsoft.ML.OnnxRuntimeGenAI.Result.VerifySuccess(IntPtr) + 0x54
    at Microsoft.Neutron.OpenAI.Provider.OpenAIServiceProviderOnnx.OnnxChatGenerator..ctor(OnnxLoadedModel, GeneratorParams, Sequences) + 0x44
    at Microsoft.Neutron.OpenAI.Provider.OpenAIServiceProviderOnnx.<CreateOnnxChatGeneratorAsync>d__22.MoveNext() + 0xdc8
    --- End of stack trace from previous location ---
    at Microsoft.Neutron.OpenAI.Provider.OpenAIServiceProviderOnnx.<CreateChatGeneratorAsync>d__15.MoveNext() + 0x54
    --- End of stack trace from previous location ---
    at Microsoft.Neutron.OpenAI.Provider.OpenAIServiceProviderBase1.<HandleChatCompletionAsStreamRequestAsync>d__36.MoveNext() + 0x46d
    --- End of stack trace from previous location ---
    at Microsoft.Neutron.OpenAI.OpenAIJsonExtensions.<WriteChatCompletionResponses>d__19.MoveNext() + 0x40b
    --- End of stack trace from previous location ---
    at Microsoft.Neutron.OpenAI.OpenAIJsonExtensions.<WriteChatCompletionResponses>d__19.MoveNext() + 0x754
    --- End of stack trace from previous location ---
    at Microsoft.Neutron.OpenAI.OpenAIJsonExtensions.<WriteChatCompletionResponses>d__19.MoveNext() + 0x89e
    --- End of stack trace from previous location ---
    at Microsoft.Neutron.OpenAI.OpenAIServiceWebApiExtensions.<>c__DisplayClass0_0.<<HandleStreamRequest>b__0>d.MoveNext() + 0x57
    --- End of stack trace from previous location ---
    at Microsoft.AspNetCore.Http.Generated.<GeneratedRouteBuilderExtensions_g>F16C589DE9EC82483AA705851D2FE201CB4CB4AAF6561E8DE71B6A1891AD8D67F__GeneratedRouteBuilderExtensionsCore.<>c__DisplayClass11_0.<<MapPost7>g__RequestHandler|5>d.MoveNext() + 0x245
    --- End of stack trace from previous location ---
    at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpProtocol.<ProcessRequests>d__2381.MoveNext() + 0x355]
2025-05-27 11:43:50.857 [error] Failed to chatStream. model = "ONNX/Phi-4-mini-gpu-int4-rtn-block-32", errorMessage = "Error: Unable to call the Phi-4-mini-gpu-int4-rtn-block-32 inference endpoint. Please check if the input or configuration is correct.", errorType = "c", errorObject = {"innerError":{"code":"ERR_STREAM_PREMATURE_CLOSE"}}
2025-05-27 11:43:50.857 [error] Unable to call the Phi-4-mini-gpu-int4-rtn-block-32 inference endpoint. Please check if the input or configuration is correct. Premature close

EvgeniiVovchok avatar May 27 '25 09:05 EvgeniiVovchok

Hi Yauheni. It seems you are running AI Toolkit on a Windows device, which uses Microsoft DirectML as the execution provider for your model. We also support CUDA for Nvidia GPUs, which provides better native support and better performance on Nvidia hardware. The Phi-4 mini model needs substantial GPU compute and GDDR memory; a device suspension like this is usually caused by hitting hardware limits. You may need to run the model on a more capable GPU. May I ask which GPU you are using?

thatChang avatar May 28 '25 06:05 thatChang

I use an Nvidia GeForce RTX 3060 Mobile. It worked well for several messages in one chat, but then it crashed with this error. I tried reloading VS Code and starting a new chat, but nothing helped.


EvgeniiVovchok avatar May 28 '25 07:05 EvgeniiVovchok

You are experiencing a hardware performance issue. Currently, AI Toolkit supports CPU and DirectML as execution providers for models on Windows; CUDA is supported on Linux only. DirectML supports a wide variety of GPUs from different hardware vendors, whereas CUDA provides better native support for Nvidia GPUs.
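As a quick way to see which execution providers your local ONNX Runtime install actually exposes, you can query `onnxruntime.get_available_providers()`. This is a hedged sketch (AI Toolkit selects providers internally; the `pick_provider` helper and its preference order are illustrative assumptions, not AI Toolkit code):

```python
# Sketch: inspect available ONNX Runtime execution providers and pick one.
# Assumes onnxruntime (or onnxruntime-directml / onnxruntime-gpu) may or
# may not be installed in the current environment.

def pick_provider(available):
    """Prefer CUDA, then DirectML, then fall back to CPU (illustrative order)."""
    for candidate in ("CUDAExecutionProvider", "DmlExecutionProvider"):
        if candidate in available:
            return candidate
    return "CPUExecutionProvider"

try:
    import onnxruntime as ort
    available = ort.get_available_providers()
except ImportError:
    available = ["CPUExecutionProvider"]  # onnxruntime not installed

print("available:", available)
print("selected:", pick_provider(available))
```

On a Windows machine with the DirectML build installed, `DmlExecutionProvider` would appear in the list; on a Linux/WSL2 machine with the CUDA build, `CUDAExecutionProvider` would.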

Two approaches could solve this issue:

  1. Launch VS Code in WSL2 after configuring CUDA properly. Here is the guide provided by Nvidia: https://docs.nvidia.com/cuda/wsl-user-guide/index.html After installation, simply run the 'code' command in WSL2 to launch VS Code; the model can then be loaded with CUDA.
  2. Try the same model in CPU mode, or try other models with lower resource usage.
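As a rough sketch of option 1, the WSL2 side looks roughly like the following (commands follow Nvidia's WSL guide; the exact CUDA package name and repo setup depend on your distro and CUDA version, so treat these as assumptions and check the guide):

```shell
# Inside the WSL2 distro (Ubuntu assumed). The Windows Nvidia driver already
# exposes the GPU to WSL2 -- do NOT install a separate Linux display driver.

# Verify the GPU is visible from WSL2 via the passthrough driver
nvidia-smi

# Install the CUDA toolkit for WSL (repo setup per Nvidia's guide;
# package name varies, e.g. cuda-toolkit-12-x)
sudo apt-get update
sudo apt-get install -y cuda-toolkit

# Confirm the CUDA compiler is on PATH
nvcc --version

# Launch VS Code from WSL2 so AI Toolkit runs against the Linux/CUDA stack
code .
```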

thatChang avatar May 29 '25 03:05 thatChang

Thanks for your support. After I installed the CUDA driver for WSL plus cuDNN, it worked.

EvgeniiVovchok avatar Jun 02 '25 09:06 EvgeniiVovchok