CUDA Support - AI Applications CANNOT be deployed in Windows Containers
Windows Containers do not support running CUDA inside a container; this is a limitation of Windows Containers themselves. It is a HUGE strategic gap: because of this alone, AI applications have to be deployed on Linux rather than Windows. Microsoft must add CUDA support to Windows Containers as soon as possible. DirectX is not enough - it has to be full CUDA.
Is your feature request related to a problem? Please describe.
If you want to run an AI application in a Windows Container, you cannot, because CUDA is not supported. DirectX alone (which is supported) is useless for AI.
Describe the solution you'd like
Enable CUDA support in Windows Containers on Windows Server 2025 and later, and work with NVIDIA to release a CUDA-enabled base Docker image.
Describe alternatives you've considered
Running CUDA in WSL, which is preposterous: it is way too slow to be a real solution. We need native CUDA support now, or Windows will become an obsolete platform.
Additional context
See above.
Thank you for creating an Issue. Please note that GitHub is not an official channel for Microsoft support requests. To create an official support request, please open a ticket here. Microsoft and the GitHub Community strive to provide a best effort in answering questions and supporting Issues on GitHub.
Hi, we're aware this is something that people want. In the meantime, could you explain more about what you've tried to get CUDA up and running? Did you use Windows Server 2025?
Just wanted to follow up on this. Microsoft added support in Windows Server 2025 for GPU scenarios within Windows Containers beyond the classic DirectX support, so you should now be able to run CUDA applications within a Windows Server container.
As a quick test, I just built the deviceQuery sample from NVIDIA's CUDA samples. See this page for reference: https://docs.nvidia.com/cuda/cuda-installation-guide-microsoft-windows/index.html#compiling-sample-projects
I used Visual Studio 2022 to compile the sample and put deviceQuery.exe in a directory called "demo" on my host. Then I copied the various Visual Studio runtime libraries, as well as a few CUDA runtime files, into that directory.
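For context, the interesting part of deviceQuery boils down to a couple of CUDA runtime API calls. Here's a minimal sketch along those lines (my own simplified version, not NVIDIA's actual sample code; the file name deviceprobe.cu is made up) that you could compile the same way with nvcc and drop into the demo directory:

// deviceprobe.cu - minimal stand-in for the deviceQuery sample (hypothetical).
// Build with: nvcc deviceprobe.cu -o deviceprobe.exe
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        // Same failure path deviceQuery reports when no usable CUDA
        // driver is visible inside the container.
        std::printf("cudaGetDeviceCount returned %d -> %s\n",
                    (int)err, cudaGetErrorString(err));
        return 1;
    }
    std::printf("Detected %d CUDA Capable device(s)\n", count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("Device %d: \"%s\", compute capability %d.%d, %d multiprocessors\n",
                    i, prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    }
    return 0;
}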
Then I launched my container as follows:
docker run --rm -it -v c:\demo:c:\demo --isolation=process --device class/5B45201D-F2F2-4F3B-85BB-30FF1F953599 mcr.microsoft.com/windows/servercore:ltsc2025
Note: the parameter "--device class/5B45201D-F2F2-4F3B-85BB-30FF1F953599" is required because, when it is set, the OS ensures that the required files from the host's windows\system32 directory are mapped into the container. See this link for more info: https://techcommunity.microsoft.com/blog/containers/bringing-gpu-acceleration-to-windows-containers/393939
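One quick way to sanity-check that the mapping actually happened is to try loading the user-mode CUDA driver DLL yourself from inside the container. This assumes (my assumption, not something the blog post states outright) that nvcuda.dll from the host's System32 is among the files the switch maps in; the file name nvcheck.cpp is made up:

// nvcheck.cpp - hypothetical sanity check: is the host's nvcuda.dll visible?
// Plain Win32, no CUDA SDK needed. Build with: cl nvcheck.cpp
#include <cstdio>
#include <windows.h>

int main() {
    // nvcuda.dll is the user-mode CUDA driver that normally lives in
    // C:\Windows\System32 on the host.
    HMODULE nvcuda = LoadLibraryA("nvcuda.dll");
    if (nvcuda == nullptr) {
        std::printf("nvcuda.dll not found (error %lu) - the host driver files "
                    "were probably not mapped into the container.\n", GetLastError());
        return 1;
    }
    std::printf("nvcuda.dll loaded successfully.\n");
    FreeLibrary(nvcuda);
    return 0;
}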
If I then navigate to the \demo directory and run deviceQuery.exe, I see various info about my GPU:
=-=-=-=
deviceQuery.exe Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA RTX A4000"
  CUDA Driver Version / Runtime Version          12.9 / 12.9
  CUDA Capability Major/Minor version number:    8.6
  Total amount of global memory:                 16376 MBytes (17170956288 bytes)
  (048) Multiprocessors, (128) CUDA Cores/MP:    6144 CUDA Cores
  GPU Max Clock rate:                            1560 MHz (1.56 GHz)
  Memory Clock rate:                             7001 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        102400 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1536
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):     (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device supports Managed Memory:                Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 195 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 12.9, CUDA Runtime Version = 12.9, NumDevs = 1
Result = PASS
=-=-=-=
If I don't specify the "--device class/5B45201D-F2F2-4F3B-85BB-30FF1F953599" switch, then the attempt to run the program will fail with the following:
=-=-=-=
deviceQuery.exe Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 35
-> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL
=-=-=-=
It fails without the switch because the required driver files from the host aren't mapped into the container, so the CUDA runtime can't find a usable driver.
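If you want to confirm that interpretation from inside the container, the runtime API can report the driver and runtime versions separately, and cudaDriverGetVersion reports 0 when no driver is visible at all. A small sketch (again my own code with a made-up file name, not part of the CUDA samples):

// drivercheck.cu - hypothetical helper for interpreting error 35
// (cudaErrorInsufficientDriver). Build with: nvcc drivercheck.cu -o drivercheck.exe
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int driverVersion = 0, runtimeVersion = 0;
    // cudaDriverGetVersion reports 0 if no CUDA driver is installed/visible,
    // which is the situation without the --device switch.
    cudaDriverGetVersion(&driverVersion);
    cudaRuntimeGetVersion(&runtimeVersion);
    // Versions are encoded as 1000*major + 10*minor, e.g. 12090 for 12.9.
    std::printf("CUDA driver version:  %d.%d\n",
                driverVersion / 1000, (driverVersion % 100) / 10);
    std::printf("CUDA runtime version: %d.%d\n",
                runtimeVersion / 1000, (runtimeVersion % 100) / 10);
    if (driverVersion < runtimeVersion) {
        std::printf("Driver is older than the runtime -> expect error 35 "
                    "(cudaErrorInsufficientDriver) from runtime calls.\n");
        return 1;
    }
    std::printf("Driver/runtime combination looks OK.\n");
    return 0;
}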
I hope this helps, Erick Smith (Microsoft)
This issue has been open for 90 days with no updates and has no assignees. Please provide an update or close this issue.