
ggml : add DirectML backend

ggerganov opened this issue 1 year ago • 9 comments

It seems like DirectML supports the upcoming NPU-enabled chips for Windows machines: https://devblogs.microsoft.com/directx/introducing-neural-processor-unit-npu-support-in-directml-developer-preview/

I don't think there is any other way to tap into this hardware, so we should explore whether it is possible to add this library as a backend in ggml in order to run stuff on the NPUs. There has been some semi-related work in the past that combined ggml and Direct3D: https://github.com/Const-me/Whisper. Not sure if it is relevant at all; maybe just as an inspiration.
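
For anyone wanting a feel for the scope: a new backend has to fill in ggml's backend interface. A rough sketch below, with field names paraphrased from ggml-backend and not guaranteed to match the exact current struct:

```cpp
// Paraphrased sketch of ggml's backend interface -- any DirectML/NPU
// backend would have to provide function pointers along these lines.
struct ggml_backend_i {
    const char *     (*get_name)     (ggml_backend_t backend);
    void             (*free)         (ggml_backend_t backend);
    // where tensors for this backend live (device memory)
    ggml_backend_buffer_type_t (*get_default_buffer_type)(ggml_backend_t backend);
    // can this backend run the given op at all? (the crux for DirectML)
    bool             (*supports_op)  (ggml_backend_t backend, const struct ggml_tensor * op);
    // execute a whole compute graph on the device
    enum ggml_status (*graph_compute)(ggml_backend_t backend, struct ggml_cgraph * cgraph);
};
```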

ggerganov avatar Jun 05 '24 14:06 ggerganov

Great idea, it looks like a lot of the upcoming AI hardware is going to have NPUs.

arch-btw avatar Jun 05 '24 18:06 arch-btw

I am not convinced that a DirectML backend is possible: the operators are too high level and new ones cannot be added, which means that we cannot implement a matrix multiplication operator that supports our quant formats. It might be possible to do it with DirectX 12 shaders, but at that point it would be a DirectX 12 backend more than a DirectML backend. It would not allow using ONNX models regardless.
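
To make the mismatch concrete, here is a hypothetical sketch (illustrative names, not real ggml or DirectML backend code) of what a DirectML backend's op-support check would end up looking like, given that DML_OPERATOR_GEMM only understands plain float tensors:

```cpp
// Hypothetical sketch: with DirectML's fixed operator set, every
// block-quantized mat-mul has to be rejected, because there is no way
// to express ggml's quant formats as a DirectML operator.
static bool dml_supports_op(const struct ggml_tensor * op) {
    switch (op->op) {
        case GGML_OP_MUL_MAT:
            return op->src[0]->type == GGML_TYPE_F32 ||
                   op->src[0]->type == GGML_TYPE_F16; // Q4_0, Q8_0, ... cannot be expressed
        default:
            return false; // no mechanism to register custom DirectML operators
    }
}
```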

slaren avatar Jun 06 '24 21:06 slaren

Would using DirectX 12 shaders allow us to run stuff on the NPU? I suppose not, but just making sure. The main point of a potential DirectML backend would be to utilize the NPU. If it is too high-level (i.e. something like what CoreML is on Apple Silicon), then I agree it's not worthwhile (or even possible) to add support for it.

ggerganov avatar Jun 07 '24 06:06 ggerganov

I am not sure it is possible to create custom NPU kernels at all. https://github.com/openvinotoolkit/npu_plugin seems to contain a compiler for the Intel NPU, but it's not clear if it is complete, and they have removed the source of the kernels that should be located in https://github.com/openvinotoolkit/npu_plugin/tree/develop/sw_runtime_kernels, leaving only the binary blobs.

slaren avatar Jun 07 '24 07:06 slaren

Interesting, but they have a PyTorch implementation. I thought PyTorch was pretty diverse in what it can support, but I don't have much insight into the fundamentals here. Or does PyTorch automatically fall back to the CPU for whatever DirectML doesn't support?

sinni800 avatar Jul 06 '24 01:07 sinni800

Interesting, but they have a PyTorch implementation. I thought PyTorch was pretty diverse in what it can support, but I don't have much insight into the fundamentals here. Or does PyTorch automatically fall back to the CPU for whatever DirectML doesn't support?

It should automatically fall back to the CPU in this case.

kylo5aby avatar Jul 12 '24 06:07 kylo5aby

@slaren the lower level D3D metacommands interface leveraged by DirectML is not publicly documented.

The Intel NPU d3d12 drivers have a shader compiler and accept custom kernels. But the DirectML driver for the NPU on Qualcomm systems is metacommands-only, with no custom kernel support, at least so far.
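
For reference, the plain-D3D12 path a custom-kernel driver has to accept is just a compute pipeline built from precompiled DXIL. A minimal sketch, assuming the shader (e.g. a dequantize+matmul kernel for a ggml quant format) was compiled offline with dxc:

```cpp
#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Builds a compute PSO from precompiled DXIL bytecode. Whether a given
// NPU driver will actually execute such a pipeline is the open question.
ComPtr<ID3D12PipelineState> make_compute_pso(ID3D12Device * device,
                                             ID3D12RootSignature * root_sig,
                                             const void * dxil, size_t dxil_size) {
    D3D12_COMPUTE_PIPELINE_STATE_DESC desc = {};
    desc.pRootSignature = root_sig;         // describes the kernel's arguments
    desc.CS             = { dxil, dxil_size };
    ComPtr<ID3D12PipelineState> pso;
    device->CreateComputePipelineState(&desc, IID_PPV_ARGS(&pso));
    return pso;
}
// On a command list: SetPipelineState(pso.Get()); then Dispatch(gx, gy, 1).
```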

woachk avatar Jul 24 '24 02:07 woachk

@woachk thanks, that's useful information. If the Intel NPU driver accepts custom kernels via d3d12 shaders, I expect that it would be possible to fully support it through a d3d12 backend. For NPUs that only support DirectML, it may still be possible to support fp16 and fp32 models. It may also be possible to create a backend that transparently converts the tensors to an internal format, and in this way support the ggml quantization types on DirectML, although that would be a significant deviation from the current backends. Personally, I don't think that a backend that cannot use the ggml quantization types would be very useful.
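
A rough sketch of that "transparent conversion" idea, with hypothetical names (upload_to_device is made up, and the ggml type-traits call is approximate across versions): dequantize on upload so the device only ever sees float data.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical upload hook: convert ggml block-quantized data to fp32
// before it reaches DirectML, which only handles plain float types.
static void dml_set_tensor(struct ggml_tensor * t, const void * data, size_t size) {
    if (ggml_is_quantized(t->type)) {
        std::vector<float> tmp(ggml_nelements(t));
        // ggml's type traits expose a to_float dequantizer per quant type
        ggml_get_type_traits(t->type)->to_float(data, tmp.data(), tmp.size());
        upload_to_device(t, tmp.data(), tmp.size() * sizeof(float));
        // quantization info is lost device-side -- the "significant
        // deviation from the current backends" noted above
    } else {
        upload_to_device(t, data, size);
    }
}
```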

slaren avatar Jul 24 '24 03:07 slaren

they have a PyTorch implementation

While it might be helpful, Torch (and by extension PyTorch) does not work on all platforms: it lacks support for ARM (and RISC-V) architectures, and also lacks support for NPUs.

MovGP0 avatar Feb 14 '25 08:02 MovGP0

@ggerganov So I went and forked the project and have been iterating to the point where I have a basic understanding of the layout; I've made stubs and can compile, link, and run a simple unit test against my ggml-backend-dx12.cpp. Can you spell out the expected post-conditions and offer some guidance on which features are most important for me to focus on? I'm a senior in a C.S. program, but I'm not taking any classes right now because we just had a baby, so I'd like to focus on this contribution to learn more and be a useful member of the community.

EDIT: Found this https://github.com/Const-me/Whisper and am going to leverage all of this hard work since DX11 and DX12 are super close.

cafeTechne avatar Apr 01 '25 19:04 cafeTechne

Bad news: DirectML has been moved to "maintenance mode" (https://github.com/microsoft/DirectML/pull/710/files).

aisk avatar Aug 15 '25 13:08 aisk

It has also already been removed from the Intel NPU drivers.

mediouni-m avatar Aug 15 '25 13:08 mediouni-m

lol, interestingly enough, maybe not so much has changed despite "maintenance mode" 😅

Windows Machine Learning is a high-performance machine learning inference API that is powered by ONNX Runtime and DirectML.

https://github.com/microsoft/Windows-Machine-Learning#windows-machine-learning

But of course, the AI ecosystem is moving faster than most; this feature request is one year old, and in 2026 there will be new changes in frameworks, tools, and models...

reneleonhardt avatar Aug 15 '25 13:08 reneleonhardt

DirectML is being deprecated; the new Windows ML is what we should move toward. Can support be added for a Windows ML backend? https://learn.microsoft.com/en-us/windows/ai/new-windows-ml/overview

nathansun0921 avatar Oct 02 '25 18:10 nathansun0921

Windows ML is onnxruntime with a delegate distribution infrastructure
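
Concretely, "onnxruntime with a delegate distribution infrastructure" means the familiar ORT API, with Microsoft picking and shipping the execution providers per machine. A minimal sketch; the model path and EP are placeholders:

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "wml-sketch");
    Ort::SessionOptions opts;
    // Windows ML's main job is selecting/distributing the right execution
    // provider ("delegate") for the hardware, e.g. a QNN or OpenVINO EP;
    // the inference API underneath is just ONNX Runtime.
    Ort::Session session(env, L"model.onnx", opts); // placeholder model path
    return 0;
}
```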

mediouni-m avatar Oct 02 '25 19:10 mediouni-m

DirectML is being deprecated; the new Windows ML is what we should move toward. Can support be added for a Windows ML backend? https://learn.microsoft.com/en-us/windows/ai/new-windows-ml/overview

"Maintenance mode" doesn't mean deprecated.

tishion avatar Oct 20 '25 01:10 tishion

Windows ML is onnxruntime with a delegate distribution infrastructure

Sounds like focusing on ONNX upstream would be a better use of time than focusing on DX12?

cafeTechne avatar Nov 27 '25 19:11 cafeTechne

Sounds like focusing on ONNX upstream would be a better use of time than focusing on DX12?

DirectML is gone from both the Intel and Qualcomm NPU drivers at this point, and was never there in the AMD ones.

mediouni-m avatar Nov 28 '25 01:11 mediouni-m

Thanks for this. I'm personally invested in Vulkan for philosophical and pragmatic reasons, so I was just riffing about ONNX.

cafeTechne avatar Nov 28 '25 01:11 cafeTechne