ZLUDA
Any plans for testing this with PyTorch/TensorFlow/MXNet?
This project would be a lifesaver (if it works) for so many researchers and enthusiasts who need GPU capabilities to train even basic models. Intel GPUs are cheap and abundant, and having this project make these ML tools work would be immensely helpful.
I'm not expecting performance to be on par; I understand NVIDIA has better hardware, but even Intel GPUs would open up so many possibilities.
Also wondering what help might be needed to make this a reality.
To give some perspective, there are tons of Intel GPUs in Windows laptops and Macs. If ZLUDA can enable their owners to train models on the GPU, it would be a huge productivity boost for a lot of people.
Excellent question.
Firstly, if you simply want those frameworks to run on an Intel GPU, then ZLUDA might not be the best way. Deep learning is quite high on Intel's priority list. Every major framework (certainly PyTorch, TensorFlow, MXNet, PaddlePaddle) has a dedicated team working on it to bring the best performance on CPU and GPU. I'm not tracking this all that closely, but AFAIK with the release of oneAPI 1.0 in December all the frameworks should be natively supported with no fiddling. More info here: https://techdecoded.intel.io/resources/intel-oneapi-beta-releases/ and in this video: https://www.youtube.com/watch?v=9y3xpi-yPyA
I recommend that you use the official support made by an army of full-time engineers and funded by a massive corporation. Though if you (or anybody else) want to hack it onto Intel GPUs through ZLUDA, I'm keen to support it 😁
The question really boils down to running cuDNN. That's what every framework uses to get high performance on NVIDIA. One way to support it is to keep adding missing functionality to both major translation layers (the PTX -> SPIR-V kernel compiler and the CUDA -> Level 0 host code translation layer) until enough of cuDNN works to run those frameworks. This, somewhat counter-intuitively, is the wrong approach to the problem. As you may be more or less aware, the NV GPU architecture and the Intel GPU architecture are slightly different. This is not a big deal for the majority of kernels, but it is a big deal for first-party kernels. First-party kernels, and especially kernels in such a major library, are optimized to an abnormal degree. The GPU code is written with particular, exact hardware capabilities in mind: register space, instruction set, bandwidth, compute. It's simply not possible to run it on different hardware without incurring a fairly major performance hit.
Furthermore, to squeeze out a little bit more performance, I fully expect NVIDIA to use the internal CUDA driver API; debugging and re-implementing that is a lot of work. Additionally, we rely on applications shipping architecture-independent PTX code. Pretty much all third-party libraries and applications do this in order to support future hardware out of the box, but first-party libraries shipped with the drivers (like cuDNN) have very little reason to do so.
But there's a different, more performant solution! The approach is to create a ZLUDADNN: basically, a library with the same API as cuDNN that is injected alongside ZLUDA and that contains kernels optimized for Intel GPUs. You would wrap oneDNN and forward all the calls to it, so that e.g. cudnnSetConvolution2dDescriptor(...) ends up using mkldnn::convolution_forward::primitive_desc(...). This would be a much more performant solution (a rough sketch of such a shim follows the list below), but there are several issues one should be aware of:
- You would need to unpack the CUDA-ZLUDA stream into a Level 0 command queue etc. This can be fairly easily done by having ZLUDA expose a special function or functions to peek behind the covers, e.g. something like zluda_cuda_stream_to_ze_queue(...). I would be highly interested in merging and shipping whatever is useful for this purpose in ZLUDA.
- oneDNN is a bit behind the curve here: it is OpenCL-only, with a DPC++ & Level Zero branch that is AFAIK not useful yet. One would need to either rewrite the oneDNN host code to Level 0, convince the oneDNN team to do so, or awkwardly interop between Level 0 and OpenCL (at a performance cost).
- The cuDNN API and the oneDNN API are different, so some hacks might be involved; in the worst-case scenario one might be forced to fork oneDNN.
- The cuDNN API is really massive, so this is a lot of work.
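To make the shim idea concrete, here is a minimal sketch of what a single ZLUDADNN entry point could look like. The types and the cuDNN signature below are simplified stand-ins rather than the real cudnn.h declarations, and the descriptor is only recorded so that a matching oneDNN primitive could later be built on the queue obtained through the hypothetical zluda_cuda_stream_to_ze_queue(...) mentioned above:

```cpp
// Illustrative only: a cuDNN-shaped entry point that records convolution
// parameters so a later call can translate them into a oneDNN primitive.
// The types below are simplified stand-ins, not the real cuDNN headers.

// Minimal stand-ins for the corresponding cuDNN types.
enum cudnnStatus_t { CUDNN_STATUS_SUCCESS = 0, CUDNN_STATUS_BAD_PARAM = 3 };
enum cudnnConvolutionMode_t { CUDNN_CONVOLUTION = 0, CUDNN_CROSS_CORRELATION = 1 };
enum cudnnDataType_t { CUDNN_DATA_FLOAT = 0 };

// The shim's descriptor just stores what the caller asked for. The matching
// oneDNN convolution_forward::primitive_desc would be built lazily when the
// convolution is actually executed, on the Level 0 queue that ZLUDA would
// expose (e.g. via the hypothetical zluda_cuda_stream_to_ze_queue()).
struct cudnnConvolutionStruct {
    int pad_h, pad_w;
    int stride_h, stride_w;
    int dilation_h, dilation_w;
    cudnnConvolutionMode_t mode;
    cudnnDataType_t compute_type;
};
using cudnnConvolutionDescriptor_t = cudnnConvolutionStruct*;

extern "C" cudnnStatus_t cudnnSetConvolution2dDescriptor(
        cudnnConvolutionDescriptor_t desc,
        int pad_h, int pad_w,
        int stride_h, int stride_w,
        int dilation_h, int dilation_w,
        cudnnConvolutionMode_t mode,
        cudnnDataType_t compute_type) {
    if (desc == nullptr)
        return CUDNN_STATUS_BAD_PARAM;
    // Record the parameters; translation to oneDNN happens at execution time.
    *desc = {pad_h, pad_w, stride_h, stride_w,
             dilation_h, dilation_w, mode, compute_type};
    return CUDNN_STATUS_SUCCESS;
}
```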
I hope that makes sense, feel free to ask if it does not.
I think I was being naive to expect this. Even oneDNN works only on Xeon. I was thinking of taking advantage of all that cheap Intel hardware sitting in people's laptops, doing not much. It might not be comparable to NVIDIA, but it would definitely beat CPUs. When it comes to DNN performance, all you need is to beat the CPU, not to match NVIDIA.
I am quite miffed that game developers are able to squeeze performance out of most consumer GPUs, but DNN frameworks can't. Also, looking at DNN frameworks, they have very little interest in running on consumer GPUs, so I guess I give up.
Feel free to close.
Wait, oneDNN supports Intel GPUs with full performance. AFAIK the problem is with actually making TF, PyTorch, etc. use oneDNN-GPU efficiently, expose an API for custom primitives, etc. If you use oneDNN with the GPU backend as if it were a CPU (e.g. malloc at will, memory copy at will, operations at will instead of forming a graph), then performance suffers. Full support for frameworks using oneDNN with GPUs should be released in December with oneAPI 1.0 Gold.
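For reference, here is a minimal sketch (assuming oneDNN built with its GPU runtime) of what "using the GPU backend" looks like at the API level: the engine owns the Intel GPU device, the stream is the queue that primitives are submitted to, and a framework gets good performance by building its primitives once against that engine and executing them back-to-back on the stream rather than doing CPU-style eager round-trips:

```cpp
// Rough sketch, assuming oneDNN (dnnl) built with a GPU runtime.
// It only shows that an Intel GPU engine can be created and that work is
// meant to be submitted as primitives on a dnnl::stream.
#include <iostream>
#include "dnnl.hpp"

int main() {
    // Check whether oneDNN has a GPU runtime and a device is available.
    if (dnnl::engine::get_count(dnnl::engine::kind::gpu) == 0) {
        std::cout << "no oneDNN GPU engine available\n";
        return 0;
    }

    // The engine owns the device; the stream is the execution queue that
    // primitives (convolutions, matmuls, ...) are submitted to.
    dnnl::engine gpu(dnnl::engine::kind::gpu, 0);
    dnnl::stream queue(gpu);

    // Memory created against the GPU engine lives in device-visible storage;
    // on an integrated GPU this is ordinary system RAM.
    dnnl::memory::desc md({1, 3, 224, 224},
                          dnnl::memory::data_type::f32,
                          dnnl::memory::format_tag::nchw);
    dnnl::memory src(md, gpu);

    // A framework integration would build its whole graph of primitives
    // against `gpu` and execute them back-to-back on `queue`, synchronizing
    // only when results are actually needed on the host.
    queue.wait();
    return 0;
}
```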
I'll leave this issue open. It might be useful for anyone who wants to support first-party libraries, whether it's cuDNN or cuBLAS.
Hi,
I am a deep learning researcher relying heavily on PyTorch / TensorFlow. I have limited knowledge of hardware, especially at the low level.
The most annoying thing about using NVIDIA GPUs, for me, is the limited amount of RAM, unless you have an RTX 3090. If I can get your project to run on an Intel integrated GPU (e.g. UHD 630), does this mean that I can train / test big neural network models using the CPU RAM, since it's shared with the integrated GPU?
Thanks!
Tae
I am closing this as no longer relevant. Feel free to reopen if it still applies to the new version.
I think this is still very relevant. Perhaps even more than before.
It seems this issue got closed by an automated script. Could we reopen it, @vosen?
Thanks!
@dumblob I've closed this issue (and all the old issues) because it applies to the old, Intel-based version. The discussion here is 3 years and ~300k lines of code out of date. Please open a new issue.