RFC-0037-Interoperability-Standard-of-3rd-Backend-Integration-Mechanism
Proposal to add an interoperability standard for third-party backends based on the PrivateUse1 mechanism into PyTorch.
Rendered version: https://github.com/FFFrog/rfcs/blob/rfc-for-privateuse1/RFC-0037-Interoperability-Standard-of-3rd-Backend-Integration-Mechanism.md
Thanks a lot to @albanD for the review.
Based on your comments, I have prepared a new diagram to illustrate the architecture outline, please take a look at it.
Thanks for the updated arch diagram. Some comments on the XPU. Listing XPU as "partially out-of-tree" alongside the other out-of-tree devices might be a bit confusing:
- XPU is expected to be fully functional within PyTorch core in the short term, just as CUDA is. Even though some of the ATen ops are supported via an out-of-tree repo, that repo is added as a third-party repo of PyTorch core and built together with it. This is different from other third-party devices, which are maintained out of tree.
- Maintaining an out-of-tree ATen repo is an interim approach to facilitate the XPU upstreaming. We target moving it in-tree in the longer term, so eventually it would be "all in-tree".
Sorry for the confusion I introduced.
I have updated the arch diagram again; please take a look, thank you.
The updated diagram looks good to me. Thanks for taking the time!
Can you please set up some mailing list or update policy for other out-of-tree backend developers?
I'm the author of the OpenCL backend https://github.com/artyom-beilis/pytorch_dlprim and I'm now catching up with the changes in 2.3 and 2.4. Since I only work on it part time, it would be better to have some notices and updates in advance.
Sorry for the late reply, I have been on vacation recently.
My colleagues and I have started the development work, and the initial version will support the Runtime.
We will soon reach out to the community, with the goal of creating a project under the PyTorch organization, and then we will push our initial version to that project as soon as possible.
If you are interested, you are more than welcome to participate in this work.
Which project? In general, anything that would simplify maintaining an out-of-tree backend is welcome :-)
Because I work on it in my spare time and sometimes I just can't keep up with all the changes.
Thanks for the update! It sounds great.
> Can you please set up some mailing list or update policy for other out-of-tree backend developers?
I think a mailing list might be a bit challenging, but looking at the change history of the demo module should give an idea of what was added/changed recently.
> In general, anything that would simplify maintaining an out-of-tree backend is welcome :-) Because I work on it in my spare time and sometimes I just can't keep up with all the changes.
There is quite a bit of churn, and I expect there will still be for a few more months as we fully stabilize the new improved API (you might want to wait a bit to upgrade if you don't have much time). I do expect that it will quickly pay off though, as having a shared interface and extension point will allow us to improve both ease of use (because we designed this API for that exact purpose) and stability (because we have multiple users that will catch accidental regressions).
@artyom-beilis, if there are no other special circumstances, we will open-source our project, which provides a PyTorch third-party reference demo using the PrivateUse1 mechanism, in the next week or so.
This is what we want to do with the project. I just drew a simple diagram; more detailed information can be found in the code.
@albanD, I drew a simple diagram of the overall project structure and what we want to do.
I want to explain something about the diagram.
- The xpu in the diagram is different from Intel's XPU; it is just a name for a generic device.
- Many manufacturers' API designs draw on CUDA to some extent, so using CUDA as the standard maximizes compatibility with various third-party devices.
- If the device has its own dedicated API, then the modules with a blue background in the diagram may need to be changed; if the device API is similar to CUDA, in theory only a few changes are needed.
Hi @albanD @artyom-beilis, sorry for the late feedback.
At present, we have implemented the first version of the demo according to the community's latest third-party device integration mechanism. The main framework is complete, including basic general runtime capabilities, operator registration, autocast, etc.
Of course, there are still many general details that have not been completed, such as:
- More general: except for the npu directory in the repo root (a collection of backend-specific functions that can be replaced with another backend), remove all npu-related names, e.g. rename torch_npu to torch_backend, csrc/npu to csrc/backend, etc.
- Codegen: redesigned to facilitate out-of-the-box use by other new backends
- Backend custom API: provide the ability to integrate backend-specific custom APIs
- Documentation: end-to-end documentation
- Test case sets: a general test case collection, etc.
We will work hard to complete the features and details above. Once everything is ready, we will try to integrate CUDA into PyTorch through this demo and provide a full-process integration tutorial.
If you have any questions or suggestions, please let me know. Thank you.
Hi @FFFrog
Looking at the NPU's readme:
> This project provides a foundational layer that abstracts and standardizes the interaction between PyTorch and different types of GPU hardware.
Does it mean I'm expected to implement an out-of-tree backend in terms of NPU, i.e. as a sort of NPU extension? Should I use it, or can I continue working on my existing out-of-tree implementation?
Finally, there are something like 3 GPU OOT implementations (that I'm aware of): Intel's XPU, Apple Metal, and my dlprimitives/OpenCL.
> This project provides a foundational layer that abstracts and standardizes the interaction between PyTorch and different types of GPU hardware.
> Does it mean I'm expected to implement an out-of-tree backend in terms of NPU, i.e. as a sort of NPU extension? Should I use it, or can I continue working on my existing out-of-tree implementation?
First of all, thank you very much for your comments.
There are two main challenges in integrating a new backend into PyTorch through the PrivateUse1-based third-party device integration mechanism:
- High development threshold: the mechanism is mainly implemented through various scattered, irregular hooks and registration mechanisms, and lacks a unified view
- Poor reusability: the backends integrated through this mechanism share many common features, such as codegen (automatically implementing operator registration, custom operator routing, forward/backward binding, etc.), the common PyTorch API, common memory pool strategies, a common test case set, etc.
However, because the various third-party backends may differ, and in order to ensure universality as much as possible, our current strategy treats CUDA as the standard, and all other backend APIs need to align themselves with the CUDA API (CUDA currently dominates the AI field, and the CUDA API is well known in the industry).
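To make that concrete, here is a minimal Python-level sketch of what "aligning with the CUDA API" can look like; the backend name "foo" and the stub functions are hypothetical, and a real backend would call into its vendor runtime:

```python
import types
import torch

# Give the PrivateUse1 dispatch key a user-facing name first.
torch.utils.rename_privateuse1_backend("foo")

# A torch.cuda-like module; in a real backend these would query the
# vendor runtime instead of returning constants.
foo = types.ModuleType("torch.foo")
foo.device_count = lambda: 1
foo.is_available = lambda: True
foo.current_device = lambda: 0

# Expose it as torch.foo.*; keeping the same entry-point names as
# torch.cuda is what lets generic code treat the backend like CUDA.
torch._register_device_module("foo", foo)
```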
For this demo project, we plan to divide the work into two phases:
- Phase 1: the phase we are in now, which involves copy-and-modify; the project mainly serves as a reference implementation.
- Phase 2: what we will do next: complete the device abstraction layer so that a third-party backend can serve as a plug-in for this demo (it is worth adding that, for the general PyTorch API, ideally the third-party backend only needs to implement the backend API corresponding to the CUDA API, but for custom APIs the backend currently needs to complete the end-to-end integration with PyTorch by itself).
Back to your question: if your time permits, I recommend that you wait until our device abstraction layer is completed before integrating with PyTorch.
@artyom-beilis
> Finally, there are something like 3 GPU OOT implementations (that I'm aware of): Intel's XPU, Apple Metal, and my dlprimitives/OpenCL.
As far as I know:
- Intel XPU: currently in a semi-built-in state, will be fully in-tree later; its dispatch key is `xpu` (dedicated key)
- Apple Metal: this should be an in-tree backend; its dispatch key is `mps` (dedicated key)
- dlprimitives/OpenCL: out-of-tree
- Intel HPU: currently out-of-tree; its dispatch key is `hpu` (dedicated key), but the OOT repo is not open source
- Meta MTIA: out-of-tree; its dispatch key is `mtia` (dedicated key)
- Huawei NPU: out-of-tree; its dispatch key is `PrivateUse1` (public key)
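For backends sitting on the shared `PrivateUse1` key, such as the NPU above, PyTorch ships helpers to surface the key under a friendly name; a minimal sketch, using "npu" as the example name:

```python
import torch

# Surface the PrivateUse1 key as "npu", so torch.device("npu") works
# and tensors report device type "npu".
torch.utils.rename_privateuse1_backend("npu")

# Auto-generate convenience helpers such as Tensor.npu() and
# Tensor.is_npu for the renamed backend (full support also requires a
# registered device module on the C++/Python side).
torch.utils.generate_methods_for_privateuse1_backend()
```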
> High development threshold: the mechanism is mainly implemented through various scattered, irregular hooks and registration mechanisms, and lacks a unified view
From my point of view it was mostly implementing operators - while the biggest problem was to understand which ones are required, which are basic, and what conditions apply - and sometimes there is a lack of documentation (for example, what is the difference between `_copy_from` and `_copy_from_and_resize`???)
> Poor reusability: the backends integrated through this mechanism share many common features, such as codegen (automatically implementing operator registration, custom operator routing, forward/backward binding, etc.),
There are two things here. One is operators that can be implemented in terms of others - it would be nice to have some kind of operator tree that shows the native operators and the ones implemented in terms of others.
Regarding codegen - do you mean automatic kernel code generation, or building operators in terms of other operators?
> the common PyTorch API, common memory pool strategies,
The pool was probably the trickiest part to implement, and it is still sub-optimal in terms of memory allocation. There are also many interesting points to consider that aren't similar to CUDA: in OpenCL, for example, you can't use pointer arithmetic on the host as you can in CUDA - you need to add an offset or use sub-buffers - and some integrated GPU devices share memory with the CPU (Intel, AMD APUs, ARM).
> common test case set, etc.
This would be awesome
> Back to your question: if your time permits, I recommend that you wait until our device abstraction layer is completed before integrating with PyTorch.
I understand, but it is problematic: if I wait for an API to be finalised, I'll wait forever ;-)
What is expected to change? The most critical part, and most of the work, is the operators implemented.
@artyom-beilis Hi, I will get back to you tomorrow; sorry for the delay, I've been a bit busy lately.
> From my point of view it was mostly implementing operators - while the biggest problem was to understand which ones are required, which are basic, and what conditions apply - and sometimes there is a lack of documentation (for example, what is the difference between `_copy_from` and `_copy_from_and_resize`???)
Yes, there are many operators in PyTorch and some of them are very similar to each other. We can roughly divide all operators into two parts:
- Factory operators: all operators related to tensor creation, conversion, etc.
- Computational operators: all operators that operate on tensors
We will provide reference implementations and documentation for all factory operators, but not for the latter, because they are easy to understand.
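For illustration only, a factory operator can be rerouted to the PrivateUse1 key from Python via torch.library; the body below is a stub, since a real backend would allocate device memory there (usually in C++):

```python
import torch

# Register implementations into the existing aten namespace.
lib = torch.library.Library("aten", "IMPL")

def empty_memory_format(size, dtype=None, layout=None, device=None,
                        pin_memory=None, memory_format=None):
    # A real backend would allocate device memory and wrap it in a
    # Tensor here; this sketch only shows where the hook goes.
    raise NotImplementedError("allocate on the device and return a Tensor")

# Route aten::empty.memory_format to our function whenever it is
# dispatched on the PrivateUse1 key.
lib.impl("empty.memory_format", empty_memory_format, "PrivateUse1")
```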
> There are two things here. One is operators that can be implemented in terms of others - it would be nice to have some kind of operator tree that shows the native operators and the ones implemented in terms of others.
Yes, PyTorch has many operators that are composed of other, more basic operators. We could provide a tree list like you describe, but it would have timeliness issues because the relationships between operators keep being updated. One approach that may help:
- Compile PyTorch with DEBUG enabled
- export TORCH_SHOW_DISPATCH_TRACE=1
- python -c "import torch; torch.rand(3,3)"
Then you will get a dispatch trace of the operators like the one below, and you will know which operators need to be implemented:
```
[call] op=[aten::rand], key=[BackendSelect]
[redispatch] op=[aten::rand], key=[CPU]
[call] op=[aten::empty.memory_format], key=[BackendSelect]
[redispatch] op=[aten::empty.memory_format], key=[CPU]
[call] op=[aten::uniform_], key=[CPU]
```
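On top of the trace, you can ask the dispatcher whether an operator is composite (implemented in terms of other operators) and therefore usually needs no backend kernel. Note these are private torch._C helpers present in recent builds and may change:

```python
import torch

# True: aten::linear decomposes into other ops (CompositeImplicitAutograd),
# so a backend normally does not have to implement it.
print(torch._C._dispatch_has_kernel_for_dispatch_key(
    "aten::linear", "CompositeImplicitAutograd"))

# False: a factory op like this one needs a real backend kernel.
print(torch._C._dispatch_has_kernel_for_dispatch_key(
    "aten::empty.memory_format", "CompositeImplicitAutograd"))

# Dump all kernels registered for an operator, one per dispatch key.
print(torch._C._dispatch_dump("aten::empty.memory_format"))
```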
> Regarding codegen - do you mean automatic kernel code generation, or building operators in terms of other operators?
As for codegen, it will generate a lot of code according to our requirements, including but not limited to forward operator registration, backward operator registration, custom operator routing files, etc.
All you need to do is implement the operators related to the specific backend and provide a yaml file with the operator info; the codegen will automatically generate all the code you need.
I would like to add that the codegen is still under development and the design of the yaml has not yet been determined; you can take PyTorch's yaml as a reference for now.
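Since the yaml design is not finalized, the entry below is purely hypothetical, modeled on PyTorch's own native_functions.yaml, just to show the kind of information such a codegen would consume:

```yaml
# Hypothetical entry, modeled on PyTorch's native_functions.yaml;
# the demo's actual schema is still under design.
- func: add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
  dispatch:
    PrivateUse1: my_backend_add   # backend kernel the codegen would register
```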
> The pool was probably the trickiest part to implement, and it is still sub-optimal in terms of memory allocation. There are also many interesting points to consider that aren't similar to CUDA: in OpenCL, for example, you can't use pointer arithmetic on the host as you can in CUDA - you need to add an offset or use sub-buffers - and some integrated GPU devices share memory with the CPU (Intel, AMD APUs, ARM).
I absolutely agree with you. We plan to provide several basic memory pool strategies, such as a CUDA-style memory pool, an OpenCL-style memory pool, etc. A new backend can pick the most appropriate strategy to implement its allocator according to its own characteristics; of course, it can also implement its own memory pool from scratch.
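As a toy sketch of one such strategy (not the demo's actual code): a caching pool that rounds requests up to a block size and keeps freed blocks in a free list, handing out (buffer, offset) pairs so the same scheme also fits runtimes like OpenCL that forbid host-side pointer arithmetic:

```python
from collections import defaultdict

class CachingPool:
    """Toy caching allocator: blocks are cached on free and reused on
    malloc instead of going back to the device driver every time."""

    BLOCK = 512  # round requests up to a multiple of this

    def __init__(self, raw_alloc):
        self._raw_alloc = raw_alloc      # backend-specific allocation call
        self._free = defaultdict(list)   # rounded size -> cached buffers

    def malloc(self, nbytes):
        size = -(-nbytes // self.BLOCK) * self.BLOCK  # ceil to block size
        if self._free[size]:
            buf = self._free[size].pop()  # reuse a cached block
        else:
            buf = self._raw_alloc(size)   # fall back to the device driver
        return (buf, 0)  # (buffer, offset): no host pointer arithmetic

    def free(self, handle, nbytes):
        buf, _ = handle
        size = -(-nbytes // self.BLOCK) * self.BLOCK
        self._free[size].append(buf)      # cache instead of releasing

# Usage with a fake driver that just numbers its buffers:
ids = iter(range(10**6))
pool = CachingPool(lambda size: next(ids))
h = pool.malloc(1000)
pool.free(h, 1000)
assert pool.malloc(1000) == h  # the cached block is reused
```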
@albanD, sorry to bother you again. It seems that you are working on accelerator diversity in PyTorch, and perhaps this project could help you accelerate that goal. Could you take a quick look at the project and give us some advice?
> We will provide reference implementations and documentation for all factory operators, but not for the latter, because they are easy to understand.
That would be fantastic... Because sometimes it just makes me wonder what an operator is doing and under what conditions.
> All you need to do is implement the operators related to the specific backend and provide a yaml file,
This is the first time I hear of the yaml...
> We plan to provide several basic memory pool strategies, such as a CUDA-style memory pool, an OpenCL-style memory pool
Yes, this would be nice, because currently the pytorch opencl/dlprimitives backend suffers from much more memory use than it should.
> perhaps this project could help you accelerate that goal. Could you take a quick look at the project and give us some advice?
Probably what I need most is some kind of place where you can actually ask questions about the stuff that isn't easy to understand. So far @albanD has done an amazing job helping with the OpenCL backend. Currently I mostly ask questions on dev-discuss, but sometimes I feel there is a lot of such stuff (currently stuck on torch.load... without moving the model to CPU and back to the device).
Thanks a lot!
Our ultimate goal is that a new backend can be integrated into PyTorch smoothly by implementing some backend-specific APIs and structures, without having to consider any PyTorch-internal details such as CPU fallback (@albanD has done it for dlprimitives), backend renaming, etc.
Of course, it is not always convenient for every backend to integrate into PyTorch through this project; if the new backend doesn't involve many PyTorch features, doing it from scratch may also be a good choice.
By the way, if you have any questions about new backend integration, you can also mention me or file an issue in the project; I am very glad to share with everyone.