RFC-0037-Interoperability-Standard-of-3rd-Backend-Integration-Mechanism
Proposal to add an interoperability standard for third-party backends based on the PrivateUse1 mechanism into PyTorch.
Rendered version: https://github.com/FFFrog/rfcs/blob/rfc-for-privateuse1/RFC-0037-Interoperability-Standard-of-3rd-Backend-Integration-Mechanism.md
Thanks a lot to @albanD for the review.
Based on your comments, I have prepared a new diagram to illustrate the architecture outline, please take a look at it.
Thanks for the updated arch diagram. Some comments on the XPU. Listing XPU as "partially out-of-tree" alongside the other out-of-tree devices might be a bit confusing:
- XPU is expected to be fully functional within PyTorch core in the short term, just as CUDA is. Even though some of the ATen ops are supported via an out-of-tree repo, that repo is added as a third-party repo of PyTorch core and built together with it. This is different from other third-party devices, which are maintained out of tree.
- Maintaining an out-of-tree ATen repo is an interim approach to facilitate the XPU upstreaming. We target moving it in-tree in the longer term, so eventually it would be "all in-tree".
Sorry for the confusion I introduced.
I have updated the arch diagram again; please take a look, thank you.
The updated diagram looks good to me. Thanks for taking the time!
Can you please set up some mailing list or update policy for other out-of-tree backend developers?
I'm the author of the OpenCL backend https://github.com/artyom-beilis/pytorch_dlprim and I'm now catching up with the changes in 2.3 and 2.4. Since I only work on it part time, it would be better to have some notices and updates in advance.
Sorry for the late reply, I have been on vacation recently.
My colleagues and I have started the development work, and the initial version will support the Runtime.
We will soon reach out to the community, with the goal of creating a project under the PyTorch organization, and then we will push our initial version to that project as soon as possible.
If you are interested, you are more than welcome to participate in this work.
Which project? In general, anything that would simplify maintaining an out-of-tree backend is welcome :-)
Because I work on it in my spare time and sometimes I just can't keep up with all the changes.
Thanks for the update! It sounds great.
> Can you please set up some mailing list or update policy for other out-of-tree backend developers?
I think a mailing list might be a bit challenging, but looking at the change history of the demo module should give an idea of what was added/changed recently.
> In general, anything that would simplify maintaining an out-of-tree backend is welcome :-) Because I work on it in my spare time and sometimes I just can't keep up with all the changes.
There is quite a bit of churn, and I expect there will still be for a few more months as we fully stabilize the new improved API (you might want to wait a bit to upgrade if you don't have much time). I do expect that it will quickly pay off though, as having a shared interface and extension point will allow us to improve both ease of use (because we designed this API for that exact purpose) and stability (because we have multiple users that will catch accidental regressions).
@artyom-beilis, if there are no other special circumstances, we will open-source our project, which provides a PyTorch third-party reference demo using the PrivateUse1 mechanism, in the next week or so.
This is what we want to do with the project. I just drew a simple diagram; more detailed information can be found in the code.
@albanD, I drew a simple diagram of the overall project structure and what we want to do.
I want to explain something about the diagram.
- The xpu in the diagram is different from Intel's XPU; it is just a name for a generic device.
- Many manufacturers' API designs draw on CUDA to some extent, so using CUDA as the standard maximizes compatibility with various third-party devices.
- If the device has its own dedicated API, then the modules with a blue background in the diagram may need to be changed; if the device API is similar to CUDA, in theory only a few changes are needed.
Hi @albanD @artyom-beilis, sorry for the late feedback.
At present, we have implemented the first version of the demo according to the community's latest third-party device integration mechanism. The main framework is complete, including basic general runtime capabilities, operator registration, autocast, etc.
Of course, there are still many general details that have not been completed, such as:
- More general: except for the npu directory in the repo root (a collection of backend-specific functions that can be replaced with another backend), remove all npu-related names, e.g. rename torch_npu to torch_backend, csrc/npu to csrc/backend, etc.
- Codegen: redesigned to facilitate out-of-the-box use by other new backends
- Backend custom API: provide the ability to integrate backend-specific custom APIs
- Documentation: end-to-end documentation
- Test case sets: a general test case collection, etc.
We will work hard to complete the features and details above. Once everything is ready, we will try to integrate CUDA into PyTorch through this demo and provide a full-process integration tutorial.
If you have any questions or suggestions, please let me know. Thank you.
Hi @FFFrog
Looking at the NPU's readme:
> This project provides a foundational layer that abstracts and standardizes the interaction between PyTorch and different types of GPU hardware.
Does it mean I'm expected to implement an out-of-tree backend in terms of NPU, i.e. as a sort of NPU extension? Should I use it, or can I continue working on my existing out-of-tree implementation?
Finally, there are something like 3 GPU OOT implementations (that I'm aware of): Intel's XPU, Apple Metal, and my dlprimitives/OpenCL.
> This project provides a foundational layer that abstracts and standardizes the interaction between PyTorch and different types of GPU hardware.
> Does it mean I'm expected to implement an out-of-tree backend in terms of NPU, i.e. as a sort of NPU extension? Should I use it, or can I continue working on my existing out-of-tree implementation?
First of all, thank you very much for your comments.
There are two main challenges in integrating a new backend into PyTorch through the PrivateUse1-based third-party device integration mechanism:
- High development threshold: the mechanism is mainly implemented through various scattered, irregular hooks and registration mechanisms, and lacks a unified view
- Poor reusability: the backends integrated through this mechanism share many common features, such as codegen (automatically implementing operator registration, custom operator routing, forward/backward binding, etc.), the common PyTorch API, common memory pool strategies, a common test case set, etc.
However, because the various third-party backends may differ, and in order to ensure universality as much as possible, our current strategy treats CUDA as the standard, and all other backend APIs need to align themselves with the CUDA API (CUDA currently dominates the AI field, and the CUDA API is well known in the industry).
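To make that concrete, here is a minimal Python-level sketch of what "aligning with the CUDA API" can look like; the backend name "foo" and the stub functions are hypothetical, and a real backend would call into its vendor runtime:

```python
import types
import torch

# Give the PrivateUse1 dispatch key a user-facing name first.
torch.utils.rename_privateuse1_backend("foo")

# A torch.cuda-like module; in a real backend these would query the
# vendor runtime instead of returning constants.
foo = types.ModuleType("torch.foo")
foo.device_count = lambda: 1
foo.is_available = lambda: True
foo.current_device = lambda: 0

# Expose it as torch.foo.*; keeping the same entry-point names as
# torch.cuda is what lets generic code treat the backend like CUDA.
torch._register_device_module("foo", foo)
```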
For this demo project, we plan to divide the work into two phases:
- Phase 1: the phase we are in now, which involves copy-and-modify; the project mainly serves as a reference implementation.
- Phase 2: what we will do next: complete the device abstraction layer so that a third-party backend can serve as a plug-in for this demo (it is worth adding that, for the general PyTorch API, ideally the third-party backend only needs to implement the backend API corresponding to the CUDA API, but for custom APIs the backend currently needs to complete the end-to-end integration with PyTorch by itself).
Back to your question: if your time permits, I recommend that you wait until our device abstraction layer is completed before integrating with PyTorch.
@artyom-beilis
> Finally, there are something like 3 GPU OOT implementations (that I'm aware of): Intel's XPU, Apple Metal, and my dlprimitives/OpenCL.
As far as I know:
- Intel XPU: currently in a semi-built-in state, will be fully in-tree later; its dispatch key is `xpu` (dedicated key)
- Apple Metal: this should be an in-tree backend; its dispatch key is `mps` (dedicated key)
- dlprimitives/OpenCL: out-of-tree
- Intel HPU: currently out-of-tree; its dispatch key is `hpu` (dedicated key), but the OOT repo is not open source
- Meta MTIA: out-of-tree; its dispatch key is `mtia` (dedicated key)
- Huawei NPU: out-of-tree; its dispatch key is `PrivateUse1` (public key)
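For backends sitting on the shared `PrivateUse1` key, such as the NPU above, PyTorch ships helpers to surface the key under a friendly name; a minimal sketch, using "npu" as the example name:

```python
import torch

# Surface the PrivateUse1 key as "npu", so torch.device("npu") works
# and tensors report device type "npu".
torch.utils.rename_privateuse1_backend("npu")

# Auto-generate convenience helpers such as Tensor.npu() and
# Tensor.is_npu for the renamed backend (full support also requires a
# registered device module on the C++/Python side).
torch.utils.generate_methods_for_privateuse1_backend()
```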
> High development threshold: the mechanism is mainly implemented through various scattered, irregular hooks and registration mechanisms, and lacks a unified view
From my point of view it was mostly implementing operators - while the biggest problem was to understand which ones are required, which are basic, and what conditions apply - and sometimes there is a lack of documentation (for example, what is the difference between `_copy_from` and `_copy_from_and_resize`???)
> Poor reusability: the backends integrated through this mechanism share many common features, such as codegen (automatically implementing operator registration, custom operator routing, forward/backward binding, etc.),
There are two things here. One is operators that can be implemented in terms of others - it would be nice to have some kind of operator tree that shows the native operators and the ones implemented in terms of others.
Regarding codegen - do you mean automatic kernel code generation, or building operators in terms of other operators?
> the common PyTorch API, common memory pool strategies,
The pool was probably the trickiest part to implement, and it is still sub-optimal in terms of memory allocation. There are also many interesting points to consider that aren't similar to CUDA: in OpenCL, for example, you can't use pointer arithmetic on the host as you can in CUDA - you need to add an offset or use sub-buffers - and some integrated GPU devices share memory with the CPU (Intel, AMD APUs, ARM).
> common test case set, etc.
This would be awesome
> Back to your question: if your time permits, I recommend that you wait until our device abstraction layer is completed before integrating with PyTorch.
I understand, but it is problematic: if I wait for an API to be finalised, I'll wait forever ;-)
What is expected to change? The most critical part, and most of the work, is the operators implemented.
@artyom-beilis Hi, I will get back to you tomorrow; sorry for the delay, I've been a bit busy lately.
> From my point of view it was mostly implementing operators - while the biggest problem was to understand which ones are required, which are basic, and what conditions apply - and sometimes there is a lack of documentation (for example, what is the difference between `_copy_from` and `_copy_from_and_resize`???)
Yes, there are many operators in PyTorch and some of them are very similar to each other. We can roughly divide all operators into two parts:
- Factory operators: all operators related to tensor creation, conversion, etc.
- Computational operators: all operators that operate on tensors
We will provide reference implementations and documentation for all factory operators, but not for the latter, because they are easy to understand.
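For illustration only, a factory operator can be rerouted to the PrivateUse1 key from Python via torch.library; the body below is a stub, since a real backend would allocate device memory there (usually in C++):

```python
import torch

# Register implementations into the existing aten namespace.
lib = torch.library.Library("aten", "IMPL")

def empty_memory_format(size, dtype=None, layout=None, device=None,
                        pin_memory=None, memory_format=None):
    # A real backend would allocate device memory and wrap it in a
    # Tensor here; this sketch only shows where the hook goes.
    raise NotImplementedError("allocate on the device and return a Tensor")

# Route aten::empty.memory_format to our function whenever it is
# dispatched on the PrivateUse1 key.
lib.impl("empty.memory_format", empty_memory_format, "PrivateUse1")
```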
> There are two things here. One is operators that can be implemented in terms of others - it would be nice to have some kind of operator tree that shows the native operators and the ones implemented in terms of others.
Yes, PyTorch has many operators that are composed of other, more basic operators. We could provide a tree list like you describe, but it would have timeliness issues because the relationships between operators keep being updated. One approach that may help:
- Compile PyTorch with DEBUG enabled
- export TORCH_SHOW_DISPATCH_TRACE=1
- python -c "import torch; torch.rand(3,3)"
Then you will get a dispatch trace of the operators like the one below, and you will know which operators need to be implemented:
```
[call] op=[aten::rand], key=[BackendSelect]
[redispatch] op=[aten::rand], key=[CPU]
[call] op=[aten::empty.memory_format], key=[BackendSelect]
[redispatch] op=[aten::empty.memory_format], key=[CPU]
[call] op=[aten::uniform_], key=[CPU]
```
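On top of the trace, you can ask the dispatcher whether an operator is composite (implemented in terms of other operators) and therefore usually needs no backend kernel. Note these are private torch._C helpers present in recent builds and may change:

```python
import torch

# True: aten::linear decomposes into other ops (CompositeImplicitAutograd),
# so a backend normally does not have to implement it.
print(torch._C._dispatch_has_kernel_for_dispatch_key(
    "aten::linear", "CompositeImplicitAutograd"))

# False: a factory op like this one needs a real backend kernel.
print(torch._C._dispatch_has_kernel_for_dispatch_key(
    "aten::empty.memory_format", "CompositeImplicitAutograd"))

# Dump all kernels registered for an operator, one per dispatch key.
print(torch._C._dispatch_dump("aten::empty.memory_format"))
```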
> Regarding codegen - do you mean automatic kernel code generation, or building operators in terms of other operators?
As for codegen, it will generate a lot of code according to our requirements, including but not limited to forward operator registration, backward operator registration, custom operator routing files, etc.
All you need to do is implement the operators related to the specific backend and provide a yaml file with the operator info; the codegen will automatically generate all the code you need.
I would like to add that the codegen is still under development and the design of the yaml has not yet been determined; you can take PyTorch's yaml as a reference for now.
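Since the yaml design is not finalized, the entry below is purely hypothetical, modeled on PyTorch's own native_functions.yaml, just to show the kind of information such a codegen would consume:

```yaml
# Hypothetical entry, modeled on PyTorch's native_functions.yaml;
# the demo's actual schema is still under design.
- func: add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor
  dispatch:
    PrivateUse1: my_backend_add   # backend kernel the codegen would register
```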
> The pool was probably the trickiest part to implement, and it is still sub-optimal in terms of memory allocation. There are also many interesting points to consider that aren't similar to CUDA: in OpenCL, for example, you can't use pointer arithmetic on the host as you can in CUDA - you need to add an offset or use sub-buffers - and some integrated GPU devices share memory with the CPU (Intel, AMD APUs, ARM).
I absolutely agree with you. We plan to provide several basic memory pool strategies, such as a CUDA-style memory pool, an OpenCL-style memory pool, etc. A new backend can pick the most appropriate strategy to implement its allocator according to its own characteristics; of course, it can also implement its own memory pool from scratch.
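As a toy sketch of one such strategy (not the demo's actual code): a caching pool that rounds requests up to a block size and keeps freed blocks in a free list, handing out (buffer, offset) pairs so the same scheme also fits runtimes like OpenCL that forbid host-side pointer arithmetic:

```python
from collections import defaultdict

class CachingPool:
    """Toy caching allocator: blocks are cached on free and reused on
    malloc instead of going back to the device driver every time."""

    BLOCK = 512  # round requests up to a multiple of this

    def __init__(self, raw_alloc):
        self._raw_alloc = raw_alloc      # backend-specific allocation call
        self._free = defaultdict(list)   # rounded size -> cached buffers

    def malloc(self, nbytes):
        size = -(-nbytes // self.BLOCK) * self.BLOCK  # ceil to block size
        if self._free[size]:
            buf = self._free[size].pop()  # reuse a cached block
        else:
            buf = self._raw_alloc(size)   # fall back to the device driver
        return (buf, 0)  # (buffer, offset): no host pointer arithmetic

    def free(self, handle, nbytes):
        buf, _ = handle
        size = -(-nbytes // self.BLOCK) * self.BLOCK
        self._free[size].append(buf)      # cache instead of releasing

# Usage with a fake driver that just numbers its buffers:
ids = iter(range(10**6))
pool = CachingPool(lambda size: next(ids))
h = pool.malloc(1000)
pool.free(h, 1000)
assert pool.malloc(1000) == h  # the cached block is reused
```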
@albanD, sorry to bother you again. It seems that you are working on accelerator diversity in PyTorch, and perhaps this project could help you accelerate that goal. Could you take a quick look at the project and give us some advice?
> We will provide reference implementations and documentation for all factory operators, but not for the latter, because they are easy to understand.
That would be fantastic... Because sometimes it just makes me wonder what an operator is doing and under what conditions.
> All you need to do is implement the operators related to the specific backend and provide a yaml file,
This is the first time I hear of the yaml...
> We plan to provide several basic memory pool strategies, such as a CUDA-style memory pool, an OpenCL-style memory pool
Yes, this would be nice, because currently the pytorch opencl/dlprimitives backend suffers from much more memory use than it should.
> perhaps this project could help you accelerate that goal. Could you take a quick look at the project and give us some advice?
Probably what I need most is some kind of place where you can actually ask questions about the stuff that isn't easy to understand. So far @albanD has done an amazing job helping with the OpenCL backend. Currently I mostly ask questions on dev-discuss, but sometimes I feel there is a lot of such stuff (currently stuck on torch.load... without moving the model to CPU and back to the device).
Thanks a lot!
Our ultimate goal is that a new backend can be integrated into PyTorch smoothly by implementing some backend-specific APIs and structures, without having to consider any PyTorch-internal details such as CPU fallback (@albanD has done it for dlprimitives), backend renaming, etc.
Of course, it is not always convenient for every backend to integrate into PyTorch through this project; if the new backend doesn't involve many PyTorch features, doing it from scratch may also be a good choice.
By the way, if you have any questions about new backend integration, you can also mention me or file an issue in the project; I am very glad to share with everyone.