[RFC] Make Kineto extendable to runtimes other than CUDA
🚀 Feature
This RFC proposes adding an abstract profiling shim layer to Kineto that plug-ins can extend. The shim aligns with CUPTI, so the common structure of the profiling information can be shared and processed with the same post-analysis. Enabling profiling for a new runtime then requires only minimal changes to the Kineto packages.
Motivation
PyTorch uses Kineto as its new profiler middleware, but CUPTI (or its ROCm counterpart for AMD GPUs) is the only backend Kineto supports for PyTorch CUDA. As PyTorch supports more and more new devices beyond CUDA, such as XPU for Intel GPUs based on Intel's SYCL, Kineto's tight coupling to CUPTI makes it hard to support new device types in PyTorch. An easily extendable Kineto is required to collect profiling information on different runtimes.
Pitch
An extendable Kineto should support extensions for other runtimes that operate like plug-ins, such as SYCL on the XPU backend. By abstracting the CuptiActivityProfiler class into a registrable interface that backends can override, Kineto could support backends other than CUDA/ROCm. PyTorch can pass the device information (i.e., DeviceType) to Kineto and let Kineto choose a different path for each backend. This puts requirements on the universality of some core methods such as startTrace/stopTrace/processTrace. Separately, for the XPU path, we could override those API functions on our Intel extension side, outside of Kineto and PyTorch, while for the CUDA/ROCm path everything remains as before. Furthermore, this abstraction won't affect any build: with or without XPU, the build will always pass as usual. The only new behavior is that, without XPU built in, the profiler will emit a warning for XPU users.
To describe the above design, we propose the structure shown in this picture.
For this design, we need to (a rough sketch follows this list):
- Abstract a middle class, ActivityProfilerBase
- Pass DeviceType as an argument (not shown in the picture)
- Abstract the core functions within the ActivityProfilerBase class
- Dispatch the XPU path alongside the existing paths within the CuptiActivityProfiler class
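To make the proposed shape concrete, here is a minimal C++ sketch of the abstraction. Only ActivityProfilerBase, CuptiActivityProfiler, startTrace/stopTrace/processTrace, and DeviceType come from this RFC; every other name and detail is hypothetical:

```cpp
#include <memory>
#include <stdexcept>

// Hypothetical device enum; PyTorch would pass this through to Kineto.
enum class DeviceType { CUDA, ROCM, XPU };

// Proposed abstract middle class that backend profilers override.
class ActivityProfilerBase {
 public:
  virtual ~ActivityProfilerBase() = default;
  virtual void startTrace() = 0;
  virtual void stopTrace() = 0;
  virtual void processTrace() = 0;
};

// The existing CUPTI-based profiler keeps its current behavior.
class CuptiActivityProfiler : public ActivityProfilerBase {
 public:
  void startTrace() override { /* existing CUPTI logic */ }
  void stopTrace() override { /* existing CUPTI logic */ }
  void processTrace() override { /* existing post-analysis */ }
};

// An XPU profiler would override the same interface from the Intel
// extension side, outside of Kineto and PyTorch.
class XpuActivityProfiler : public ActivityProfilerBase {
 public:
  void startTrace() override { /* SYCL/XPU runtime logic */ }
  void stopTrace() override { /* SYCL/XPU runtime logic */ }
  void processTrace() override { /* shared post-analysis */ }
};

// Kineto chooses a backend path based on the DeviceType from PyTorch.
std::unique_ptr<ActivityProfilerBase> makeProfiler(DeviceType device) {
  switch (device) {
    case DeviceType::CUDA:
    case DeviceType::ROCM:
      return std::make_unique<CuptiActivityProfiler>();
    case DeviceType::XPU:
      // In a build without XPU support, this branch would instead
      // emit a warning for XPU users.
      return std::make_unique<XpuActivityProfiler>();
  }
  throw std::invalid_argument("unsupported device type");
}
```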
Additional context
Open issues:
Discuss and choose a proper approach, from those offered above, to support the Intel extension for PyTorch
@gdankel Would you please help review this RFC? I think you are the expert on this structure, since you contributed most of the code in the CuptiActivityProfiler class. We would appreciate your suggestions on this RFC.
@robieta @gdankel @aaronenyeshi @malfet @ezyang
Could you help review this RFC, or tell us who should review it? We are going to support out-of-tree PyTorch devices, like XPU, in Kineto with this RFC.
@adnanaziz might be able to get the right reviewers on here. Thanks.
Thanks @orionr
@adnanaziz Could you help review this RFC and give comments?
I'll be happy to review - just need to find some time, please be a little patient while I refresh my memory ;)
@gdankel @adnanaziz May I know if there is any feedback from you on this RFC? I am afraid that if progress moves too slowly, we cannot catch the PyTorch 1.13 release to pull this feature in for XPU.
Hi, I seem to recall some refactoring diffs not yet published to GitHub that address most of this, or maybe they were published but I missed them in my search. I would suggest landing those first, since they should go a long way towards addressing this at the structural level. @chaekit you may be able to check?
@robieta @aaronenyeshi @chaekit @briancoutinho Could one of you dig up the other refactoring patch(es) and post them to github?
Specifically I introduced a composite profiler class, removed the ActivityProfilerProxy etc. I think it's better to discuss this in context of that.
@gdankel @robieta @aaronenyeshi @chaekit @briancoutinho Could you give more suggestions on this RFC? We will then implement it based on your comments.
Hi @gdankel - I will look for the refactoring patch. We were not able to land all the refactoring patches at the time.
Here is the first one: https://github.com/pytorch/kineto/pull/698
I'll add some high-level description to that one (need to refresh my memory as well) and then we can see how it fits in with your needs and this RFC.
Thanks for the comments. @gdankel @xunsongh please help review and give your feedback.
Yes, I'm reviewing it and will give my suggestions based on these modifications from my side in due time. Please allow me some time to understand this PR.
I've finally gotten around to adding some high-level description. This is still WIP so a few more patches are expected after #698. But hopefully you should be able to see whether this will do what you need in principle. If you need to move faster then I'm OK with you landing some intermediate refactoring diffs as long as they move it in a direction we can agree on.
@gdankel Hi, and sorry for the late reply. I have fully reviewed your PR https://github.com/pytorch/kineto/pull/698 and yes, it looks very good. But I still have a big requirement that is not implemented in this PR. For XPU we need a registration hook in the profiler initialization path, because our extension is lazily installed and initialized after PyTorch is installed, which is very different from the pre-installed CUPTI. That means Kineto must offer a method for our extension to lazily register our profiler session into Kineto, overriding the necessary struct (IActivityProfiler) and the necessary methods (startTrace, stopTrace, etc.). Also, this modification of the top layer of interfaces must work together with the change in PyTorch (https://github.com/pytorch/pytorch/pull/94502). This task is a blocking issue for our extension's development. So, would you please first offer an opinion on this refinement? Or should I push a pull request that adds the hook and ask for your review? Once this hook is implemented, I think you will have more time to optimize the bottom layer inside Kineto. Hoping for a quick reply, and thank you very much. FWD to @aaronenyeshi @gujinghui
Thanks so much for the comments, and very sorry for the slow response. Yes, we do need to speed up XPU support in Kineto. We will follow your PR to see which changes are needed, then split them up and submit PRs. Is that OK with you?
I'm not sure I understand exactly - to register the profiler you can call activityProfiler().registerProfiler() to register a factory... this factory can be a stub if you like, one that does nothing until you want it to, at which point it can lazily load libraries or whatever else. Is there something else missing?
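For illustration, here is a rough C++ sketch of what such a lazily initialized stub could look like. The exact Kineto signatures for IActivityProfiler and the registration call may differ from what is shown, so treat all names and helpers here (XpuStubProfiler, tryLoadXpuRuntime, etc.) as hypothetical:

```cpp
#include <memory>

// Assumed shape of Kineto's profiler session interface; names taken
// from this thread, exact signatures may differ.
struct IActivityProfiler {
  virtual ~IActivityProfiler() = default;
  virtual void startTrace() = 0;
  virtual void stopTrace() = 0;
};

// Hypothetical stub profiler: registered eagerly, but it does nothing
// until the XPU extension has been lazily installed and initialized.
class XpuStubProfiler : public IActivityProfiler {
 public:
  void startTrace() override {
    if (ensureXpuLoaded()) {
      // ... forward to the real XPU profiler session ...
    }
  }
  void stopTrace() override {
    if (xpuLoaded_) {
      // ... forward to the real XPU profiler session ...
    }
  }

 private:
  bool ensureXpuLoaded() {
    if (!xpuLoaded_) {
      // Lazily dlopen()/initialize the XPU runtime libraries here.
      xpuLoaded_ = tryLoadXpuRuntime();  // hypothetical helper
    }
    return xpuLoaded_;
  }
  static bool tryLoadXpuRuntime() { return false; /* placeholder */ }
  bool xpuLoaded_ = false;
};

// A factory like this could then be passed to the registration call
// described above, e.g. activityProfiler().registerProfiler(...).
std::unique_ptr<IActivityProfiler> makeXpuStubProfiler() {
  return std::make_unique<XpuStubProfiler>();
}
```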
@gdankel Thanks for your comments. We will try to follow this API to confirm if everything works for us. Thanks a lot.
Alright. I will follow.