
Support for CUDA kernels

Andrei-Aksionov opened this issue Mar 25 '24 • 6 comments

🚀 Feature

Hi there 👋

From the main readme file I noticed that Thunder accepts custom kernels, but only ones written in Triton. Is there a plan to support CUDA kernels?

Motivation

I'm only at the beginning of my custom-kernels journey, so I might be misunderstanding something.

From what I've seen online, there are many highly optimized CUDA kernels already available (since CUDA has been around for quite a while). Plus, there is a good chance that someone with a lot of experience writing CUDA kernels (but not Triton) wants to use Thunder (or even integrate it into an existing project).

I personally would like to write custom CUDA kernels for the LitGPT repo after I finish reading the PMPP book.

Andrei-Aksionov avatar Mar 25 '24 08:03 Andrei-Aksionov

Hello Andrei,

Thunder can work with any custom kernel, not just ones written in Triton. Any function that accepts and returns PyTorch tensors can be registered to work with Thunder. Here's a tutorial on connecting CUDA kernels with the PyTorch interface: https://pytorch.org/tutorials/advanced/cpp_extension.html. Once registered in PyTorch, these CUDA extensions can be registered in Thunder with OperatorExecutor.register_implementation.
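
Roughly, the registration flow looks like this (a minimal sketch; `my_cuda_ext.mul_two` stands in for a hypothetical extension compiled per the tutorial above, and details of the `thunder.extend` API may change between versions):

```python
import torch
import thunder
from thunder.extend import OperatorExecutor, register_executor

import my_cuda_ext  # hypothetical PyTorch C++/CUDA extension from the tutorial above

my_ex = OperatorExecutor("my_ex", version="0.1")
register_executor(my_ex)

# The meta function describes the result's shape/dtype so Thunder can trace the op
def mul_two_meta(a):
    return thunder.TensorProxy(like=a)

# fn is the actual implementation that runs when the trace is executed
mul_two = my_ex.register_operator("mul_two", meta=mul_two_meta, fn=my_cuda_ext.mul_two)

def fn(a):
    return mul_two(a)

jfn = thunder.jit(fn, executors=[my_ex])
out = jfn(torch.randn(4, device="cuda"))
```

Mapping an existing torch operation onto a custom kernel goes through my_ex.register_implementation instead; the Apex executor linked below is a complete example of that.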

We have one example executor that uses the cross_entropy CUDA kernel from the Apex project (a usage sketch follows the links):

  • Executor code: https://github.com/Lightning-AI/lightning-thunder/blob/main/thunder/executors/apex_entropyex.py
  • CUDA kernel code: https://github.com/NVIDIA/apex/blob/master/apex/contrib/csrc/xentropy/xentropy_kernel.cu
  • PyTorch C++ interface: https://github.com/NVIDIA/apex/blob/master/apex/contrib/csrc/xentropy/interface.cpp
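
Trying that executor looks roughly like this (a sketch; the apex_ex name comes from the linked file, NVIDIA Apex must be built with its xentropy extension, and executor list handling may differ between versions):

```python
import torch
import thunder
from thunder.executors.apex_entropyex import apex_ex  # requires Apex with the xentropy extension

def loss_fn(logits, labels):
    return torch.nn.functional.cross_entropy(logits, labels)

# Put the Apex executor ahead of the defaults so it claims cross_entropy
jloss = thunder.jit(loss_fn, executors=[apex_ex])

logits = torch.randn(32, 50257, device="cuda", requires_grad=True)
labels = torch.randint(0, 50257, (32,), device="cuda")
print(jloss(logits, labels))
```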

IvanYashchuk avatar Mar 25 '24 12:03 IvanYashchuk

Hey Ivan,

Thanks for the answer.

> Any function that accepts and returns PyTorch tensors can be registered to work with Thunder.

Sounds promising.

Maybe the README file should reflect this too. What do you think?

...

- TransformerEngine
- PyTorch eager
- custom kernels, including those written with OpenAI Triton and NVIDIA CUDA

In fact, any function that accepts and returns PyTorch tensors can be registered to work with Thunder, which makes it compatible with any custom kernel.

...

?

Andrei-Aksionov avatar Mar 25 '24 13:03 Andrei-Aksionov

@IvanYashchuk I was wondering whether it would make sense to have a PyCUDA option as well. This would give us the option to be more decoupled from PyTorch's extension mechanism (and the C++ ABI).

lantiga avatar Mar 25 '24 13:03 lantiga

PyTorch supports the CUDA Array Interface, and any project that works with this interface can accept and write to PyTorch tensors, including PyCUDA (https://documen.tician.de/pycuda/tutorial.html#interoperability-with-other-libraries-using-the-cuda-array-interface), Numba (https://numba.readthedocs.io/en/stable/cuda/kernels.html), CuPy (https://docs.cupy.dev/en/stable/user_guide/kernel.html), and others.
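
For example, here is a minimal, Thunder-independent sketch of a Numba kernel writing directly into a PyTorch tensor through that interface:

```python
import torch
from numba import cuda

@cuda.jit
def add_one(x):
    i = cuda.grid(1)
    if i < x.size:
        x[i] += 1.0

t = torch.zeros(1024, device="cuda")
threads = 128
blocks = (t.numel() + threads - 1) // threads
# Numba consumes t via t.__cuda_array_interface__; no copies are made
add_one[blocks, threads](t)
torch.cuda.synchronize()
assert bool((t == 1.0).all())
```

A function like this, wrapped so it accepts and returns PyTorch tensors, could then be registered with Thunder exactly as in the sketch above.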

IvanYashchuk avatar Mar 25 '24 13:03 IvanYashchuk

That's right! I think a tutorial that shows how to add a CUDA kernel using the CUDA array interface without necessarily having to build a PyTorch extension would be great /cc @t-vi

lantiga avatar Mar 25 '24 14:03 lantiga

Yeah, if anyone has suggestions for a great CUDA kernel, I'll take them, or I'll ask the people on cuda mode...

t-vi avatar Mar 25 '24 14:03 t-vi

I made the demo for this week's cuda mode lecture with cuda-python, and it seemed to work well enough that I'll make it into a Thunder example.
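
Roughly, the flow with cuda-python looks like this (a compressed sketch with error checks omitted, following the pattern in NVIDIA's cuda-python NVRTC examples; kernel and variable names are made up for illustration):

```python
import numpy as np
import torch
from cuda import cuda, nvrtc  # pip install cuda-python

src = b"""
extern "C" __global__ void scale(float *x, float s, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}
"""

# Creating a tensor first makes torch initialize the CUDA context,
# which the driver-API calls below rely on
t = torch.ones(1024, device="cuda")

# Compile the kernel to PTX with NVRTC (err results ignored for brevity)
err, prog = nvrtc.nvrtcCreateProgram(src, b"scale.cu", 0, [], [])
err, = nvrtc.nvrtcCompileProgram(prog, 0, [])
err, size = nvrtc.nvrtcGetPTXSize(prog)
ptx = b" " * size
err, = nvrtc.nvrtcGetPTX(prog, ptx)

err, module = cuda.cuModuleLoadData(ptx)
err, kernel = cuda.cuModuleGetFunction(module, b"scale")

# Kernel arguments: the tensor's raw device pointer plus two scalars,
# packed as an array of pointers to the argument values
ptr = np.array([t.data_ptr()], dtype=np.uint64)
s = np.array([2.0], dtype=np.float32)
n = np.array([t.numel()], dtype=np.uint64)
args = np.array([ptr.ctypes.data, s.ctypes.data, n.ctypes.data], dtype=np.uint64)

threads = 128
blocks = (t.numel() + threads - 1) // threads
err, = cuda.cuLaunchKernel(kernel, blocks, 1, 1, threads, 1, 1, 0, 0, args.ctypes.data, 0)
torch.cuda.synchronize()
assert bool((t == 2.0).all())
```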

t-vi avatar Apr 01 '24 10:04 t-vi

Are you talking about the Flash Attention lecture (haven't seen it yet)? If so, I think it would be a cool (and somewhat flashy) example.

Andrei-Aksionov avatar Apr 01 '24 11:04 Andrei-Aksionov