Multi-head attention?
Now that we're seeing models with attention layers in more than one downstream package (e.g. FluxML/Metalhead.jl#105 and Transformers.jl), it may be time to consider pulling some building blocks into NNlib. CUDA.jl already wraps cuDNN's MHA too, see https://github.com/JuliaGPU/CUDA.jl/blob/27c87a6f261aa7964d797e8fe4bf33b46c1a185e/test/cudnn/multiheadattn.jl#L55.
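For concreteness, here's a minimal sketch of what such a building block could look like using only ops NNlib already has (`softmax`, `batched_mul`, `batched_transpose`); the name `dot_product_attention` and the (features, length, heads × batch) layout are my assumptions, not a settled API:

```julia
using NNlib: softmax, batched_mul, batched_transpose

# Sketch of a scaled dot-product attention primitive. Heads are assumed
# to be folded into the trailing batch dimension, so q, k, v each have
# shape (d_head, seq_len, nheads * batch).
function dot_product_attention(q, k, v)
    dh = size(q, 1)
    # (Lk, Lq, nheads * batch) logits, scaled by sqrt(d_head)
    scores = batched_mul(batched_transpose(k), q) ./ sqrt(eltype(q)(dh))
    α = softmax(scores; dims = 1)  # normalise over the key axis
    return batched_mul(v, α)       # (d_head, Lq, nheads * batch)
end
```

The projection/reshape plumbing around this (plus masking, dropout, etc.) is where the real design questions live; the core is just these three ops.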
Another case, like recurrence, where the cuDNN API doesn't match the cute semantics we use to define the layer. One thing the Metalhead PR made me consider was adding the ability to thread Parallel when the branches are expensive and roughly equal in cost.
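As a strawman, the forward pass could be as small as this; `threaded_parallel` is hypothetical, not anything in Flux today, and it only pays off with `julia -t N` and branches of comparable cost:

```julia
using Flux

# Hypothetical threaded forward pass for a Parallel-style layer:
# one task per branch, results joined by the connection.
function threaded_parallel(connection, layers, x)
    tasks = [Threads.@spawn layer(x) for layer in layers]
    return connection(map(fetch, tasks)...)
end

# Usage: two similarly sized branches combined by +
m1, m2 = Dense(3, 4), Dense(3, 4)
y = threaded_parallel(+, (m1, m2), randn(Float32, 3, 8))
```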
That's an interesting idea. I think you could extend it to using separate CUDA streams for GPU tasks too. Making AD cooperate would be its own challenge, of course :sweat_smile:
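Rough sketch of what I mean, if I understand CUDA.jl's task-based model correctly (each Julia task gets its own stream, so spawned branches can overlap on the device); `streamed_parallel` is hypothetical and, as noted, AD would need separate work:

```julia
using CUDA

# Sketch: spawn one task per branch; CUDA.jl should run each task on
# its own stream, letting the branches' kernels overlap on the GPU.
function streamed_parallel(connection, layers, x::CuArray)
    tasks = map(layers) do layer
        Threads.@spawn begin
            y = layer(x)
            CUDA.synchronize()  # sync this task's stream before handoff
            y
        end
    end
    return connection(map(fetch, tasks)...)
end
```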