Multi-head attention?
Now that we're seeing models with attention layers in more than one downstream package (e.g. FluxML/Metalhead.jl#105 and Transformers.jl), it may be time to consider pulling some building blocks into NNlib. CUDA.jl already wraps cuDNN's MHA too, see https://github.com/JuliaGPU/CUDA.jl/blob/27c87a6f261aa7964d797e8fe4bf33b46c1a185e/test/cudnn/multiheadattn.jl#L55.
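For concreteness, here's a minimal sketch of what such a building block could look like using only ops NNlib already has (`softmax`, `batched_mul`, `batched_transpose`); the name `dot_product_attention` and the (features, length, heads × batch) layout are my assumptions, not a settled API:

```julia
using NNlib: softmax, batched_mul, batched_transpose

# Sketch of a scaled dot-product attention primitive. Heads are assumed
# to be folded into the trailing batch dimension, so q, k, v each have
# shape (d_head, seq_len, nheads * batch).
function dot_product_attention(q, k, v)
    dh = size(q, 1)
    # (Lk, Lq, nheads * batch) logits, scaled by sqrt(d_head)
    scores = batched_mul(batched_transpose(k), q) ./ sqrt(eltype(q)(dh))
    α = softmax(scores; dims = 1)  # normalise over the key axis
    return batched_mul(v, α)       # (d_head, Lq, nheads * batch)
end
```

The projection/reshape plumbing around this (plus masking, dropout, etc.) is where the real design questions live; the core is just these three ops.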
Another case, like recurrence, where the cuDNN API doesn't match the cute semantics we use to define the layer. One thing the Metalhead PR made me consider was adding the ability to thread Parallel when the branches are expensive and roughly equal in cost.
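As a strawman, the forward pass could be as small as this; `threaded_parallel` is hypothetical, not anything in Flux today, and it only pays off with `julia -t N` and branches of comparable cost:

```julia
using Flux

# Hypothetical threaded forward pass for a Parallel-style layer:
# one task per branch, results joined by the connection.
function threaded_parallel(connection, layers, x)
    tasks = [Threads.@spawn layer(x) for layer in layers]
    return connection(map(fetch, tasks)...)
end

# Usage: two similarly sized branches combined by +
m1, m2 = Dense(3, 4), Dense(3, 4)
y = threaded_parallel(+, (m1, m2), randn(Float32, 3, 8))
```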
That's an interesting idea. I think you could extend it to using separate CUDA streams for GPU tasks too. Making AD cooperate would be its own challenge, of course :sweat_smile:
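Rough sketch of what I mean, if I understand CUDA.jl's task-based model correctly (each Julia task gets its own stream, so spawned branches can overlap on the device); `streamed_parallel` is hypothetical and, as noted, AD would need separate work:

```julia
using CUDA

# Sketch: spawn one task per branch; CUDA.jl should run each task on
# its own stream, letting the branches' kernels overlap on the GPU.
function streamed_parallel(connection, layers, x::CuArray)
    tasks = map(layers) do layer
        Threads.@spawn begin
            y = layer(x)
            CUDA.synchronize()  # sync this task's stream before handoff
            y
        end
    end
    return connection(map(fetch, tasks)...)
end
```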