[NVIDIA TF] Part 1: Stream executor supports cudnn matmul fusion
This PR enables the cudnn matmul fusion backend to support generic matmul fusion patterns. Specifically, it focuses on the matmul+bias+gelu_exact pattern. (Note: the matmul+bias+gelu_approximate pattern is already supported by the cublasLt backend; see https://github.com/tensorflow/tensorflow/pull/55966.)
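For context, here is a minimal sketch of the unfused graph pattern this series targets, written with the public TF Python API (the tensor shapes and variable names are only illustrative; the actual fusion happens inside the stream executor and a later grappler pass, not in user code):

```python
import tensorflow as tf

# Illustrative only: the graph-level pattern that the later grappler pass
# would rewrite into a single fused op backed by the cudnn fusion added here.
x = tf.random.normal([8, 64])
w = tf.random.normal([64, 32])
b = tf.random.normal([32])

y = tf.matmul(x, w)        # matmul
y = tf.nn.bias_add(y, b)   # + bias

# gelu_exact uses the erf form: 0.5 * x * (1 + erf(x / sqrt(2))) -- the focus of this PR series.
y_exact = tf.nn.gelu(y, approximate=False)

# gelu_approximate uses the tanh form and is already covered by the cublasLt backend.
y_approx = tf.nn.gelu(y, approximate=True)
```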
Part 1: Stream executor supports cudnn matmul fusion. (This PR)
Part 2: Fused matmul op supports cudnn matmul fusion.
Part 3: Grappler graph pass supports matmul+bias+gelu_exact.
cc @nluehr @pjannaty
Can we also hyperlink the descendant PRs (i.e., Part 2: Fused matmul op supports cudnn matmul fusion), and do the same in the other PRs, for ease of navigation?
cc @benbarsdell
Rebased and marked as "Ready for review".
The rebase is finished. @ezhulenev and @reedwm, please review. Thx.
Since the stream executor is being moved out, I have had to rebase these PRs more frequently than before. The rebase is done. @ezhulenev and @reedwm, please review. Thx.