[NVIDIA TF] Part 1: Stream executor supports cudnn matmul fusion
This PR enables the cudnn matmul fusion backend to support generic matmul fusion patterns. Specifically, it focuses on the matmul+bias+gelu_exact pattern. (Note: the matmul+bias+gelu_approximate pattern is already supported by the cublasLt backend; see https://github.com/tensorflow/tensorflow/pull/55966.)
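For context, here is a minimal sketch of the unfused graph pattern this series targets, written with the public TF Python API (the tensor shapes and variable names are only illustrative; the actual fusion happens inside the stream executor and a later grappler pass, not in user code):

```python
import tensorflow as tf

# Illustrative only: the graph-level pattern that the later grappler pass
# would rewrite into a single fused op backed by the cudnn fusion added here.
x = tf.random.normal([8, 64])
w = tf.random.normal([64, 32])
b = tf.random.normal([32])

y = tf.matmul(x, w)        # matmul
y = tf.nn.bias_add(y, b)   # + bias

# gelu_exact uses the erf form: 0.5 * x * (1 + erf(x / sqrt(2))) -- the focus of this PR series.
y_exact = tf.nn.gelu(y, approximate=False)

# gelu_approximate uses the tanh form and is already covered by the cublasLt backend.
y_approx = tf.nn.gelu(y, approximate=True)
```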
Part 1: Stream executor supports cudnn matmul fusion. (This PR)
Part 2: Fused matmul op supports cudnn matmul fusion.
Part 3: Grappler graph pass supports matmul+bias+gelu_exact.
cc @nluehr @pjannaty
Can we also hyperlink the descendant PRs (i.e., Part 2: Fused matmul op supports cudnn matmul fusion), and do the same in the other PRs, for ease of navigation?
cc @benbarsdell
Rebased and marked as "Ready for review".
The rebase is finished. @ezhulenev and @reedwm, please review. Thx.
Since the stream executor is being moved out, I have had to rebase these PRs more frequently than before. The rebase is done. @ezhulenev and @reedwm, please review. Thx.