Kaixi Hou

Results: 15 issues by Kaixi Hou

After more extensive testing, we think we need to revert the PRs related to the NHWC+TF32 changes: https://github.com/tensorflow/tensorflow/pull/55761 https://github.com/tensorflow/tensorflow/pull/55806 https://github.com/tensorflow/tensorflow/pull/55920 During the tests, we noticed that in some cases the...

Labels: awaiting review, ready to pull, size:M

This PR enables the cuDNN matmul fusion backend to support generic matmul fusion patterns, focusing on the matmul+bias+tanh|sigmoid pattern. It is on top of these...
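As a rough illustration (not the PR's code), the kind of unfused subgraph this fusion targets looks like the following; tensor shapes are illustrative:

```python
import tensorflow as tf

x = tf.random.normal([8, 64])
w = tf.random.normal([64, 32])
b = tf.random.normal([32])

# Unfused pattern that a remapper can rewrite into a single fused
# cuDNN matmul call: MatMul -> BiasAdd -> Tanh (Sigmoid is analogous).
y = tf.math.tanh(tf.nn.bias_add(tf.linalg.matmul(x, w), b))
```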

Labels: awaiting review, size:XL, comp:core

This PR enables the cuDNN matmul fusion backend to support generic matmul fusion patterns, focusing on the matmul+bias+gelu_exact pattern. (Note that matmul+bias+gelu_approximate is already supported...
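For reference, the distinction between the two GeLU flavors mentioned here, as a standalone sketch (standard formulas, not the PR's implementation):

```python
import math
import tensorflow as tf

def gelu_exact(x):
    # Exact GeLU: 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + tf.math.erf(x / math.sqrt(2.0)))

def gelu_approximate(x):
    # Tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + tf.math.tanh(
        math.sqrt(2.0 / math.pi) * (x + 0.044715 * tf.pow(x, 3))))
```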

Labels: awaiting review, size:L, comp:core

This PR enables the cuDNN matmul fusion backend to support generic matmul fusion patterns, focusing on the matmul+bias+gelu_exact pattern. (Note that matmul+bias+gelu_approximate is already supported...

Labels: awaiting review, size:L

This PR adds support for the Conv+Bias+Relu6/Elu/LeakyRelu fusion patterns on GPUs. This is realized by using the cuDNN graph API, which can utilize runtime-compiled kernels on Ampere...
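A minimal sketch of the unfused pattern involved (shapes are illustrative, not from the PR):

```python
import tensorflow as tf

x = tf.random.normal([1, 56, 56, 64])   # NHWC input
f = tf.random.normal([3, 3, 64, 128])   # HWIO filter
b = tf.random.normal([128])

# Unfused pattern: Conv2D -> BiasAdd -> Relu6 (Elu/LeakyRelu analogous).
y = tf.nn.relu6(tf.nn.bias_add(
    tf.nn.conv2d(x, f, strides=1, padding="SAME"), b))
```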

Labels: awaiting review, ready to pull, size:L

This pull request introduces a custom data-type rule for the FP8 parameters to implement custom gradient accumulation. Specifically, when the FP8 parameters are reused, autograd accumulates their gradients...
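To illustrate the default accumulation behavior being worked around (a minimal sketch, not the PR's rule): when a parameter is consumed at more than one site, JAX's autograd sums the per-site gradient contributions.

```python
import jax
import jax.numpy as jnp

def loss(w, x):
    # w is consumed at two sites; by default, autograd adds the
    # two gradient contributions together.
    return jnp.sum(x @ w) + jnp.sum((2.0 * x) @ w)

w = jnp.ones((4, 3))
x = jnp.ones((2, 4))
g = jax.grad(loss)(w, x)  # the sum of the per-site gradients
```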

This PR allows users to enable cuDNN flash attention. It depends on https://github.com/google/praxis/pull/53. In preliminary results for GPT3-5B, we observe a ~30% perf improvement on...

Attention plays a crucial role in modern transformer-based models. While many variants exist, they generally follow the same workflow. Examples include the typical multi-head attention (MHA), grouped-query attention...
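The shared workflow these variants follow, as a minimal scaled dot-product sketch (shapes and names are illustrative):

```python
import jax.numpy as jnp
from jax.nn import softmax

def attention(q, k, v):
    # q, k, v: [batch, seq, heads, head_dim]
    scores = jnp.einsum("bqhd,bkhd->bhqk", q, k) / jnp.sqrt(q.shape[-1])
    probs = softmax(scores, axis=-1)
    return jnp.einsum("bhqk,bkhd->bqhd", probs, v)
```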

This PR renames the original `fm32` to `fp32_max_grad` to express the idea that this dtype is used for storing fp32 values while using max for gradient accumulation. cc @nouiz
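A toy contrast of the two accumulation semantics (illustrative values, not the PR's code):

```python
import jax.numpy as jnp

# Two reuse sites each propose an fp32 statistic (e.g., an amax used
# for fp8 scaling).
g_a = jnp.array([0.5, 2.0])
g_b = jnp.array([1.5, 0.1])

g_sum = g_a + g_b              # default autograd accumulation
g_max = jnp.maximum(g_a, g_b)  # fp32_max_grad-style accumulation
```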

For the current fp8 gemm, we set c_scale to one, though it is effectively never used. Newer cublasLt, however, has a stricter requirement that c_scale can be set only when...
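For context, a sketch of the scaled-gemm semantics involved (scale names follow the usual cublasLt FP8 convention; this is an assumption-laden illustration, not the PR's code):

```python
import jax.numpy as jnp

# Scaled fp8 gemm semantics:
#   D = alpha * (a_scale * A) @ (b_scale * B) + beta * (c_scale * C)
# With beta == 0, the C operand (and hence c_scale) is never consumed,
# which is why passing a dummy c_scale of one used to be harmless.
A = jnp.ones((4, 8))
B = jnp.ones((8, 2))
alpha, beta, a_scale, b_scale = 1.0, 0.0, 1.0, 1.0
D = alpha * (a_scale * A) @ (b_scale * B)  # beta == 0: no C term
```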