Tim Moon
It looks like PyTorch's C++ extensions pick up cuDNN from `CUDNN_HOME` or `CUDNN_PATH` (https://github.com/pytorch/pytorch/blob/5a80d2df844f9794b3b7ad91eddc7ba762760ad0/torch/utils/cpp_extension.py#L209), while PyTorch's own build is configured with `CUDNN_ROOT` (https://github.com/pytorch/pytorch/blob/5a80d2df844f9794b3b7ad91eddc7ba762760ad0/cmake/Modules_CUDA_fix/FindCUDNN.cmake#L4). Setting `CUDNN_PATH` before installing should therefore work:
```bash
export CUDNN_PATH=/path/to/cudnn
pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
```
This bug should be fixed by https://github.com/NVIDIA/TransformerEngine/pull/1335, which is included in Transformer Engine 2.0.
We have used `torch.compile` to fuse some operations like bias+GeLU in `LayerNormMLP` (see [`bias_gelu_fused_`](https://github.com/NVIDIA/TransformerEngine/blob/b36bd0a458424eac939669ae05231726b3461b0d/transformer_engine/pytorch/jit.py#L60)). However, we have not yet done serious work applying `torch.compile` to FP8 kernels since we're not...
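For illustration, here is a minimal standalone sketch of that fusion pattern (the `bias_gelu` function below is a hypothetical example, not Transformer Engine's actual `bias_gelu_fused_`), assuming PyTorch 2.x and a CUDA device:

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of the bias+GeLU fusion idea: torch.compile can
# fuse the bias add and the GeLU activation into a single kernel,
# avoiding an extra round trip to global memory.
@torch.compile
def bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    return F.gelu(x + bias, approximate="tanh")

x = torch.randn(16, 4096, device="cuda")
bias = torch.randn(4096, device="cuda")
y = bias_gelu(x, bias)
```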
Matmuls are ideal for FP8 compute since they can take advantage of Tensor Cores and they're less sensitive to quantization error. While other operations might benefit (especially from reduced memory...
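For a concrete picture of what this looks like in practice, here is a minimal sketch using Transformer Engine's `fp8_autocast` (it assumes an FP8-capable GPU such as Hopper and Transformer Engine installed; the layer sizes are arbitrary):

```python
import torch
import transformer_engine.pytorch as te

# The Linear layer's GEMM runs in FP8 on Tensor Cores inside
# fp8_autocast; surrounding ops stay in higher precision.
linear = te.Linear(1024, 1024, bias=True, params_dtype=torch.bfloat16)
x = torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True):
    y = linear(x)  # inputs and weights are quantized to FP8 for the matmul
```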