Tim Moon

Results: 227 comments by Tim Moon

It's odd that it didn't fail when searching for cuBLAS: https://github.com/NVIDIA/TransformerEngine/blob/115a27ef2b7d206f8fc6634cfdec692913578ffc/transformer_engine/CMakeLists.txt#L22 Also, the cuBLAS pip wheel is intended for runtime use and doesn't include developer tools (https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#pip-wheels). Building TE...
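
As a quick way to check, here is a minimal diagnostic sketch; the paths below are typical defaults, not guaranteed for your system:

```python
# Check whether a full CUDA Toolkit (with headers and nvcc) is visible, as
# opposed to the runtime-only pip wheels. CUDA_HOME/CUDA_PATH and
# /usr/local/cuda are assumed conventions, not guarantees.
import os
import shutil

cuda_home = os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH") or "/usr/local/cuda"
nvcc = shutil.which("nvcc") or os.path.join(cuda_home, "bin", "nvcc")
cublas_header = os.path.join(cuda_home, "include", "cublas_v2.h")

print("nvcc found:", os.path.exists(nvcc))
print("cuBLAS header found:", os.path.exists(cublas_header))  # absent without the full toolkit
```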

In the current scheme, Transformer Engine modules store their parameters as standard tensors in standard dtypes (FP32/BF16/FP16). Optimizers typically require higher precision than FP8 to achieve good learning behavior. I don't see...
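
For illustration, a minimal sketch (assumes a CUDA-enabled TE install and a GPU; the sizes are arbitrary):

```python
# TE layers keep their parameters in a standard dtype, so a stock PyTorch
# optimizer sees full-precision master weights; FP8 is only used internally.
import torch
import transformer_engine.pytorch as te

layer = te.Linear(1024, 1024)
print(layer.weight.dtype)  # torch.float32 by default
optimizer = torch.optim.Adam(layer.parameters(), lr=1e-4)
```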

Yep, I used [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html) with TE FP8. Be advised I haven't done full convergence experiments, just some basic sanity checking.
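
Roughly what I ran, sketched below (assumes `torch.distributed` is already initialized with a CUDA backend and FP8-capable hardware; shapes and sizes are placeholders):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
import transformer_engine.pytorch as te

# Shard a TE transformer layer with FSDP, then run the forward pass under FP8.
model = te.TransformerLayer(hidden_size=1024, ffn_hidden_size=4096, num_attention_heads=16).cuda()
model = FSDP(model)

x = torch.randn(128, 8, 1024, device="cuda")  # (sequence, batch, hidden)
with te.fp8_autocast(enabled=True):
    out = model(x)
```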

Transformer Engine manages FP8 casting internally (see [`transformer_engine.pytorch.fp8_autocast`](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/api/pytorch.html?highlight=autocast#transformer_engine.pytorch.fp8_autocast)) and it can run into problems when combined with other mixed precision tools like [`torch.autocast`](https://pytorch.org/docs/stable/amp.html#torch.autocast) or [`torch.distributed.fsdp.MixedPrecision`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.MixedPrecision). For the moment, FSDP mixed...
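
For reference, the intended pattern is to let `fp8_autocast` handle the casting on its own, with no `torch.autocast` wrapped around it (the recipe values below are illustrative):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)
layer = te.Linear(1024, 1024).cuda()

x = torch.randn(32, 1024, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):  # no torch.autocast around this
    y = layer(x)
```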

The process group for FP8 amax reductions (`fp8_group`) should be the combination of the data-parallel and tensor-parallel groups, which is the world group in your use-case. This is because the...
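
Concretely, a sketch for the simplest case, where data-parallel × tensor-parallel spans all ranks (assumes `torch.distributed` is initialized):

```python
import torch.distributed as dist
import transformer_engine.pytorch as te

# Reduce amax statistics over every rank, i.e. the union of the data-parallel
# and tensor-parallel groups. Here that union is simply the world group.
fp8_group = dist.new_group(ranks=list(range(dist.get_world_size())))
with te.fp8_autocast(enabled=True, fp8_group=fp8_group):
    ...  # forward passes of TE modules
```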

Flash Attention is being rapidly developed and its API is somewhat unstable. We've found it safer to bump the version constraint only after validating that Flash Attention works as expected....

We currently pin the cuDNN front-end to the 1.0.3 release. I don't expect to see much benefit from updating to the bleeding edge since it is mostly just a wrapper...

I haven't tried running on WSL, although I see in [this guide](https://docs.nvidia.com/cuda/wsl-user-guide/index.html#cuda-support-for-wsl-2) that there are some traps related to `libcuda.so`. My hunch is that cuDNN can't find the right `libcuda.so`...
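
If it helps with debugging, here is a small sketch to check which driver library the loader resolves; what it prints on a healthy WSL setup is an assumption on my part:

```python
import ctypes
import ctypes.util

# On WSL 2 the driver library is expected to come from /usr/lib/wsl/lib; a None
# result or an OSError here suggests the loader can't see a usable libcuda.so.
name = ctypes.util.find_library("cuda")
print("resolved libcuda:", name)
lib = ctypes.CDLL(name or "libcuda.so.1")
```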

@mahdip72 It looks like CMake is having trouble finding your C++ compiler and your CUDA installation. Can you try setting the `CXX` and `CUDA_PATH` environment variables (a sketch follows below)? @markusheimerl The best way to...
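
@mahdip72 For example, something along these lines; the paths are placeholders for your system, and the pip invocation is just one way to trigger the build:

```python
import os
import subprocess
import sys

# Point the TE build at an explicit host compiler and CUDA Toolkit.
os.environ["CXX"] = "/usr/bin/g++"           # placeholder compiler path
os.environ["CUDA_PATH"] = "/usr/local/cuda"  # placeholder CUDA install
subprocess.run(
    [sys.executable, "-m", "pip", "install", "--no-build-isolation", "transformer_engine[pytorch]"],
    check=True,
)
```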

Please try these suggestions: https://github.com/NVIDIA/TransformerEngine/issues/355#issuecomment-2394353816 It may also be worth considering an [NGC PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch), which includes TE.