Tim Moon

Results: 227 comments by Tim Moon

It's odd that it didn't fail when searching for cuBLAS: https://github.com/NVIDIA/TransformerEngine/blob/115a27ef2b7d206f8fc6634cfdec692913578ffc/transformer_engine/CMakeLists.txt#L22 Also, the cuBLAS pip wheel is intended for runtime use and doesn't include developer tools (https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#pip-wheels). Building TE...
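
As a quick way to check, here is a minimal diagnostic sketch; the paths below are typical defaults, not guaranteed for your system:

```python
# Check whether a full CUDA Toolkit (with headers and nvcc) is visible, as
# opposed to the runtime-only pip wheels. CUDA_HOME/CUDA_PATH and
# /usr/local/cuda are assumed conventions, not guarantees.
import os
import shutil

cuda_home = os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH") or "/usr/local/cuda"
nvcc = shutil.which("nvcc") or os.path.join(cuda_home, "bin", "nvcc")
cublas_header = os.path.join(cuda_home, "include", "cublas_v2.h")

print("nvcc found:", os.path.exists(nvcc))
print("cuBLAS header found:", os.path.exists(cublas_header))  # absent without the full toolkit
```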

In the current scheme, Transformer Engine modules store their parameters as standard tensors in standard dtypes (FP32/BF16/FP16). Optimizers typically require higher precision than FP8 to achieve good learning behavior. I don't see...
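
For illustration, a minimal sketch (assumes a CUDA-enabled TE install and a GPU; the sizes are arbitrary):

```python
# TE layers keep their parameters in a standard dtype, so a stock PyTorch
# optimizer sees full-precision master weights; FP8 is only used internally.
import torch
import transformer_engine.pytorch as te

layer = te.Linear(1024, 1024)
print(layer.weight.dtype)  # torch.float32 by default
optimizer = torch.optim.Adam(layer.parameters(), lr=1e-4)
```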

Yep, I used [PyTorch FSDP](https://pytorch.org/docs/stable/fsdp.html) with TE FP8. Be advised I haven't done full convergence experiments, just some basic sanity checking.
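
Roughly what I ran, sketched below (assumes `torch.distributed` is already initialized with a CUDA backend and FP8-capable hardware; shapes and sizes are placeholders):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
import transformer_engine.pytorch as te

# Shard a TE transformer layer with FSDP, then run the forward pass under FP8.
model = te.TransformerLayer(hidden_size=1024, ffn_hidden_size=4096, num_attention_heads=16).cuda()
model = FSDP(model)

x = torch.randn(128, 8, 1024, device="cuda")  # (sequence, batch, hidden)
with te.fp8_autocast(enabled=True):
    out = model(x)
```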

Transformer Engine manages FP8 casting internally (see [`transformer_engine.pytorch.fp8_autocast`](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/api/pytorch.html?highlight=autocast#transformer_engine.pytorch.fp8_autocast)) and it can run into problems when combined with other mixed precision tools like [`torch.autocast`](https://pytorch.org/docs/stable/amp.html#torch.autocast) or [`torch.distributed.fsdp.MixedPrecision`](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.MixedPrecision). For the moment, FSDP mixed...
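
For reference, the intended pattern is to let `fp8_autocast` handle the casting on its own, with no `torch.autocast` wrapped around it (the recipe values below are illustrative):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)
layer = te.Linear(1024, 1024).cuda()

x = torch.randn(32, 1024, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):  # no torch.autocast around this
    y = layer(x)
```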

The process group for FP8 amax reductions (`fp8_group`) should be the combination of the data-parallel and tensor-parallel groups, which is the world group in your use-case. This is because the...
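
Concretely, a sketch for the simplest case, where data-parallel × tensor-parallel spans all ranks (assumes `torch.distributed` is initialized):

```python
import torch.distributed as dist
import transformer_engine.pytorch as te

# Reduce amax statistics over every rank, i.e. the union of the data-parallel
# and tensor-parallel groups. Here that union is simply the world group.
fp8_group = dist.new_group(ranks=list(range(dist.get_world_size())))
with te.fp8_autocast(enabled=True, fp8_group=fp8_group):
    ...  # forward passes of TE modules
```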

Flash Attention is being rapidly developed and its API is somewhat unstable. We've found it safer to bump the version constraint only after validating that Flash Attention works as expected....

We currently pin the cuDNN front-end to the 1.0.3 release. I don't expect to see much benefit from updating to the bleeding edge since it is mostly just a wrapper...

I haven't tried running on WSL, although I see in [this guide](https://docs.nvidia.com/cuda/wsl-user-guide/index.html#cuda-support-for-wsl-2) that there are some traps related to `libcuda.so`. My hunch is that cuDNN can't find the right `libcuda.so`...
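
If it helps with debugging, here is a small sketch to check which driver library the loader resolves; what it prints on a healthy WSL setup is an assumption on my part:

```python
import ctypes
import ctypes.util

# On WSL 2 the driver library is expected to come from /usr/lib/wsl/lib; a None
# result or an OSError here suggests the loader can't see a usable libcuda.so.
name = ctypes.util.find_library("cuda")
print("resolved libcuda:", name)
lib = ctypes.CDLL(name or "libcuda.so.1")
```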

@mahdip72 It looks like CMake is having trouble finding your C++ compiler and your CUDA installation. Can you try setting the `CXX` and `CUDA_PATH` environment variables (a sketch follows below)? @markusheimerl The best way to...
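
@mahdip72 For example, something along these lines; the paths are placeholders for your system, and the pip invocation is just one way to trigger the build:

```python
import os
import subprocess
import sys

# Point the TE build at an explicit host compiler and CUDA Toolkit.
os.environ["CXX"] = "/usr/bin/g++"           # placeholder compiler path
os.environ["CUDA_PATH"] = "/usr/local/cuda"  # placeholder CUDA install
subprocess.run(
    [sys.executable, "-m", "pip", "install", "--no-build-isolation", "transformer_engine[pytorch]"],
    check=True,
)
```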

Please try these suggestions: https://github.com/NVIDIA/TransformerEngine/issues/355#issuecomment-2394353816 It may also be worth considering an [NGC PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch), which includes TE.