Tim Moon
PyTorch FSDP gathers the module parameters before each forward and backward pass so that module implementations can access them as usual. I wonder if your framework could use a similar...
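To illustrate the pattern I mean, here's a minimal, hedged sketch (not taken from the linked discussion; it assumes a `torch.distributed` process group has already been initialized). FSDP all-gathers the sharded parameters right before each forward/backward, so the module body reads its own attributes normally:

```python
# Minimal sketch of the FSDP pattern: parameters are sharded across ranks,
# but the wrapped module's forward() sees them as ordinary tensors because
# FSDP all-gathers them just before the call and reshards them afterwards.
# Assumes torch.distributed has been initialized (e.g. via torchrun).
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

class MLP(nn.Module):
    def __init__(self, hidden: int = 1024):
        super().__init__()
        self.fc1 = nn.Linear(hidden, 4 * hidden)
        self.fc2 = nn.Linear(4 * hidden, hidden)

    def forward(self, x):
        # Inside forward, self.fc1.weight is a full (unsharded) tensor;
        # no special handling is needed in the module implementation.
        return self.fc2(torch.relu(self.fc1(x)))

model = FSDP(MLP().cuda())                          # parameters sharded across ranks
out = model(torch.randn(8, 1024, device="cuda"))    # gather -> forward -> reshard
out.sum().backward()                                # gathered again for backward
```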
/te-ci pytorch
We are working on a tutorial for inference with Gemma: https://github.com/NVIDIA/TransformerEngine/blob/5cb8ed4d129245357363361947e5b1d31c543783/docs/examples/te_gemma/tutorial_generation_gemma_with_te.ipynb. We're still tweaking it, so we'd appreciate any feedback at https://github.com/NVIDIA/TransformerEngine/pull/829.
I see [`CUDA::nvToolsExt`](https://cmake.org/cmake/help/latest/module/FindCUDAToolkit.html#nvtoolsext) is deprecated as of CMake 3.25, but I don't see any indication that it has been removed. You're building with CMake 3.29, but I also build...
- CMake is unable to find a C++ compiler in the usual places (e.g. `/usr/bin/c++`). Try setting `CXX` in the environment to the path of your compiler (we usually build...
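For example, a hedged sketch of pointing the build at a specific compiler via `CXX` (the compiler path is illustrative, and it only takes effect if the build is launched from the same process):

```python
# Hedged sketch: tell CMake which C++ compiler to use by exporting CXX before
# launching the build from this process. The fallback path is illustrative.
import os
import shutil

cxx = shutil.which("g++") or "/usr/bin/c++"  # use whatever compiler is installed
os.environ["CXX"] = cxx                      # CMake reads CXX at configure time
```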
@s-smits Thanks for bringing this up. We bumped the minimum CUDA version in TE 1.10 (see https://github.com/NVIDIA/TransformerEngine/pull/1103). I've updated my previous comment.
/te-ci pytorch L0 L1
We use Ninja to parallelize the build process and I suspect it's overwhelming your system resources. Can you try running with `MAX_JOBS=1` in your environment?
When building on a system with limited resources, we now recommend setting `MAX_JOBS=1` and `NVTE_BUILD_THREADS_PER_JOB=1` in the environment. This will of course drastically slow down the build. Setting `NVTE_CUDA_ARCHS` to your...
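For reference, a minimal sketch of a low-resource source build with those variables set (the pip target and flags below are illustrative, not the official install command):

```python
# Hedged sketch: build Transformer Engine from source with minimal parallelism.
# MAX_JOBS and NVTE_BUILD_THREADS_PER_JOB are the variables mentioned above;
# the pip target and flags are illustrative.
import os
import subprocess

env = dict(os.environ)
env["MAX_JOBS"] = "1"                    # one build job at a time
env["NVTE_BUILD_THREADS_PER_JOB"] = "1"  # one compiler thread per job

subprocess.run(
    ["pip", "install", "-v", "--no-build-isolation",
     "git+https://github.com/NVIDIA/TransformerEngine.git@main"],
    env=env,
    check=True,
)
```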
As mentioned by @ptrendx, we'll need to include these tests in one of the QA scripts (see [`qa`](https://github.com/NVIDIA/TransformerEngine/tree/main/qa)) so that they run in the CI pipelines. [`L1_pytorch_distributed_unittest`](https://github.com/NVIDIA/TransformerEngine/tree/main/qa/L1_pytorch_distributed_unittest) is simplest,...