Tim Moon
PyTorch FSDP gathers the module parameters before each forward and backward pass so that module implementations can access them as usual. I wonder if your framework could use a similar...
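To illustrate the pattern I mean, here's a minimal, hedged sketch (not taken from the linked discussion; it assumes a `torch.distributed` process group has already been initialized). FSDP all-gathers the sharded parameters right before each forward/backward, so the module body reads its own attributes normally:

```python
# Minimal sketch of the FSDP pattern: parameters are sharded across ranks,
# but the wrapped module's forward() sees them as ordinary tensors because
# FSDP all-gathers them just before the call and reshards them afterwards.
# Assumes torch.distributed has been initialized (e.g. via torchrun).
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

class MLP(nn.Module):
    def __init__(self, hidden: int = 1024):
        super().__init__()
        self.fc1 = nn.Linear(hidden, 4 * hidden)
        self.fc2 = nn.Linear(4 * hidden, hidden)

    def forward(self, x):
        # Inside forward, self.fc1.weight is a full (unsharded) tensor;
        # no special handling is needed in the module implementation.
        return self.fc2(torch.relu(self.fc1(x)))

model = FSDP(MLP().cuda())                          # parameters sharded across ranks
out = model(torch.randn(8, 1024, device="cuda"))    # gather -> forward -> reshard
out.sum().backward()                                # gathered again for backward
```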
/te-ci pytorch
We are working on a tutorial for inference with Gemma: https://github.com/NVIDIA/TransformerEngine/blob/5cb8ed4d129245357363361947e5b1d31c543783/docs/examples/te_gemma/tutorial_generation_gemma_with_te.ipynb. We're still tweaking it, so we'd appreciate any feedback at https://github.com/NVIDIA/TransformerEngine/pull/829.
I see [`CUDA::nvToolsExt`](https://cmake.org/cmake/help/latest/module/FindCUDAToolkit.html#nvtoolsext) is deprecated as of CMake 3.25, but I don't see any indication that it has been removed. You're building with CMake 3.29, but I also build...
- CMake is unable to find a C++ compiler in the usual places (e.g. `/usr/bin/c++`). Try setting `CXX` in the environment to the path of your compiler (we usually build...
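For example, a hedged sketch of pointing the build at a specific compiler via `CXX` (the compiler path is illustrative, and it only takes effect if the build is launched from the same process):

```python
# Hedged sketch: tell CMake which C++ compiler to use by exporting CXX before
# launching the build from this process. The fallback path is illustrative.
import os
import shutil

cxx = shutil.which("g++") or "/usr/bin/c++"  # use whatever compiler is installed
os.environ["CXX"] = cxx                      # CMake reads CXX at configure time
```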
@s-smits Thanks for bringing this up. We bumped the minimum CUDA version in TE 1.10 (see https://github.com/NVIDIA/TransformerEngine/pull/1103). I've updated my previous comment.
/te-ci pytorch L0 L1
We use Ninja to parallelize the build process and I suspect it's overwhelming your system resources. Can you try running with `MAX_JOBS=1` in your environment?
When building on a system with limited resources, we now recommend setting `MAX_JOBS=1` and `NVTE_BUILD_THREADS_PER_JOB=1` in the environment. This will of course drastically slow down the build. Setting `NVTE_CUDA_ARCHS` to your...
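For reference, a minimal sketch of a low-resource source build with those variables set (the pip target and flags below are illustrative, not the official install command):

```python
# Hedged sketch: build Transformer Engine from source with minimal parallelism.
# MAX_JOBS and NVTE_BUILD_THREADS_PER_JOB are the variables mentioned above;
# the pip target and flags are illustrative.
import os
import subprocess

env = dict(os.environ)
env["MAX_JOBS"] = "1"                    # one build job at a time
env["NVTE_BUILD_THREADS_PER_JOB"] = "1"  # one compiler thread per job

subprocess.run(
    ["pip", "install", "-v", "--no-build-isolation",
     "git+https://github.com/NVIDIA/TransformerEngine.git@main"],
    env=env,
    check=True,
)
```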
As mentioned by @ptrendx, we'll need to include these tests in one of the QA scripts (see [`qa`](https://github.com/NVIDIA/TransformerEngine/tree/main/qa)) so that they run in the CI pipelines. [`L1_pytorch_distributed_unittest`](https://github.com/NVIDIA/TransformerEngine/tree/main/qa/L1_pytorch_distributed_unittest) is simplest,...