Get Stuck at Building Wheel
Hi, has anyone else faced the problem of the installation getting stuck at building the wheel?
Can you share more information on your configuration, especially which DL framework you're building for? Passing the `--verbose` flag to `pip install` would also provide more useful build logs. A hang makes me suspect your system is over-parallelizing the build process:
- If the hang happens while building Flash Attention or transformer_engine_torch, then it's a failure while building a PyTorch extension. Try setting `MAX_JOBS=1` in the environment (see this note). Note that building Flash Attention is especially resource-intensive and can experience problems even on relatively powerful systems.
- If the hang happens in CMake, then it's a failure in a Ninja build. We currently don't have a nice way to reduce the number of parallel Ninja jobs, but it is something we should prioritize if it is causing a problem (pinging @phu0ngng). You could try setting `CMAKE_BUILD_PARALLEL_LEVEL=1` in the environment.
With https://github.com/NVIDIA/TransformerEngine/pull/987, you can control the number of parallel build jobs with the `MAX_JOBS` environment variable.
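For reference, a build script typically consumes such a variable along these lines. This is a minimal sketch, not the actual code from the PR; the `MAX_JOBS` name matches the convention above, but the CPU-count fallback is an assumption:

```python
import os

def parallel_jobs() -> int:
    """Number of parallel compile jobs, capped by MAX_JOBS if set.

    Falls back to the machine's CPU count when MAX_JOBS is unset
    or not a valid integer (fallback behavior is an assumption).
    """
    default = os.cpu_count() or 1
    try:
        return max(1, int(os.environ["MAX_JOBS"]))
    except (KeyError, ValueError):
        return default

# Capping to a single job is the usual first step when a build hangs
# or exhausts memory.
os.environ["MAX_JOBS"] = "1"
print(parallel_jobs())  # 1
```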
Same problem.
Specifically, it gets stuck at `Running command /usr/lib/cmake-3.22.6-linux-x86_64/bin/cmake --build /opt/tiger/TransformerEngine/build/cmake --parallel 1`
Hm, I'd expect most systems could handle building with MAX_JOBS=1. I wonder if we could get more clues if you build with verbose output (pip install -v -v .).
I have a similar problem. With MAX_JOBS=1 it gets stuck after 6/24, and otherwise it gets stuck after 8/24, building transpose_fusion.cu.o. My whole computer freezes and I have to reboot manually. I'm using CUDA 12.5 with an RTX 3060.
I also tried to limit the number of threads with `export MAKEFLAGS="-j2"`, but without success.
```
CMake Warning: Manually-specified variables were not used by the project:
    pybind11_DIR

-- Build files have been written to: /home/adrlfv/Téléchargements/TransformerEngine/build/cmake
Running command /usr/bin/cmake --build /home/adrlfv/Téléchargements/TransformerEngine/build/cmake
[1/32] Building CXX object CMakeFiles/transformer_engine.dir/transformer_engine.cpp.o
[2/32] Building CUDA object CMakeFiles/transformer_engine.dir/gemm/cublaslt_gemm.cu.o
[3/32] Building CXX object CMakeFiles/transformer_engine.dir/layer_norm/ln_api.cpp.o
[4/32] Building CUDA object CMakeFiles/transformer_engine.dir/transpose/transpose.cu.o
[5/32] Building CUDA object CMakeFiles/transformer_engine.dir/fused_attn/fused_attn.cpp.o
[6/32] Building CXX object CMakeFiles/transformer_engine.dir/rmsnorm/rmsnorm_api.cpp.o
[7/32] Building CUDA object CMakeFiles/transformer_engine.dir/transpose/cast_transpose.cu.o
[8/32] Building CUDA object CMakeFiles/transformer_engine.dir/transpose/transpose_fusion.cu.o
```
To give people an idea: the default build of flash attention by itself, on a 32-core/64-thread Threadripper Pro 5975WX with 512 GB of RAM, using older versions of the makefile that specified NVCC_THREADS=4, peaks at 260 GB of RAM use and takes something like 6 minutes. If you don't have that much RAM, it'll most likely thrash too much to build in a reasonable time. The current default build should use around 128 GB of RAM, which is more than most motherboards support. It'll be un-buildable on this machine as soon as I install CUDA 12.8 and it defaults to building 4 different architectures (none of which apply to me).

Most of the built files complete quickly, but the terminal output will appear completely frozen until the very long-running one is done, and some of the memory might not be freed until it's done, which will starve other short-running jobs if you have little RAM.
By default, even if MAX_JOBS is set, the flash attention build will pass --nvcc_threads=2 to the toolchain, which in practice seems to double the amount of memory used, since it'll try to build multiple architectures in parallel unless you've turned them off. Setting the environment variable NVCC_THREADS=1 fixes that. You should still expect high RAM usage, but on a reasonably high-RAM machine (less extreme than mine) this should let you set the arch as below and skip the MAX_JOBS line. In my experience it needs roughly (core count) × 1 GB × (the lesser of NVCC_THREADS or the number of target architectures) of free memory to build, but sometimes NVCC_THREADS=2 will still eat memory even when there's only one arch; it's only supposed to thread across architectures, and there's really only one, so I suspect some subprocess isn't freeing memory.
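The rule of thumb above can be written out as a small calculation. The constants are this commenter's observations, not official figures, so treat the result as a rough floor:

```python
def flash_attn_build_ram_gb(cores: int, nvcc_threads: int, num_archs: int) -> int:
    """Rough peak-RAM estimate (GB) for a flash attention source build:
    core count * 1 GB * min(NVCC_THREADS, number of target architectures).

    Based on one user's observations above, not an official formula.
    """
    gb_per_core = 1
    return cores * gb_per_core * min(nvcc_threads, num_archs)

# 32 cores, NVCC_THREADS=2, four target architectures -> 64 GB estimate.
print(flash_attn_build_ram_gb(32, 2, 4))  # 64
# Same machine with NVCC_THREADS=1 and a single arch -> 32 GB estimate.
print(flash_attn_build_ram_gb(32, 1, 1))  # 32
```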
Unfortunately the selection of architectures to build is hardcoded to the server variants starting at A100; most mere humans can't afford any of the server modules and never will be able to without scamming some serious grant money out of somebody. For that matter, most people can't afford half of the Ada or Blackwell "consumer" lineups, but that's another story. Building PTX for 80 and 90 (and 100 and 120 after installing CUDA 12.8) is pointless. If you're on Ampere, you can cut down the flash attention setup.py lines at around line 178 to:
```python
cc_flag.append("-arch")
cc_flag.append("sm_86")
cc_flag.append("-gencode")
cc_flag.append("arch=compute_86,code=sm_86")
```
I'm not sure if the "-arch" flag and the next line are actually necessary since the build file doesn't seem to set it, and online info suggests that a PTX arch should additionally be set with
```python
cc_flag.append("-gencode")
cc_flag.append("arch=compute_86,code=compute_86")
```
But I don't know how necessary this is if you're only running binary code on a single arch.
With Ada it's:

```python
cc_flag.append("-arch")
cc_flag.append("sm_89")
cc_flag.append("-gencode")
cc_flag.append("arch=compute_89,code=sm_89")
```
The environment variable FLASH_ATTN_CUDA_ARCHS that's supposed to set this won't work. For some reason it's just used as a hardcoded check inside the series of if clauses that look for the CUDA version and 80, 90, 100, and 120. Setting it to just 80 isn't the end of the world for consumer Ampere, but setting it to 90 won't enable anything new that Ada and Hopper both support, since the versions aren't backwards compatible.
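As I read the behavior described above, the env variable can only filter the hardcoded list, never add a new arch. Roughly this pattern, as a paraphrase; this is not the actual setup.py code, and the names and the semicolon separator are assumptions:

```python
import os

# The hardcoded server arch list described above (post-CUDA-12.8).
SUPPORTED_ARCHS = ["80", "90", "100", "120"]

def selected_archs() -> list[str]:
    """Paraphrase of the described selection logic: FLASH_ATTN_CUDA_ARCHS
    is only checked for membership inside per-arch if clauses, so a
    consumer arch like 86 or 89 can never be added through it."""
    requested = os.environ.get("FLASH_ATTN_CUDA_ARCHS", "")
    archs = []
    for arch in SUPPORTED_ARCHS:
        # Each arch has its own hardcoded clause; the env var can only
        # veto an arch already in the list.
        if not requested or arch in requested.split(";"):
            archs.append(arch)
    return archs

os.environ["FLASH_ATTN_CUDA_ARCHS"] = "86"
print(selected_archs())  # [] -- sm_86 never matches the hardcoded list
```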
I haven't looked much into what the transformer engine build system might be doing that chews up memory or what variables need to be set since it doesn't build on Windows.
A simpler approach:
```shell
MAX_JOBS=12 \
NVTE_FRAMEWORK=pytorch \
NVTE_CUDA_ARCHS=120 \
python3 setup.py bdist_wheel --dist-dir=/opt/transformer_engine/wheels

pip3 install --no-cache-dir --verbose /opt/transformer_engine/wheels/transformer_engine*.whl
```
@johnnynunez NVTE_CUDA_ARCHS must be 120 instead of 12.0.
@ksivaman yeah, I wrote quickly from my phone sorry