TransformerEngine
Installation failed with cmake error
Hi,
We are testing our new Hopper machines (H800/H100) and trying to use FP8 for training for the first time, but we are having trouble installing TransformerEngine. It reports RuntimeError: Error when running CMake: Command '['/usr/local/bin/cmake', '-S', '/tmp/pip-req-build-p6kjladj/transformer_engine', '-B', '/tmp/tmps08o01xi', '-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX=/tmp/pip-req-build-p6kjladj/build/lib.linux-x86_64-cpython-310', '-GNinja']' returned non-zero exit status 1.
We tried invoking the command outside of pip, and it just reports that there is no source directory.
We are trying Docker right now, but our network configuration makes Docker inconvenient to use, so we would usually prefer to avoid it. Could you show us where we could find any clues on how to proceed? Much appreciated.
Hi @RuiWang1998, could you share the command you use for installation and a full error message that you are getting? Thank you!
Hi @ptrendx, we used both pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable and pip install git+https://github.com/NVIDIA/TransformerEngine.git@main, and tried Python versions from 3.9 to 3.11. Each time we simply installed pytorch==2.0.1 and packaging and then ran the two commands. They both returned the same error.
Hi @ptrendx, after a little digging, we think we have located the problem, but we are not sure what the solution is:
/usr/bin/c++ -Dtransformer_engine_EXPORTS -I/home/rui/TransformerEngine/transformer_engine -I/home/rui/TransformerEngine/transformer_engine/common/include -I/usr/local/cuda-11.8/targets/x86_64-linux/include -I/home/rui/TransformerEngine/transformer_engine/../3rdparty/cudnn-frontend/include -I/tmp/tmp9cj2vyni/common/string_headers -isystem /usr/local/cuda-11.8/include -O3 -DNDEBUG -std=gnu++17 -fPIC -MD -MT common/CMakeFiles/transformer_engine.dir/fused_attn/fused_attn.cpp.o -MF common/CMakeFiles/transformer_engine.dir/fused_attn/fused_attn.cpp.o.d -o common/CMakeFiles/transformer_engine.dir/fused_attn/fused_attn.cpp.o -c /home/rui/TransformerEngine/transformer_engine/common/fused_attn/fused_attn.cpp
In file included from /usr/local/cuda-11.8/include/cuda_fp8.h:350,
from /home/rui/TransformerEngine/transformer_engine/common/fused_attn/../common.h:14,
from /home/rui/TransformerEngine/transformer_engine/common/fused_attn/fused_attn.cpp:8:
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e5m2::operator short unsigned int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:735:16: error: ‘__half2ushort_rz’ was not declared in this scope
735 | return __half2ushort_rz(__half(*this));
| ^~~~~~~~~~~~~~~~
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e5m2::operator unsigned int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:744:16: error: ‘__half2uint_rz’ was not declared in this scope
744 | return __half2uint_rz(__half(*this));
| ^~~~~~~~~~~~~~
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e5m2::operator long long unsigned int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:753:16: error: ‘__half2ull_rz’ was not declared in this scope; did you mean ‘__half2_raw’?
753 | return __half2ull_rz(__half(*this));
| ^~~~~~~~~~~~~
| __half2_raw
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e5m2::operator short int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:791:16: error: ‘__half2short_rz’ was not declared in this scope
791 | return __half2short_rz(__half(*this));
| ^~~~~~~~~~~~~~~
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e5m2::operator int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:800:16: error: ‘__half2int_rz’ was not declared in this scope; did you mean ‘__half2_raw’?
800 | return __half2int_rz(__half(*this));
| ^~~~~~~~~~~~~
| __half2_raw
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e5m2::operator long long int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:809:16: error: ‘__half2ll_rz’ was not declared in this scope; did you mean ‘__half2_raw’?
809 | return __half2ll_rz(__half(*this));
| ^~~~~~~~~~~~
| __half2_raw
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e4m3::operator short unsigned int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:1248:16: error: ‘__half2ushort_rz’ was not declared in this scope
1248 | return __half2ushort_rz(__half(*this));
| ^~~~~~~~~~~~~~~~
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e4m3::operator unsigned int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:1257:16: error: ‘__half2uint_rz’ was not declared in this scope
1257 | return __half2uint_rz(__half(*this));
| ^~~~~~~~~~~~~~
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e4m3::operator long long unsigned int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:1266:16: error: ‘__half2ull_rz’ was not declared in this scope; did you mean ‘__half2_raw’?
1266 | return __half2ull_rz(__half(*this));
| ^~~~~~~~~~~~~
| __half2_raw
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e4m3::operator short int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:1303:16: error: ‘__half2short_rz’ was not declared in this scope
1303 | return __half2short_rz(__half(*this));
| ^~~~~~~~~~~~~~~
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e4m3::operator int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:1311:16: error: ‘__half2int_rz’ was not declared in this scope; did you mean ‘__half2_raw’?
1311 | return __half2int_rz(__half(*this));
| ^~~~~~~~~~~~~
| __half2_raw
/usr/local/cuda-11.8/include/cuda_fp8.hpp: In member function ‘__nv_fp8_e4m3::operator long long int() const’:
/usr/local/cuda-11.8/include/cuda_fp8.hpp:1319:16: error: ‘__half2ll_rz’ was not declared in this scope; did you mean ‘__half2_raw’?
1319 | return __half2ll_rz(__half(*this));
| ^~~~~~~~~~~~
| __half2_raw
ninja: build stopped: subcommand failed.
It seems like we are missing some headers. Where can we add them?
We have machines with CUDA 11.8 and machines with CUDA 12, and we believe the failure has the same cause on both.
Hi,
Some updates: our machines with H800 can now install successfully, but the A100 machines cannot yet. The H800 machines just needed cuDNN, but the A100 machines still hit the error above even after cuDNN was installed.
Hi, this is a pretty strange error: functions like __half2ushort_rz are declared inside the cuda_fp16.hpp file, which should be in the include directory of your CUDA installation (in this case /usr/local/cuda-11.8/include or /usr/local/cuda-11.8/targets/x86_64-linux/include). Could you confirm that this file exists there?
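One way to run this check is sketched below. It is only an illustration: the helper name find_declarations and the cuda_fp16* glob pattern are assumptions made for this example, and the two directories are the ones mentioned above. It lists every line in the matching headers that mentions a given symbol.

```python
from pathlib import Path


def find_declarations(include_dirs, symbol, pattern="cuda_fp16*"):
    """Return (header_path, line_number) pairs for each line mentioning `symbol`.

    `find_declarations` and `pattern` are hypothetical names for this sketch;
    they are not part of CUDA or TransformerEngine.
    """
    hits = []
    for directory in include_dirs:
        root = Path(directory)
        if not root.is_dir():
            continue  # skip include directories that do not exist on this machine
        for header in sorted(root.glob(pattern)):
            if not header.is_file():
                continue
            text = header.read_text(errors="replace")
            for number, line in enumerate(text.splitlines(), start=1):
                if symbol in line:
                    hits.append((str(header), number))
    return hits


# Directories taken from the compiler command in this thread; adjust as needed.
cuda_include_dirs = [
    "/usr/local/cuda-11.8/include",
    "/usr/local/cuda-11.8/targets/x86_64-linux/include",
]
for path, number in find_declarations(cuda_include_dirs, "__half2ushort_rz"):
    print(f"{path}:{number}")
```

If nothing is printed, either the directories do not exist or the headers do not mention the symbol. If both directories resolve to the same files, each hit will appear twice.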
Hi, yes, it is in /usr/local/cuda-11.8/include, and it seems that __half2ushort_rz is declared there.
Any update on this issue?
Hi @MicPie,
We have been able to install this with newer commits now. Were you trying the stable releases?
I have the same problem on my workstation with an A6000 Ada.
raise RuntimeError(f"Error when running CMake: {e}")
RuntimeError: Error when running CMake: Command '['/usr/bin/cmake', '-S', '/tmp/pip-req-build-hnl1xnl7/transformer_engine', '-B', '/tmp/tmp6vkf06mc', '-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX=/tmp/pip-req-build-hnl1xnl7/build/lib.linux-x86_64-cpython-311']' returned non-zero exit status 1.
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for transformer-engine
@RuiWang1998 Could you help me figure out what I should do? Install cuDNN? My setup: CUDA 11.8, PyTorch 2.1.0, Python 3.11, Ubuntu 22.04.
Hi,
You would have to modify setup.py so that it outputs the actual error message (or run the build commands manually in a terminal), so that we can know exactly what is going on.
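A minimal sketch of the "run it manually" approach follows. It assumes you copy the argument list from your own traceback; run_and_report is a hypothetical helper written for this example, not something from setup.py. It runs a command and echoes everything the command printed when it fails.

```python
import subprocess
import sys


def run_and_report(command):
    """Run `command`; on failure, echo its combined stdout/stderr to stderr.

    Returns (exit_code, captured_output). `run_and_report` is a hypothetical
    helper for this sketch, not part of TransformerEngine.
    """
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout to keep ordering
        text=True,
    )
    if result.returncode != 0:
        print(f"Command failed with exit status {result.returncode}:", file=sys.stderr)
        print(" ".join(command), file=sys.stderr)
        print(result.stdout, file=sys.stderr)
    return result.returncode, result.stdout


# Example call (not executed here): paste the list from your own traceback, e.g.
# run_and_report(["/usr/bin/cmake", "-S", "<source dir>", "-B", "<build dir>", "-GNinja"])
```

Note that the /tmp/pip-req-build-... directories in the pip traceback are temporary, so the source directory should point at a local checkout instead.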
Best, Rui
@RuiWang1998 Could you share which release version you used? I had the same problem. Thanks.
Same issue
File "/aml2/TransformerEngine/setup.py", line 338, in _build_cmake raise RuntimeError(f"Error when running CMake: {e}") RuntimeError: Error when running CMake: Command '['/aml/conda/bin/cmake', '-S', '/aml2/TransformerEngine/transformer_engine', '-B', '/aml2/TransformerEngine/build/cmake', '-DPython_EXECUTABLE=/aml2/ds2/bin/python', '-DPython_INCLUDE_DIR=/aml2/ds2/include/python3.10', '-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX=/aml2/TransformerEngine/build/lib.linux-x86_64-cpython-310', '-GNinja', '-Dpybind11_DIR=/aml2/ds2/lib/python3.10/site-packages/pybind11/share/cmake/pybind11']' returned non-zero exit status 1. [end of output]
The CMake error message should already be printed to stderr, although it is somewhat buried within the Python stack trace from setup.py. It may be helpful to search for "Building CMake extension transformer_engine" within your build logs.
If the error is happening during CMake configuration, it's probably because CUDA or cuDNN is not properly installed. See CUDA instructions at https://github.com/NVIDIA/TransformerEngine/issues/700#issuecomment-1979377899. For cuDNN, make sure CUDNN_PATH is set in your environment.
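As a rough sketch of that log search, assuming the build output was saved to a text file (the file name build.log in the comment and the helper lines_after_marker are hypothetical), the following returns the marker line and the lines that follow it, which is usually where the CMake output starts.

```python
def lines_after_marker(
    log_text,
    marker="Building CMake extension transformer_engine",
    context=40,
):
    """Return the first line containing `marker` plus up to `context` lines after it.

    Returns an empty list if the marker never appears. The helper name and
    the default `context` are assumptions made for this sketch.
    """
    lines = log_text.splitlines()
    for index, line in enumerate(lines):
        if marker in line:
            return lines[index : index + 1 + context]
    return []


# Example use (not executed here), with a hypothetical log file name:
# with open("build.log", errors="replace") as handle:
#     print("\n".join(lines_after_marker(handle.read())))
```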
I solved this issue by simply running this command in the TransformerEngine directory:
git submodule update --init --recursive
I hope this helps.
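To tell whether this applies to a checkout, a quick check is sketched below. It assumes the submodule lives at 3rdparty/cudnn-frontend, which is the path that appears in the compiler command earlier in this thread; the helper name submodule_is_populated is hypothetical.

```python
from pathlib import Path


def submodule_is_populated(repo_root, relative="3rdparty/cudnn-frontend"):
    """Return True if the submodule directory exists and contains any entry.

    An uninitialized submodule is normally an empty directory. The default
    path is taken from the include flags in this thread and is an assumption
    about the repository layout.
    """
    directory = Path(repo_root) / relative
    return directory.is_dir() and any(directory.iterdir())


# Example use (not executed here), run from the TransformerEngine checkout:
# print(submodule_is_populated("."))
```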
I have run into the same problem. The details are:
raise RuntimeError(f"Error when running CMake: {e}") RuntimeError: Error when running CMake: Command '['/usr/bin/cmake', '-S', '/tmp/pip-req-build-yvwm9h7r/transformer_engine', '-B', '/tmp/pip-req-build-yvwm9h7r/build/cmake', DPython_EXECUTABLE=/home/ubuntu/train/aconconda/acondada/envs/yuxunlian/bin/python3.1', '-DPython_INCLUDE_DIR=/home/ubuntu/train/aconconda/acondada/envs/yuxunlian/include/python3.11', '-DCMAKE_BUILD_TYPE=Release', '-DCMAKE_INSTALL_PREFIX=/tmp/pip-req-build-yvwm9h7r/build/lib.linux-x86_64-cpython-311', '-GNinja']' returned non-zero exit status 1.
My environment is: Ubuntu 22.04, CUDA 11.7, Python 3.11, torch 2.3.1, NVIDIA driver 535.183.06. Looking forward to a solution!
Hello, my friend! You can check whether nvcc is on your PATH:
nvcc --version
If an error occurs, you may be able to fix it with export PATH=/usr/local/cuda/bin:$PATH or something similar.
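The same check can be sketched in Python. This assumes the default install location /usr/local/cuda/bin mentioned above; locate_tool is a hypothetical helper name. It looks on PATH first and then in the extra directories.

```python
import os
import shutil


def locate_tool(name, extra_dirs=()):
    """Return the path of `name` from PATH, else from `extra_dirs`, else None.

    `locate_tool` is a hypothetical helper for this sketch.
    """
    found = shutil.which(name)
    if found:
        return found
    if extra_dirs:
        # shutil.which accepts an explicit search path via its `path` argument.
        return shutil.which(name, path=os.pathsep.join(extra_dirs))
    return None


# If this prints a path under /usr/local/cuda/bin while plain `nvcc --version`
# fails in the shell, that directory is missing from PATH.
print(locate_tool("nvcc", ["/usr/local/cuda/bin"]) or "nvcc not found")
```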
@wplf Yes, my nvcc seems OK. The output is below:
ubuntu@ip-172-31-38-93:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
Are there any other solutions?
Can you check your CMake version?
You can install CMake with pip install cmake.
@wplf The CMake version is below:
(yuxunlian) ubuntu@ip-172-31-38-93:~$ cmake --version
cmake version 3.22.1
CMake suite maintained and supported by Kitware (kitware.com/cmake).
Is this version appropriate?
Yes, this is OK. Sorry, I can't help any further.
@wplf No problem! Thank you for your reply!
Any update on this issue? I'm still getting the same error.
If you are experiencing an error that looks like RuntimeError: Error when running CMake, then something has failed in the build process (probably a CMake configuration error or a compilation error). Please look through the build logs to find more details or post enough of the build logs so we can figure out what's going on. To print the maximum amount of information during the build process:
cd transformer_engine
pip install -v -v -v .
Some common build errors and fixes:
- Uninitialized Git submodules: Run git submodule update --init --recursive.
- CMake can't find a C++ compiler: Set CXX in the environment.
- CMake can't find CUDA: Set CUDA_PATH in the environment.
- CMake can't find cuDNN: Set CUDNN_PATH in the environment.
- Invalid dependency versions: Consult TE's requirements. As of TE 1.11, TE requires CUDA 12.0+ and cuDNN 8.1+.
- Hang during compilation: Try disabling parallelism in the build process by setting MAX_JOBS=1 and NVTE_BUILD_THREADS_PER_JOB=1 in the environment. See https://github.com/NVIDIA/TransformerEngine/issues/1077#issuecomment-2389735640 for more guidance.
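The environment-variable items in the list above can be checked before starting a build. The sketch below is only an illustration: check_build_env is a hypothetical helper, and an unset variable is not an error by itself, since CMake may still find the compiler, CUDA, or cuDNN on its own.

```python
import os


def check_build_env(env=None):
    """Return notes about CXX, CUDA_PATH and CUDNN_PATH.

    An empty list means all three are set and the two path variables point
    at existing directories. `check_build_env` is a hypothetical helper.
    """
    env = os.environ if env is None else env
    notes = []
    for variable in ("CUDA_PATH", "CUDNN_PATH"):
        value = env.get(variable)
        if not value:
            notes.append(f"{variable} is not set")
        elif not os.path.isdir(value):
            notes.append(f"{variable} points to a missing directory: {value}")
    if not env.get("CXX"):
        notes.append("CXX is not set")
    return notes


for note in check_build_env():
    print(note)
```

Passing a dictionary instead of reading os.environ makes it easy to try out a candidate configuration before exporting it in the shell.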
I'll lock this issue to make this comment easier for users to find, but please open a new issue if you are encountering a build error (with enough of the build log for us to help).