TransformerEngine
Can't build the TE wheel via pip (1 error detected in the compilation of "transformer_engine/pytorch/csrc/userbuffers/userbuffers.cu")
Hi! I'm getting the following error when trying to install TE via pip. I would appreciate some help figuring out what is going on:
running egg_info
creating transformer_engine.egg-info
writing transformer_engine.egg-info/PKG-INFO
writing dependency_links to transformer_engine.egg-info/dependency_links.txt
writing requirements to transformer_engine.egg-info/requires.txt
writing top-level names to transformer_engine.egg-info/top_level.txt
writing manifest file 'transformer_engine.egg-info/SOURCES.txt'
reading manifest file 'transformer_engine.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'transformer_engine.egg-info/SOURCES.txt'
/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/setuptools/command/build_py.py:207: _Warning: Package 'transformer_engine.pytorch.csrc' is absent from the `packages` configuration.
!!
********************************************************************************
############################
# Package would be ignored #
############################
Python recognizes 'transformer_engine.pytorch.csrc' as an importable package[^1],
but it is absent from setuptools' `packages` configuration.
This leads to an ambiguous overall configuration. If you want to distribute this
package, please make sure that 'transformer_engine.pytorch.csrc' is explicitly added
to the `packages` configuration field.
Alternatively, you can also rely on setuptools' discovery methods
(for example by using `find_namespace_packages(...)`/`find_namespace:`
instead of `find_packages(...)`/`find:`).
You can read more about "package discovery" on setuptools documentation page:
- https://setuptools.pypa.io/en/latest/userguide/package_discovery.html
If you don't want 'transformer_engine.pytorch.csrc' to be distributed and are
already explicitly excluding 'transformer_engine.pytorch.csrc' via
`find_namespace_packages(...)/find_namespace` or `find_packages(...)/find`,
you can try to use `exclude_package_data`, or `include-package-data=False` in
combination with a more fine grained `package-data` configuration.
You can read more about "package data files" on setuptools documentation page:
- https://setuptools.pypa.io/en/latest/userguide/datafiles.html
[^1]: For Python, any directory (with suitable naming) can be imported,
even if it does not contain any `.py` files.
On the other hand, currently there is no concept of package data
directory, all directories are treated like packages.
********************************************************************************
!!
check.warn(importable)
/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/setuptools/command/build_py.py:207: _Warning: Package 'transformer_engine.pytorch.csrc.extensions' is absent from the `packages` configuration.
!!
********************************************************************************
############################
# Package would be ignored #
############################
Python recognizes 'transformer_engine.pytorch.csrc.extensions' as an importable package[^1],
but it is absent from setuptools' `packages` configuration.
This leads to an ambiguous overall configuration. If you want to distribute this
package, please make sure that 'transformer_engine.pytorch.csrc.extensions' is explicitly added
to the `packages` configuration field.
Alternatively, you can also rely on setuptools' discovery methods
(for example by using `find_namespace_packages(...)`/`find_namespace:`
instead of `find_packages(...)`/`find:`).
You can read more about "package discovery" on setuptools documentation page:
- https://setuptools.pypa.io/en/latest/userguide/package_discovery.html
If you don't want 'transformer_engine.pytorch.csrc.extensions' to be distributed and are
already explicitly excluding 'transformer_engine.pytorch.csrc.extensions' via
`find_namespace_packages(...)/find_namespace` or `find_packages(...)/find`,
you can try to use `exclude_package_data`, or `include-package-data=False` in
combination with a more fine grained `package-data` configuration.
You can read more about "package data files" on setuptools documentation page:
- https://setuptools.pypa.io/en/latest/userguide/datafiles.html
[^1]: For Python, any directory (with suitable naming) can be imported,
even if it does not contain any `.py` files.
On the other hand, currently there is no concept of package data
directory, all directories are treated like packages.
********************************************************************************
!!
check.warn(importable)
/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/setuptools/command/build_py.py:207: _Warning: Package 'transformer_engine.pytorch.csrc.extensions.multi_tensor' is absent from the `packages` configuration.
!!
********************************************************************************
############################
# Package would be ignored #
############################
Python recognizes 'transformer_engine.pytorch.csrc.extensions.multi_tensor' as an importable package[^1],
but it is absent from setuptools' `packages` configuration.
This leads to an ambiguous overall configuration. If you want to distribute this
package, please make sure that 'transformer_engine.pytorch.csrc.extensions.multi_tensor' is explicitly added
to the `packages` configuration field.
Alternatively, you can also rely on setuptools' discovery methods
(for example by using `find_namespace_packages(...)`/`find_namespace:`
instead of `find_packages(...)`/`find:`).
You can read more about "package discovery" on setuptools documentation page:
- https://setuptools.pypa.io/en/latest/userguide/package_discovery.html
If you don't want 'transformer_engine.pytorch.csrc.extensions.multi_tensor' to be distributed and are
already explicitly excluding 'transformer_engine.pytorch.csrc.extensions.multi_tensor' via
`find_namespace_packages(...)/find_namespace` or `find_packages(...)/find`,
you can try to use `exclude_package_data`, or `include-package-data=False` in
combination with a more fine grained `package-data` configuration.
You can read more about "package data files" on setuptools documentation page:
- https://setuptools.pypa.io/en/latest/userguide/datafiles.html
[^1]: For Python, any directory (with suitable naming) can be imported,
even if it does not contain any `.py` files.
On the other hand, currently there is no concept of package data
directory, all directories are treated like packages.
********************************************************************************
!!
check.warn(importable)
/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/setuptools/command/build_py.py:207: _Warning: Package 'transformer_engine.pytorch.csrc.userbuffers' is absent from the `packages` configuration.
!!
********************************************************************************
############################
# Package would be ignored #
############################
Python recognizes 'transformer_engine.pytorch.csrc.userbuffers' as an importable package[^1],
but it is absent from setuptools' `packages` configuration.
This leads to an ambiguous overall configuration. If you want to distribute this
package, please make sure that 'transformer_engine.pytorch.csrc.userbuffers' is explicitly added
to the `packages` configuration field.
Alternatively, you can also rely on setuptools' discovery methods
(for example by using `find_namespace_packages(...)`/`find_namespace:`
instead of `find_packages(...)`/`find:`).
You can read more about "package discovery" on setuptools documentation page:
- https://setuptools.pypa.io/en/latest/userguide/package_discovery.html
If you don't want 'transformer_engine.pytorch.csrc.userbuffers' to be distributed and are
already explicitly excluding 'transformer_engine.pytorch.csrc.userbuffers' via
`find_namespace_packages(...)/find_namespace` or `find_packages(...)/find`,
you can try to use `exclude_package_data`, or `include-package-data=False` in
combination with a more fine grained `package-data` configuration.
You can read more about "package data files" on setuptools documentation page:
- https://setuptools.pypa.io/en/latest/userguide/datafiles.html
[^1]: For Python, any directory (with suitable naming) can be imported,
even if it does not contain any `.py` files.
On the other hand, currently there is no concept of package data
directory, all directories are treated like packages.
********************************************************************************
!!
check.warn(importable)
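A note on footnote [^1] in the setuptools warnings above: a directory without an `__init__.py` can still be imported as a namespace package, which appears to be why setuptools flags the `csrc` source directories. The sketch below illustrates that behaviour with the standard library only. The names `demo_ns_pkg`, `csrc`, and `kernel.cu` are hypothetical and are not taken from the TE source tree. These messages are warnings, and the log shows the build continuing past them.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_source_only_dir() -> tuple[bool, list[str]]:
    """Import a directory that holds no .py files and report what Python sees."""
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical layout: demo_ns_pkg/csrc/kernel.cu, with no __init__.py anywhere.
        src_dir = Path(root) / "demo_ns_pkg" / "csrc"
        src_dir.mkdir(parents=True)
        (src_dir / "kernel.cu").write_text("// placeholder CUDA source\n")

        sys.path.insert(0, root)
        importlib.invalidate_caches()
        try:
            module = importlib.import_module("demo_ns_pkg.csrc")
            # Namespace packages have no backing file, only a search path.
            has_no_file = getattr(module, "__file__", None) is None
            search_path = [str(p) for p in module.__path__]
        finally:
            # Undo the changes to the interpreter state.
            sys.path.remove(root)
            sys.modules.pop("demo_ns_pkg.csrc", None)
            sys.modules.pop("demo_ns_pkg", None)
    return has_no_file, search_path


is_namespace, locations = import_source_only_dir()
print(is_namespace)
print(locations)
```

The import succeeds even though the directory contains only a `.cu` file, which matches the situation the warning describes.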
creating build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc
copying transformer_engine/pytorch/csrc/common.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc
copying transformer_engine/pytorch/csrc/ts_fp8_op.cpp -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc
creating build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
copying transformer_engine/pytorch/csrc/extensions/activation.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
copying transformer_engine/pytorch/csrc/extensions/apply_rope.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
copying transformer_engine/pytorch/csrc/extensions/attention.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
copying transformer_engine/pytorch/csrc/extensions/cast.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
copying transformer_engine/pytorch/csrc/extensions/gemm.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
copying transformer_engine/pytorch/csrc/extensions/misc.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
copying transformer_engine/pytorch/csrc/extensions/normalization.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
copying transformer_engine/pytorch/csrc/extensions/pybind.cpp -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
copying transformer_engine/pytorch/csrc/extensions/recipe.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
copying transformer_engine/pytorch/csrc/extensions/softmax.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
copying transformer_engine/pytorch/csrc/extensions/transpose.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
creating build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/multi_tensor
copying transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_adam.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/multi_tensor
copying transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_l2norm_kernel.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/multi_tensor
copying transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_scale_kernel.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/multi_tensor
copying transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_sgd_kernel.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/multi_tensor
creating build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/userbuffers
copying transformer_engine/pytorch/csrc/userbuffers/ipcsocket.cc -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/userbuffers
copying transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/userbuffers
copying transformer_engine/pytorch/csrc/userbuffers/userbuffers.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/userbuffers
running build_ext
Building CMake extension transformer_engine
Running command /usr/bin/cmake -S /tmp/pip-req-build-jglepsmr/transformer_engine/common -B /tmp/pip-req-build-jglepsmr/build/cmake -DPython_EXECUTABLE=/scratch/user/u.tv216541/te-dev/bin/python -DPython_INCLUDE_DIR=/scratch/user/u.tv216541/te-dev/include/python3.11 -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/tmp/pip-req-build-jglepsmr/build/lib.linux-x86_64-cpython-311 -Dpybind11_DIR=/tmp/pip-req-build-jglepsmr/.eggs/pybind11-2.13.1-py3.11.egg/pybind11/share/cmake/pybind11
-- The CUDA compiler identification is NVIDIA 12.1.66
-- The CXX compiler identification is GNU 11.2.0
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /scratch/user/u.tv216541/te-dev/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /scratch/user/u.tv216541/te-dev/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found CUDAToolkit: /scratch/user/u.tv216541/te-dev/include (found version "12.1.66")
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- cudnn found at /scratch/user/u.tv216541/te-dev/lib/libcudnn.so.
-- Found LIBRARY: /scratch/user/u.tv216541/te-dev/include
-- cuDNN: /scratch/user/u.tv216541/te-dev/lib/libcudnn.so
-- cuDNN: /scratch/user/u.tv216541/te-dev/include
-- cudnn_adv_infer found at /scratch/user/u.tv216541/te-dev/lib/libcudnn_adv_infer.so.
-- cudnn_adv_train found at /scratch/user/u.tv216541/te-dev/lib/libcudnn_adv_train.so.
-- cudnn_cnn_infer found at /scratch/user/u.tv216541/te-dev/lib/libcudnn_cnn_infer.so.
-- cudnn_cnn_train found at /scratch/user/u.tv216541/te-dev/lib/libcudnn_cnn_train.so.
-- cudnn_ops_infer found at /scratch/user/u.tv216541/te-dev/lib/libcudnn_ops_infer.so.
-- cudnn_ops_train found at /scratch/user/u.tv216541/te-dev/lib/libcudnn_ops_train.so.
-- Found Python: /scratch/user/u.tv216541/te-dev/bin/python (found version "3.11.5") found components: Interpreter Development.Module
-- Configuring done
-- Generating done
CMake Warning:
Manually-specified variables were not used by the project:
pybind11_DIR
-- Build files have been written to: /tmp/pip-req-build-jglepsmr/build/cmake
Running command /usr/bin/cmake --build /tmp/pip-req-build-jglepsmr/build/cmake
[ 3%] Building CXX object CMakeFiles/transformer_engine.dir/transformer_engine.cpp.o
[ 6%] Building CUDA object CMakeFiles/transformer_engine.dir/transpose/cast_transpose.cu.o
[ 9%] Building CUDA object CMakeFiles/transformer_engine.dir/transpose/transpose.cu.o
[ 12%] Building CUDA object CMakeFiles/transformer_engine.dir/transpose/cast_transpose_fusion.cu.o
[ 15%] Building CUDA object CMakeFiles/transformer_engine.dir/transpose/transpose_fusion.cu.o
[ 18%] Building CUDA object CMakeFiles/transformer_engine.dir/transpose/multi_cast_transpose.cu.o
[ 21%] Building CUDA object CMakeFiles/transformer_engine.dir/activation/gelu.cu.o
[ 25%] Building CUDA object CMakeFiles/transformer_engine.dir/fused_attn/fused_attn_f16_max512_seqlen.cu.o
[ 28%] Building CUDA object CMakeFiles/transformer_engine.dir/fused_attn/fused_attn_f16_arbitrary_seqlen.cu.o
[ 31%] Building CUDA object CMakeFiles/transformer_engine.dir/activation/relu.cu.o
[ 34%] Building CUDA object CMakeFiles/transformer_engine.dir/activation/swiglu.cu.o
[ 37%] Building CUDA object CMakeFiles/transformer_engine.dir/fused_attn/fused_attn_fp8.cu.o
[ 40%] Building CXX object CMakeFiles/transformer_engine.dir/fused_attn/fused_attn.cpp.o
[ 43%] Building CUDA object CMakeFiles/transformer_engine.dir/fused_attn/utils.cu.o
[ 46%] Building CUDA object CMakeFiles/transformer_engine.dir/gemm/cublaslt_gemm.cu.o
/tmp/pip-req-build-jglepsmr/transformer_engine/common/gemm/cublaslt_gemm.cu(69): warning #550-D: variable "counter" was set but never used
void *counter = nullptr;
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/tmp/pip-req-build-jglepsmr/transformer_engine/common/gemm/cublaslt_gemm.cu(69): warning #550-D: variable "counter" was set but never used
void *counter = nullptr;
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/tmp/pip-req-build-jglepsmr/transformer_engine/common/gemm/cublaslt_gemm.cu(69): warning #550-D: variable "counter" was set but never used
void *counter = nullptr;
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
/tmp/pip-req-build-jglepsmr/transformer_engine/common/gemm/cublaslt_gemm.cu(69): warning #550-D: variable "counter" was set but never used
void *counter = nullptr;
^
Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
[ 50%] Building CXX object CMakeFiles/transformer_engine.dir/layer_norm/ln_api.cpp.o
[ 53%] Building CUDA object CMakeFiles/transformer_engine.dir/layer_norm/ln_bwd_semi_cuda_kernel.cu.o
[ 56%] Building CUDA object CMakeFiles/transformer_engine.dir/layer_norm/ln_fwd_cuda_kernel.cu.o
[ 59%] Building CXX object CMakeFiles/transformer_engine.dir/rmsnorm/rmsnorm_api.cpp.o
[ 62%] Building CUDA object CMakeFiles/transformer_engine.dir/rmsnorm/rmsnorm_bwd_semi_cuda_kernel.cu.o
[ 65%] Building CUDA object CMakeFiles/transformer_engine.dir/rmsnorm/rmsnorm_fwd_cuda_kernel.cu.o
[ 68%] Building CUDA object CMakeFiles/transformer_engine.dir/util/cast.cu.o
[ 71%] Building CXX object CMakeFiles/transformer_engine.dir/util/cuda_driver.cpp.o
[ 75%] Building CXX object CMakeFiles/transformer_engine.dir/util/cuda_runtime.cpp.o
[ 78%] Building CXX object CMakeFiles/transformer_engine.dir/util/rtc.cpp.o
[ 81%] Building CXX object CMakeFiles/transformer_engine.dir/util/system.cpp.o
[ 84%] Building CUDA object CMakeFiles/transformer_engine.dir/fused_softmax/scaled_masked_softmax.cu.o
[ 87%] Building CUDA object CMakeFiles/transformer_engine.dir/fused_softmax/scaled_upper_triang_masked_softmax.cu.o
[ 90%] Building CUDA object CMakeFiles/transformer_engine.dir/fused_softmax/scaled_aligned_causal_masked_softmax.cu.o
[ 93%] Building CUDA object CMakeFiles/transformer_engine.dir/fused_rope/fused_rope.cu.o
[ 96%] Building CUDA object CMakeFiles/transformer_engine.dir/recipe/delayed_scaling.cu.o
[100%] Linking CXX shared library libtransformer_engine.so
[100%] Built target transformer_engine
Running command /usr/bin/cmake --install /tmp/pip-req-build-jglepsmr/build/cmake
-- Install configuration: "Release"
-- Installing: /tmp/pip-req-build-jglepsmr/build/lib.linux-x86_64-cpython-311/./libtransformer_engine.so
-- Set runtime path of "/tmp/pip-req-build-jglepsmr/build/lib.linux-x86_64-cpython-311/./libtransformer_engine.so" to ""
/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/utils/cpp_extension.py:428: UserWarning: There are no g++ version bounds defined for CUDA version 12.1
warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
building 'transformer_engine_torch' extension
creating build/temp.linux-x86_64-cpython-311
creating build/temp.linux-x86_64-cpython-311/transformer_engine
creating build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch
creating build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc
creating build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
creating build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/multi_tensor
creating build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/userbuffers
/scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/common.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/common.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
/scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/activation.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/activation.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
/scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/apply_rope.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/apply_rope.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
/scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/attention.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/attention.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
/scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/cast.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/cast.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
/scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/gemm.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/gemm.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
/scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/misc.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/misc.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
/scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_adam.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_adam.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
/scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_l2norm_kernel.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_l2norm_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
/scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_scale_kernel.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_scale_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
/scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_sgd_kernel.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_sgd_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
/scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/normalization.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/normalization.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
gcc -pthread -B /scratch/user/u.tv216541/te-dev/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /scratch/user/u.tv216541/te-dev/include -fPIC -O2 -isystem /scratch/user/u.tv216541/te-dev/include -fPIC -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/pybind.cpp -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/pybind.o -O3 -fvisibility=hidden -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
/scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/recipe.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/recipe.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
/scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/softmax.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/softmax.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
/scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/transpose.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/transpose.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
gcc -pthread -B /scratch/user/u.tv216541/te-dev/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /scratch/user/u.tv216541/te-dev/include -fPIC -O2 -isystem /scratch/user/u.tv216541/te-dev/include -fPIC -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/ts_fp8_op.cpp -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/ts_fp8_op.o -O3 -fvisibility=hidden -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
gcc -pthread -B /scratch/user/u.tv216541/te-dev/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /scratch/user/u.tv216541/te-dev/include -fPIC -O2 -isystem /scratch/user/u.tv216541/te-dev/include -fPIC -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/userbuffers/ipcsocket.cc -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/userbuffers/ipcsocket.o -O3 -fvisibility=hidden -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
gcc -pthread -B /scratch/user/u.tv216541/te-dev/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /scratch/user/u.tv216541/te-dev/include -fPIC -O2 -isystem /scratch/user/u.tv216541/te-dev/include -fPIC -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.o -O3 -fvisibility=hidden -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
/scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/userbuffers/userbuffers.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/userbuffers/userbuffers.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
/scratch/user/u.tv216541/te-dev/include/cuda_fp16.hpp(2724): error: invalid redeclaration of type name "nv_bfloat16" (declared at line 2837 of /scratch/user/u.tv216541/te-dev/include/cuda_bf16.hpp)
typedef __half nv_bfloat16;
^
1 error detected in the compilation of "transformer_engine/pytorch/csrc/userbuffers/userbuffers.cu".
error: command '/scratch/user/u.tv216541/te-dev/bin/nvcc' failed with exit code 255
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for transformer_engine
Running setup.py clean for transformer_engine
Failed to build transformer_engine
ERROR: Could not build wheels for transformer_engine, which is required to install pyproject.toml-based projects
Are you building the 1.9 release or the main branch? This looks like an error that was fixed with https://github.com/NVIDIA/TransformerEngine/pull/949.
If that doesn't fix it, perhaps the CUDA version is the problem? The error message says that cuda_fp16.hpp redeclares nv_bfloat16 as a typedef of __half (that is, it aliases BF16 to FP16), which seems wrong to me. I couldn't easily find a copy of your CUDA version (12.1.66) to check, but I don't see this logic in 12.1.55 or 12.1.105.
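To check whether a given CUDA install really contains that typedef, something like the following might help. It is only a rough sketch: the helper name `find_bf16_typedefs` is made up for illustration, the header names are taken from the error message, and the regular expression assumes the typedef sits on a single line as shown in the log.

```python
import re
from pathlib import Path

# Matches single-line typedefs that declare the nv_bfloat16 type name,
# e.g. "typedef __half nv_bfloat16;" as quoted in the error message.
TYPEDEF_RE = re.compile(r"^\s*typedef\s+(\S+)\s+nv_bfloat16\s*;")


def find_bf16_typedefs(include_dir, headers=("cuda_fp16.hpp", "cuda_bf16.hpp")):
    """Return (header, line_number, aliased_type) for each nv_bfloat16 typedef.

    Hypothetical helper: it only scans the listed header files as text and
    does not run the preprocessor, so typedefs guarded by #if blocks are
    reported even if they would be compiled out.
    """
    hits = []
    for name in headers:
        path = Path(include_dir) / name
        if not path.is_file():
            continue  # header not present in this toolkit, skip it
        lines = path.read_text(errors="replace").splitlines()
        for lineno, line in enumerate(lines, start=1):
            match = TYPEDEF_RE.match(line)
            if match:
                hits.append((name, lineno, match.group(1)))
    return hits
```

Running `find_bf16_typedefs("/scratch/user/u.tv216541/te-dev/include")` on the environment from the log should show whether cuda_fp16.hpp there declares nv_bfloat16 in terms of `__half`. A hit in cuda_fp16.hpp would point at the toolkit headers in that environment rather than at TE itself.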
I think the issue slipped into the TE v1.8 release: I hit the same installation error, and cherry-picking https://github.com/NVIDIA/TransformerEngine/pull/949 resolved it.
I've gone ahead and cherry-picked #949 into the 1.8 release.