Can't build the TE wheel via pip (1 error detected in the compilation of "transformer_engine/pytorch/csrc/userbuffers/userbuffers.cu")

Open • tarunvallabh opened this issue 1 year ago • 2 comments

Hi! I'm getting the following error when trying to install TE via pip. I'd appreciate some help figuring out what's going on:

  running egg_info
  creating transformer_engine.egg-info
  writing transformer_engine.egg-info/PKG-INFO
  writing dependency_links to transformer_engine.egg-info/dependency_links.txt
  writing requirements to transformer_engine.egg-info/requires.txt
  writing top-level names to transformer_engine.egg-info/top_level.txt
  writing manifest file 'transformer_engine.egg-info/SOURCES.txt'
  reading manifest file 'transformer_engine.egg-info/SOURCES.txt'
  adding license file 'LICENSE'
  writing manifest file 'transformer_engine.egg-info/SOURCES.txt'
  /scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/setuptools/command/build_py.py:207: _Warning: Package 'transformer_engine.pytorch.csrc' is absent from the `packages` configuration.
  !!
  
          ********************************************************************************
          ############################
          # Package would be ignored #
          ############################
          Python recognizes 'transformer_engine.pytorch.csrc' as an importable package[^1],
          but it is absent from setuptools' `packages` configuration.
  
          This leads to an ambiguous overall configuration. If you want to distribute this
          package, please make sure that 'transformer_engine.pytorch.csrc' is explicitly added
          to the `packages` configuration field.
  
          Alternatively, you can also rely on setuptools' discovery methods
          (for example by using `find_namespace_packages(...)`/`find_namespace:`
          instead of `find_packages(...)`/`find:`).
  
          You can read more about "package discovery" on setuptools documentation page:
  
          - https://setuptools.pypa.io/en/latest/userguide/package_discovery.html
  
          If you don't want 'transformer_engine.pytorch.csrc' to be distributed and are
          already explicitly excluding 'transformer_engine.pytorch.csrc' via
          `find_namespace_packages(...)/find_namespace` or `find_packages(...)/find`,
          you can try to use `exclude_package_data`, or `include-package-data=False` in
          combination with a more fine grained `package-data` configuration.
  
          You can read more about "package data files" on setuptools documentation page:
  
          - https://setuptools.pypa.io/en/latest/userguide/datafiles.html
  
  
          [^1]: For Python, any directory (with suitable naming) can be imported,
                even if it does not contain any `.py` files.
                On the other hand, currently there is no concept of package data
                directory, all directories are treated like packages.
          ********************************************************************************
  
  !!
    check.warn(importable)
  /scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/setuptools/command/build_py.py:207: _Warning: Package 'transformer_engine.pytorch.csrc.extensions' is absent from the `packages` configuration.
  !!
  
          ********************************************************************************
          ############################
          # Package would be ignored #
          ############################
          Python recognizes 'transformer_engine.pytorch.csrc.extensions' as an importable package[^1],
          but it is absent from setuptools' `packages` configuration.
  
          This leads to an ambiguous overall configuration. If you want to distribute this
          package, please make sure that 'transformer_engine.pytorch.csrc.extensions' is explicitly added
          to the `packages` configuration field.
  
          Alternatively, you can also rely on setuptools' discovery methods
          (for example by using `find_namespace_packages(...)`/`find_namespace:`
          instead of `find_packages(...)`/`find:`).
  
          You can read more about "package discovery" on setuptools documentation page:
  
          - https://setuptools.pypa.io/en/latest/userguide/package_discovery.html
  
          If you don't want 'transformer_engine.pytorch.csrc.extensions' to be distributed and are
          already explicitly excluding 'transformer_engine.pytorch.csrc.extensions' via
          `find_namespace_packages(...)/find_namespace` or `find_packages(...)/find`,
          you can try to use `exclude_package_data`, or `include-package-data=False` in
          combination with a more fine grained `package-data` configuration.
  
          You can read more about "package data files" on setuptools documentation page:
  
          - https://setuptools.pypa.io/en/latest/userguide/datafiles.html
  
  
          [^1]: For Python, any directory (with suitable naming) can be imported,
                even if it does not contain any `.py` files.
                On the other hand, currently there is no concept of package data
                directory, all directories are treated like packages.
          ********************************************************************************
  
  !!
    check.warn(importable)
  /scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/setuptools/command/build_py.py:207: _Warning: Package 'transformer_engine.pytorch.csrc.extensions.multi_tensor' is absent from the `packages` configuration.
  !!
  
          ********************************************************************************
          ############################
          # Package would be ignored #
          ############################
          Python recognizes 'transformer_engine.pytorch.csrc.extensions.multi_tensor' as an importable package[^1],
          but it is absent from setuptools' `packages` configuration.
  
          This leads to an ambiguous overall configuration. If you want to distribute this
          package, please make sure that 'transformer_engine.pytorch.csrc.extensions.multi_tensor' is explicitly added
          to the `packages` configuration field.
  
          Alternatively, you can also rely on setuptools' discovery methods
          (for example by using `find_namespace_packages(...)`/`find_namespace:`
          instead of `find_packages(...)`/`find:`).
  
          You can read more about "package discovery" on setuptools documentation page:
  
          - https://setuptools.pypa.io/en/latest/userguide/package_discovery.html
  
          If you don't want 'transformer_engine.pytorch.csrc.extensions.multi_tensor' to be distributed and are
          already explicitly excluding 'transformer_engine.pytorch.csrc.extensions.multi_tensor' via
          `find_namespace_packages(...)/find_namespace` or `find_packages(...)/find`,
          you can try to use `exclude_package_data`, or `include-package-data=False` in
          combination with a more fine grained `package-data` configuration.
  
          You can read more about "package data files" on setuptools documentation page:
  
          - https://setuptools.pypa.io/en/latest/userguide/datafiles.html
  
  
          [^1]: For Python, any directory (with suitable naming) can be imported,
                even if it does not contain any `.py` files.
                On the other hand, currently there is no concept of package data
                directory, all directories are treated like packages.
          ********************************************************************************
  
  !!
    check.warn(importable)
  /scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/setuptools/command/build_py.py:207: _Warning: Package 'transformer_engine.pytorch.csrc.userbuffers' is absent from the `packages` configuration.
  !!
  
          ********************************************************************************
          ############################
          # Package would be ignored #
          ############################
          Python recognizes 'transformer_engine.pytorch.csrc.userbuffers' as an importable package[^1],
          but it is absent from setuptools' `packages` configuration.
  
          This leads to an ambiguous overall configuration. If you want to distribute this
          package, please make sure that 'transformer_engine.pytorch.csrc.userbuffers' is explicitly added
          to the `packages` configuration field.
  
          Alternatively, you can also rely on setuptools' discovery methods
          (for example by using `find_namespace_packages(...)`/`find_namespace:`
          instead of `find_packages(...)`/`find:`).
  
          You can read more about "package discovery" on setuptools documentation page:
  
          - https://setuptools.pypa.io/en/latest/userguide/package_discovery.html
  
          If you don't want 'transformer_engine.pytorch.csrc.userbuffers' to be distributed and are
          already explicitly excluding 'transformer_engine.pytorch.csrc.userbuffers' via
          `find_namespace_packages(...)/find_namespace` or `find_packages(...)/find`,
          you can try to use `exclude_package_data`, or `include-package-data=False` in
          combination with a more fine grained `package-data` configuration.
  
          You can read more about "package data files" on setuptools documentation page:
  
          - https://setuptools.pypa.io/en/latest/userguide/datafiles.html
  
  
          [^1]: For Python, any directory (with suitable naming) can be imported,
                even if it does not contain any `.py` files.
                On the other hand, currently there is no concept of package data
                directory, all directories are treated like packages.
          ********************************************************************************
  
  !!
    check.warn(importable)
  creating build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc
  copying transformer_engine/pytorch/csrc/common.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc
  copying transformer_engine/pytorch/csrc/ts_fp8_op.cpp -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc
  creating build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
  copying transformer_engine/pytorch/csrc/extensions/activation.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
  copying transformer_engine/pytorch/csrc/extensions/apply_rope.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
  copying transformer_engine/pytorch/csrc/extensions/attention.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
  copying transformer_engine/pytorch/csrc/extensions/cast.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
  copying transformer_engine/pytorch/csrc/extensions/gemm.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
  copying transformer_engine/pytorch/csrc/extensions/misc.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
  copying transformer_engine/pytorch/csrc/extensions/normalization.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
  copying transformer_engine/pytorch/csrc/extensions/pybind.cpp -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
  copying transformer_engine/pytorch/csrc/extensions/recipe.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
  copying transformer_engine/pytorch/csrc/extensions/softmax.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
  copying transformer_engine/pytorch/csrc/extensions/transpose.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
  creating build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/multi_tensor
  copying transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_adam.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/multi_tensor
  copying transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_l2norm_kernel.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/multi_tensor
  copying transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_scale_kernel.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/multi_tensor
  copying transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_sgd_kernel.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/multi_tensor
  creating build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/userbuffers
  copying transformer_engine/pytorch/csrc/userbuffers/ipcsocket.cc -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/userbuffers
  copying transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/userbuffers
  copying transformer_engine/pytorch/csrc/userbuffers/userbuffers.cu -> build/lib.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/userbuffers
  running build_ext
  Building CMake extension transformer_engine
  Running command /usr/bin/cmake -S /tmp/pip-req-build-jglepsmr/transformer_engine/common -B /tmp/pip-req-build-jglepsmr/build/cmake -DPython_EXECUTABLE=/scratch/user/u.tv216541/te-dev/bin/python -DPython_INCLUDE_DIR=/scratch/user/u.tv216541/te-dev/include/python3.11 -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/tmp/pip-req-build-jglepsmr/build/lib.linux-x86_64-cpython-311 -Dpybind11_DIR=/tmp/pip-req-build-jglepsmr/.eggs/pybind11-2.13.1-py3.11.egg/pybind11/share/cmake/pybind11
  -- The CUDA compiler identification is NVIDIA 12.1.66
  -- The CXX compiler identification is GNU 11.2.0
  -- Detecting CUDA compiler ABI info
  -- Detecting CUDA compiler ABI info - done
  -- Check for working CUDA compiler: /scratch/user/u.tv216541/te-dev/bin/nvcc - skipped
  -- Detecting CUDA compile features
  -- Detecting CUDA compile features - done
  -- Detecting CXX compiler ABI info
  -- Detecting CXX compiler ABI info - done
  -- Check for working CXX compiler: /scratch/user/u.tv216541/te-dev/bin/c++ - skipped
  -- Detecting CXX compile features
  -- Detecting CXX compile features - done
  -- Found CUDAToolkit: /scratch/user/u.tv216541/te-dev/include (found version "12.1.66")
  -- Looking for C++ include pthread.h
  -- Looking for C++ include pthread.h - found
  -- Performing Test CMAKE_HAVE_LIBC_PTHREAD
  -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
  -- Looking for pthread_create in pthreads
  -- Looking for pthread_create in pthreads - not found
  -- Looking for pthread_create in pthread
  -- Looking for pthread_create in pthread - found
  -- Found Threads: TRUE
  -- cudnn found at /scratch/user/u.tv216541/te-dev/lib/libcudnn.so.
  -- Found LIBRARY: /scratch/user/u.tv216541/te-dev/include
  -- cuDNN: /scratch/user/u.tv216541/te-dev/lib/libcudnn.so
  -- cuDNN: /scratch/user/u.tv216541/te-dev/include
  -- cudnn_adv_infer found at /scratch/user/u.tv216541/te-dev/lib/libcudnn_adv_infer.so.
  -- cudnn_adv_train found at /scratch/user/u.tv216541/te-dev/lib/libcudnn_adv_train.so.
  -- cudnn_cnn_infer found at /scratch/user/u.tv216541/te-dev/lib/libcudnn_cnn_infer.so.
  -- cudnn_cnn_train found at /scratch/user/u.tv216541/te-dev/lib/libcudnn_cnn_train.so.
  -- cudnn_ops_infer found at /scratch/user/u.tv216541/te-dev/lib/libcudnn_ops_infer.so.
  -- cudnn_ops_train found at /scratch/user/u.tv216541/te-dev/lib/libcudnn_ops_train.so.
  -- Found Python: /scratch/user/u.tv216541/te-dev/bin/python (found version "3.11.5") found components: Interpreter Development.Module
  -- Configuring done
  -- Generating done
  CMake Warning:
    Manually-specified variables were not used by the project:
  
      pybind11_DIR
  
  
  -- Build files have been written to: /tmp/pip-req-build-jglepsmr/build/cmake
  Running command /usr/bin/cmake --build /tmp/pip-req-build-jglepsmr/build/cmake
  [  3%] Building CXX object CMakeFiles/transformer_engine.dir/transformer_engine.cpp.o
  [  6%] Building CUDA object CMakeFiles/transformer_engine.dir/transpose/cast_transpose.cu.o
  [  9%] Building CUDA object CMakeFiles/transformer_engine.dir/transpose/transpose.cu.o
  [ 12%] Building CUDA object CMakeFiles/transformer_engine.dir/transpose/cast_transpose_fusion.cu.o
  [ 15%] Building CUDA object CMakeFiles/transformer_engine.dir/transpose/transpose_fusion.cu.o
  [ 18%] Building CUDA object CMakeFiles/transformer_engine.dir/transpose/multi_cast_transpose.cu.o
  [ 21%] Building CUDA object CMakeFiles/transformer_engine.dir/activation/gelu.cu.o
  [ 25%] Building CUDA object CMakeFiles/transformer_engine.dir/fused_attn/fused_attn_f16_max512_seqlen.cu.o
  [ 28%] Building CUDA object CMakeFiles/transformer_engine.dir/fused_attn/fused_attn_f16_arbitrary_seqlen.cu.o
  [ 31%] Building CUDA object CMakeFiles/transformer_engine.dir/activation/relu.cu.o
  [ 34%] Building CUDA object CMakeFiles/transformer_engine.dir/activation/swiglu.cu.o
  [ 37%] Building CUDA object CMakeFiles/transformer_engine.dir/fused_attn/fused_attn_fp8.cu.o
  [ 40%] Building CXX object CMakeFiles/transformer_engine.dir/fused_attn/fused_attn.cpp.o
  [ 43%] Building CUDA object CMakeFiles/transformer_engine.dir/fused_attn/utils.cu.o
  [ 46%] Building CUDA object CMakeFiles/transformer_engine.dir/gemm/cublaslt_gemm.cu.o
  /tmp/pip-req-build-jglepsmr/transformer_engine/common/gemm/cublaslt_gemm.cu(69): warning #550-D: variable "counter" was set but never used
      void *counter = nullptr;
            ^
  
  Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
  
  /tmp/pip-req-build-jglepsmr/transformer_engine/common/gemm/cublaslt_gemm.cu(69): warning #550-D: variable "counter" was set but never used
      void *counter = nullptr;
            ^
  
  Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
  
  /tmp/pip-req-build-jglepsmr/transformer_engine/common/gemm/cublaslt_gemm.cu(69): warning #550-D: variable "counter" was set but never used
      void *counter = nullptr;
            ^
  
  Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
  
  /tmp/pip-req-build-jglepsmr/transformer_engine/common/gemm/cublaslt_gemm.cu(69): warning #550-D: variable "counter" was set but never used
      void *counter = nullptr;
            ^
  
  Remark: The warnings can be suppressed with "-diag-suppress <warning-number>"
  
  [ 50%] Building CXX object CMakeFiles/transformer_engine.dir/layer_norm/ln_api.cpp.o
  [ 53%] Building CUDA object CMakeFiles/transformer_engine.dir/layer_norm/ln_bwd_semi_cuda_kernel.cu.o
  [ 56%] Building CUDA object CMakeFiles/transformer_engine.dir/layer_norm/ln_fwd_cuda_kernel.cu.o
  [ 59%] Building CXX object CMakeFiles/transformer_engine.dir/rmsnorm/rmsnorm_api.cpp.o
  [ 62%] Building CUDA object CMakeFiles/transformer_engine.dir/rmsnorm/rmsnorm_bwd_semi_cuda_kernel.cu.o
  [ 65%] Building CUDA object CMakeFiles/transformer_engine.dir/rmsnorm/rmsnorm_fwd_cuda_kernel.cu.o
  [ 68%] Building CUDA object CMakeFiles/transformer_engine.dir/util/cast.cu.o
  [ 71%] Building CXX object CMakeFiles/transformer_engine.dir/util/cuda_driver.cpp.o
  [ 75%] Building CXX object CMakeFiles/transformer_engine.dir/util/cuda_runtime.cpp.o
  [ 78%] Building CXX object CMakeFiles/transformer_engine.dir/util/rtc.cpp.o
  [ 81%] Building CXX object CMakeFiles/transformer_engine.dir/util/system.cpp.o
  [ 84%] Building CUDA object CMakeFiles/transformer_engine.dir/fused_softmax/scaled_masked_softmax.cu.o
  [ 87%] Building CUDA object CMakeFiles/transformer_engine.dir/fused_softmax/scaled_upper_triang_masked_softmax.cu.o
  [ 90%] Building CUDA object CMakeFiles/transformer_engine.dir/fused_softmax/scaled_aligned_causal_masked_softmax.cu.o
  [ 93%] Building CUDA object CMakeFiles/transformer_engine.dir/fused_rope/fused_rope.cu.o
  [ 96%] Building CUDA object CMakeFiles/transformer_engine.dir/recipe/delayed_scaling.cu.o
  [100%] Linking CXX shared library libtransformer_engine.so
  [100%] Built target transformer_engine
  Running command /usr/bin/cmake --install /tmp/pip-req-build-jglepsmr/build/cmake
  -- Install configuration: "Release"
  -- Installing: /tmp/pip-req-build-jglepsmr/build/lib.linux-x86_64-cpython-311/./libtransformer_engine.so
  -- Set runtime path of "/tmp/pip-req-build-jglepsmr/build/lib.linux-x86_64-cpython-311/./libtransformer_engine.so" to ""
  /scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/utils/cpp_extension.py:428: UserWarning: There are no g++ version bounds defined for CUDA version 12.1
    warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
  building 'transformer_engine_torch' extension
  creating build/temp.linux-x86_64-cpython-311
  creating build/temp.linux-x86_64-cpython-311/transformer_engine
  creating build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch
  creating build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc
  creating build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions
  creating build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/multi_tensor
  creating build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/userbuffers
  /scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/common.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/common.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  /scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/activation.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/activation.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  /scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/apply_rope.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/apply_rope.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  /scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/attention.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/attention.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  /scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/cast.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/cast.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  /scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/gemm.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/gemm.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  /scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/misc.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/misc.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  /scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_adam.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_adam.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  /scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_l2norm_kernel.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_l2norm_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  /scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_scale_kernel.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_scale_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  /scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_sgd_kernel.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/multi_tensor/multi_tensor_sgd_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  /scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/normalization.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/normalization.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  gcc -pthread -B /scratch/user/u.tv216541/te-dev/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /scratch/user/u.tv216541/te-dev/include -fPIC -O2 -isystem /scratch/user/u.tv216541/te-dev/include -fPIC -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/pybind.cpp -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/pybind.o -O3 -fvisibility=hidden -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  /scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/recipe.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/recipe.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  /scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/softmax.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/softmax.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  /scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/extensions/transpose.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/extensions/transpose.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  gcc -pthread -B /scratch/user/u.tv216541/te-dev/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /scratch/user/u.tv216541/te-dev/include -fPIC -O2 -isystem /scratch/user/u.tv216541/te-dev/include -fPIC -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/ts_fp8_op.cpp -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/ts_fp8_op.o -O3 -fvisibility=hidden -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  gcc -pthread -B /scratch/user/u.tv216541/te-dev/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /scratch/user/u.tv216541/te-dev/include -fPIC -O2 -isystem /scratch/user/u.tv216541/te-dev/include -fPIC -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/userbuffers/ipcsocket.cc -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/userbuffers/ipcsocket.o -O3 -fvisibility=hidden -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  gcc -pthread -B /scratch/user/u.tv216541/te-dev/compiler_compat -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /scratch/user/u.tv216541/te-dev/include -fPIC -O2 -isystem /scratch/user/u.tv216541/te-dev/include -fPIC -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.cpp -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/userbuffers/userbuffers-host.o -O3 -fvisibility=hidden -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  /scratch/user/u.tv216541/te-dev/bin/nvcc -I/tmp/pip-req-build-jglepsmr/transformer_engine -I/tmp/pip-req-build-jglepsmr/transformer_engine/common -I/tmp/pip-req-build-jglepsmr/transformer_engine/common/include -I/tmp/pip-req-build-jglepsmr/transformer_engine/pytorch/csrc -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/torch/csrc/api/include -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/TH -I/scratch/user/u.tv216541/te-dev/lib/python3.11/site-packages/torch/include/THC -I/scratch/user/u.tv216541/te-dev/include -I/scratch/user/u.tv216541/te-dev/include/python3.11 -c transformer_engine/pytorch/csrc/userbuffers/userbuffers.cu -o build/temp.linux-x86_64-cpython-311/transformer_engine/pytorch/csrc/userbuffers/userbuffers.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options '-fPIC' -O3 -gencode arch=compute_70,code=sm_70 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_BFLOAT16_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ -U__CUDA_NO_BFLOAT162_OPERATORS__ -U__CUDA_NO_BFLOAT162_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --threads 4 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=transformer_engine_torch -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
  /scratch/user/u.tv216541/te-dev/include/cuda_fp16.hpp(2724): error: invalid redeclaration of type name "nv_bfloat16" (declared at line 2837 of /scratch/user/u.tv216541/te-dev/include/cuda_bf16.hpp)
    typedef __half nv_bfloat16;
                   ^
  
  1 error detected in the compilation of "transformer_engine/pytorch/csrc/userbuffers/userbuffers.cu".
  /scratch/user/u.tv216541/te-dev/include/cuda_fp16.hpp(2724): error: invalid redeclaration of type name "nv_bfloat16" (declared at line 2837 of /scratch/user/u.tv216541/te-dev/include/cuda_bf16.hpp)
    typedef __half nv_bfloat16;
                   ^
  
  1 error detected in the compilation of "transformer_engine/pytorch/csrc/userbuffers/userbuffers.cu".
  error: command '/scratch/user/u.tv216541/te-dev/bin/nvcc' failed with exit code 255
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for transformer_engine
Running setup.py clean for transformer_engine
Failed to build transformer_engine
ERROR: Could not build wheels for transformer_engine, which is required to install pyproject.toml-based projects

tarunvallabh, Aug 08 '24 15:08

Are you building the 1.9 release or the main branch? This looks like an error that was fixed with https://github.com/NVIDIA/TransformerEngine/pull/949.

If that doesn't fix it, perhaps it's something with the CUDA version? The error message says that cuda_fp16.hpp is typedef-ing nv_bfloat16 to __half, i.e. replacing BF16 with FP16, which seems wrong to me. I haven't been able to easily dig up your CUDA version (12.1.66), but I don't see this logic in 12.1.55 or 12.1.105.

timmoon10, Aug 08 '24 22:08
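
One way to check the suspicion above is to scan the local `cuda_fp16.hpp` for the typedef that nvcc flagged. The following is a minimal sketch, not part of TransformerEngine or CUDA: the helper names `find_bf16_alias` and `scan_header` are hypothetical, and the pattern is taken only from the error text in the log.

```python
import re
from pathlib import Path

# Matches the line nvcc reported: "typedef __half nv_bfloat16;"
# (pattern derived from the error message in the log, nothing else).
_SUSPECT = re.compile(r"\s*typedef\s+__half\s+nv_bfloat16\s*;")


def find_bf16_alias(header_text: str) -> list[int]:
    """Return 1-based line numbers where nv_bfloat16 is typedef'd to __half."""
    hits = []
    for lineno, line in enumerate(header_text.splitlines(), start=1):
        if _SUSPECT.match(line):
            hits.append(lineno)
    return hits


def scan_header(path: str) -> list[int]:
    """Read a header file from disk and scan it (hypothetical convenience wrapper)."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return find_bf16_alias(text)
```

Calling `scan_header` on the `cuda_fp16.hpp` named in the error message would show whether that header really contains the alias; an empty list would suggest the header on disk differs from the one nvcc picked up.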

I think the issue slipped into the TE v1.8 release: I had the same installation issue, which was resolved by cherry-picking https://github.com/NVIDIA/TransformerEngine/pull/949.

viclzhu, Aug 09 '24 17:08

I've gone ahead and cherry-picked #949 into the 1.8 release.

timmoon10, Aug 13 '24 22:08
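
With the fix cherry-picked, retrying the install from the updated release branch is the natural next step. The sketch below only builds the pip command for a given git ref; the helper name `pip_install_args` is hypothetical, and the branch name `release_v1.8` in the comment is an assumption about how the repository names its release branches, so check the actual branch list before using it.

```python
import sys

# Repository URL as it appears in the links in this thread.
REPO_URL = "https://github.com/NVIDIA/TransformerEngine.git"


def pip_install_args(ref: str) -> list[str]:
    """Build a pip command that installs TransformerEngine from a git ref.

    `ref` can be a branch, tag, or commit hash; pip's VCS support
    accepts the form git+<url>@<ref>.
    """
    return [sys.executable, "-m", "pip", "install", f"git+{REPO_URL}@{ref}"]


# Example (assumed branch name, not confirmed by the thread):
#   import subprocess
#   subprocess.run(pip_install_args("release_v1.8"), check=True)
```

Using `sys.executable -m pip` keeps the install inside the same environment as the running interpreter, which matters here because the log shows nvcc and Python both living in a single environment directory.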