TransformerEngine build problem with torch 2.x and latest GCC 12

Hi Folks,

I'm hitting a strange issue. Has anyone tried building it with torch 2.x?

/home/spyroot/miniconda3/envs/test/lib/python3.10/site-packages/torch/include/pybind11/detail/../cast.h:42:120: error: expected template-name before ‘<’ token
   42 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                        ^
/home/spyroot/miniconda3/envs/test/lib/python3.10/site-packages/torch/include/pybind11/detail/../cast.h:42:120: error: expected identifier before ‘<’ token
/home/spyroot/miniconda3/envs/test/lib/python3.10/site-packages/torch/include/pybind11/detail/../cast.h:42:123: error: expected primary-expression before ‘>’ token
   42 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                                           ^
/home/spyroot/miniconda3/envs/test/lib/python3.10/site-packages/torch/include/pybind11/detail/../cast.h:42:126: error: expected primary-expression before ‘)’ token
   42 |     return caster.operator typename make_caster<T>::template cast_op_type<T>();
      |                                                                                                               

I'm compiling with CUDA 12.1 and didn't hit any other issues.

CUDA_DIR=/usr/local/cuda
PATH="$CUDA_DIR/bin:$PATH"
CXXFLAGS='-Wno-maybe-uninitialized -Wno-uninitialized -Wno-free-nonheap-object -Wno-nonnull'
CFLAGS='-Wno-maybe-uninitialized -Wno-uninitialized -Wno-free-nonheap-object -Wno-nonnull'
TORCH_CUDA_ARCH_LIST="8.0 8.6 8.7 8.9 9.0"
CMAKE_CUDA_ARCHITECTURES="80;86;87;89;90"
CMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
CMAKE_BUILD_TYPE=Release
python setup.py build -j 4

/home/spyroot/miniconda3/envs/test/lib/python3.10/site-packages/setuptools/dist.py:529: UserWarning: Normalizing '0.9.0dev ' to '0.9.0.dev0'
  warnings.warn(tmpl.format(**locals()))
running build
running build_py
running build_ext
Building CMake extensions!
Running CMake in build/temp.linux-x86_64-cpython-310/Release:
cmake /home/spyroot/dev/build/test/TransformerEngine/transformer_engine -DCMAKE_BUILD_TYPE=Release -DCMAKE_LIBRARY_OUTPUT_DIRECTORY_RELEASE=/home/spyroot/dev/dev/test/TransformerEngine/build/lib.linux-x86_64-cpython-310
cmake --build . --config Release
-- cudnn found at /usr/lib/x86_64-linux-gnu/libcudnn.so.
-- cudnn_adv_infer found at /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.
-- cudnn_adv_train found at /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.
-- cudnn_cnn_infer found at /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.
-- cudnn_cnn_train found at /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.
-- cudnn_ops_infer found at /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.
-- cudnn_ops_train found at /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.
-- cuDNN: /usr/lib/x86_64-linux-gnu/libcudnn.so
-- cuDNN: /usr/include
-- Configuring done
-- Generating done
-- Build files have been written to: /home/spyroot/dev/build/test/TransformerEngine/build/temp.linux-x86_64-cpython-310/Release

May 09 '23 17:05 spyroot

Hmm, the error seems to come from a header shipped with PyTorch, not from TE. We do build TE with PyTorch 2 in our CI (current NGC PyTorch containers are based on PyTorch 2), although I don't believe we have tried GCC 12. Could you try with GCC 11 to see if that makes a difference?
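
If you want to try that without changing the system default compiler, something along these lines may work. It is only a sketch, not what our CI does: it assumes GCC 11 is installed under the names gcc-11 / g++-11, and that the build picks up CC, CXX and CUDAHOSTCXX (CMake reads these on a fresh configure). first_available is a hypothetical helper written for this example.

```shell
# Hypothetical helper: print the first command from the argument list found on PATH.
first_available() {
  for c in "$@"; do
    if command -v "$c" >/dev/null 2>&1; then
      printf '%s\n' "$c"
      return 0
    fi
  done
  return 1
}

# Assumption: GCC 11 is installed as gcc-11 / g++-11 (Debian/Ubuntu naming).
if cxx=$(first_available g++-11); then
  export CC=gcc-11 CXX="$cxx"
  # CMake uses CUDAHOSTCXX on the first configure to choose nvcc's host compiler.
  export CUDAHOSTCXX="$cxx"
  echo "using host compiler: $cxx"
else
  echo "g++-11 not found; install it or adjust the name above" >&2
fi

# Then rebuild from a clean tree so CMake re-detects the compiler:
#   rm -rf build && python setup.py build -j 4
```

The clean rebuild matters because CMake caches the compiler choice in the existing build directory.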

May 09 '23 21:05 ptrendx

It's a known bug. pybind11 has a fix for it, but it hasn't been officially merged yet; the patch is available at https://github.com/pybind/pybind11/pull/4893
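
To see which pybind11 is bundled with a given torch install, so it can be compared against whichever pybind11 release ends up carrying the fix, a small check like the one below may help. Treat it as a sketch: pybind11_version is a hypothetical helper, and it assumes the version macros are defined in pybind11/detail/common.h, which is where recent pybind11 versions keep them.

```shell
# Hypothetical helper: print the pybind11 version found under an include directory.
pybind11_version() {
  hdr="$1/pybind11/detail/common.h"
  if [ ! -f "$hdr" ]; then
    echo "not found: $hdr" >&2
    return 1
  fi
  # Keep only the value that follows each PYBIND11_VERSION_* macro.
  major=$(sed -n 's/^#define PYBIND11_VERSION_MAJOR[[:space:]]*//p' "$hdr")
  minor=$(sed -n 's/^#define PYBIND11_VERSION_MINOR[[:space:]]*//p' "$hdr")
  patch=$(sed -n 's/^#define PYBIND11_VERSION_PATCH[[:space:]]*//p' "$hdr")
  echo "$major.$minor.$patch"
}

# Usage, with the include path taken from the error message in the original post:
#   pybind11_version /home/spyroot/miniconda3/envs/test/lib/python3.10/site-packages/torch/include
```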

Nov 18 '23 22:11 NeedsMoar

Closing based on the previous comment - the pybind11 PR has since been merged.

May 16 '24 16:05 ptrendx