[BUG]: Build errors from tensormap_replace.h due to `NV_IF_ELSE_TARGET(NV_HAS_FEATURE_SM_90a, ...)`
Is this a duplicate?
- [X] I confirmed there appear to be no duplicate issues for this bug and that I agree to the Code of Conduct
Type of Bug
Compile-time Error
Component
libcu++
Describe the bug
When using CCCL from the main branch, I'm hitting compile errors such as these:
/home/lcw/micromamba/envs/myenv/include/cuda/std/detail/libcxx/include/__cuda/ptx/instructions/tensormap_replace.h(56): error: expected an expression
{ _NV_IF__NV_TARGET_BOOL_NV_HAS_FEATURE_SM_90a(( asm ( "tensormap.replace.tile.global_address.global.b1024.b64 [%0], %1;" : : "l"(__as_ptr_gmem(__tm_addr)), "l"(__as_b64(__new_val)) : "memory" ); ), ( __cuda_ptx_tensormap_replace_global_address_is_not_supported_before_SM_90a__(); )) }
^
/home/lcw/micromamba/envs/myenv/include/cuda/std/detail/libcxx/include/__cuda/ptx/instructions/tensormap_replace.h(56): error: expected a ")"
{ _NV_IF__NV_TARGET_BOOL_NV_HAS_FEATURE_SM_90a(( asm ( "tensormap.replace.tile.global_address.global.b1024.b64 [%0], %1;" : : "l"(__as_ptr_gmem(__tm_addr)), "l"(__as_b64(__new_val)) : "memory" ); ), ( __cuda_ptx_tensormap_replace_global_address_is_not_supported_before_SM_90a__(); )) }
^
/home/lcw/micromamba/envs/myenv/include/cuda/std/detail/libcxx/include/__cuda/ptx/instructions/tensormap_replace.h(56): error: identifier "_NV_IF__NV_TARGET_BOOL_NV_HAS_FEATURE_SM_90a" is undefined
{ _NV_IF__NV_TARGET_BOOL_NV_HAS_FEATURE_SM_90a(( asm ( "tensormap.replace.tile.global_address.global.b1024.b64 [%0], %1;" : : "l"(__as_ptr_gmem(__tm_addr)), "l"(__as_b64(__new_val)) : "memory" ); ), ( __cuda_ptx_tensormap_replace_global_address_is_not_supported_before_SM_90a__(); )) }
^
/home/lcw/micromamba/envs/myenv/include/cuda/std/detail/libcxx/include/__cuda/ptx/instructions/tensormap_replace.h(56): error: expected an expression
{ _NV_IF__NV_TARGET_BOOL_NV_HAS_FEATURE_SM_90a(( asm ( "tensormap.replace.tile.global_address.global.b1024.b64 [%0], %1;" : : "l"(__as_ptr_gmem(__tm_addr)), "l"(__as_b64(__new_val)) : "memory" ); ), ( __cuda_ptx_tensormap_replace_global_address_is_not_supported_before_SM_90a__(); )) }
^
/home/lcw/micromamba/envs/myenv/include/cuda/std/detail/libcxx/include/__cuda/ptx/instructions/tensormap_replace.h(56): error: expected a ";"
{ _NV_IF__NV_TARGET_BOOL_NV_HAS_FEATURE_SM_90a(( asm ( "tensormap.replace.tile.global_address.global.b1024.b64 [%0], %1;" : : "l"(__as_ptr_gmem(__tm_addr)), "l"(__as_b64(__new_val)) : "memory" ); ), ( __cuda_ptx_tensormap_replace_global_address_is_not_supported_before_SM_90a__(); )) }
^
How to Reproduce
- Copy the include directory of libcudacxx (main branch) into my conda env
- Write a file that has:
  #include <cuda/ptx>
- Build it using a command like:
nvcc --generate-dependencies-with-compile --dependency-output build/my_kernel.o.d -I/home/lcw/micromamba/envs/myenv/lib/python3.10/site-packages/torch/include -I/home/lcw/micromamba/envs/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/lcw/micromamba/envs/myenv/lib/python3.10/site-packages/torch/include/TH -I/home/lcw/micromamba/envs/myenv/lib/python3.10/site-packages/torch/include/THC -I/home/lcw/micromamba/envs/myenv/include -I/data/home/lcw/micromamba/envs/dino2/include/python3.10 -c -c my_kernel.cu -o build/my_kernel.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DNDEBUG -O3 -lineinfo -std=c++20 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -U__CUDA_NO_BFLOAT16_CONVERSIONS__ --expt-relaxed-constexpr --expt-extended-lambda --use_fast_math --ptxas-options=-v -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=my_kernel -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_90a,code=sm_90a -ccbin /home/lcw/micromamba/envs/myenv/bin/x86_64-conda-linux-gnu-cc
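For reference, here is a minimal sketch of what my_kernel.cu can look like (the kernel body is a hypothetical placeholder, not the original code); merely including <cuda/ptx> while compiling for sm_90a is enough to trigger the errors, since they occur while parsing the header itself:

// Minimal reproducer sketch; my_kernel is a placeholder. The errors
// appear while the header is parsed, so no cuda::ptx call is needed.
#include <cuda/ptx>

__global__ void my_kernel() {}

int main()
{
  my_kernel<<<1, 1>>>();
  return cudaDeviceSynchronize() == cudaSuccess ? 0 : 1;
}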
Expected behavior
It builds.
Reproduction link
No response
Operating System
Ubuntu 20.04.1
nvidia-smi output
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12 Driver Version: 535.104.12 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA H100 80GB HBM3 On | 00000000:53:00.0 Off | 0 |
| N/A 25C P0 66W / 700W | 2MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA H100 80GB HBM3 On | 00000000:64:00.0 Off | 0 |
| N/A 26C P0 66W / 700W | 2MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA H100 80GB HBM3 On | 00000000:75:00.0 Off | 0 |
| N/A 26C P0 65W / 700W | 2MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA H100 80GB HBM3 On | 00000000:86:00.0 Off | 0 |
| N/A 27C P0 65W / 700W | 2MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA H100 80GB HBM3 On | 00000000:97:00.0 Off | 0 |
| N/A 27C P0 66W / 700W | 2MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA H100 80GB HBM3 On | 00000000:A8:00.0 Off | 0 |
| N/A 25C P0 68W / 700W | 2MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA H100 80GB HBM3 On | 00000000:B9:00.0 Off | 0 |
| N/A 25C P0 64W / 700W | 2MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA H100 80GB HBM3 On | 00000000:CA:00.0 Off | 0 |
| N/A 25C P0 65W / 700W | 2MiB / 81559MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
NVCC version
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Nov_22_10:17:15_PST_2023
Cuda compilation tools, release 12.3, V12.3.107
Build cuda_12.3.r12.3/compiler.33567101_0
$ x86_64-conda-linux-gnu-cc --version
x86_64-conda-linux-gnu-cc (conda-forge gcc 12.3.0-5) 12.3.0
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Oh, I believe you are seeing a mismatch between the installed driver and the toolkit you are using.
Looking at your nvidia-smi output, you are running with
CUDA Version: 12.2
However, you are compiling with nvcc 12.3:
Cuda compilation tools, release 12.3, V12.3.107
The issue is that we enable features based on what the compiler supports, which is the only information we have at compile time. In this case, the 12.2 driver only supports PTX ISA 8.2, which does not include those instructions.
However, your toolkit provides PTX ISA 8.3, which does include them, and you are compiling for a target that could also make use of them.
Long story short, I believe you need to update the driver on that machine.
@lw Also, to be sure: can you verify that you properly added the nv subfolder from the libcudacxx folder to the include path?
We did make changes to the <nv/target> header that need to be included.
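For context, this is why a stale <nv/target> produces exactly these errors: NV_IF_ELSE_TARGET pastes the expansion of its feature query into the name of a helper macro, so the query must expand to a boolean token first. If the header that defines the query is missing or outdated, the paste yields an undefined identifier, which is the _NV_IF__NV_TARGET_BOOL_NV_HAS_FEATURE_SM_90a you see above. A simplified sketch of the pattern (illustrative MY_* names, not the real <nv/target> internals):

#include <cassert>

#define MY_IF_1(t, f) t // feature available: keep the first branch
#define MY_IF_0(t, f) f // feature unavailable: keep the second branch
// The extra indirection lets `b` expand to 0 or 1 before token pasting.
#define MY_IF_DISPATCH(b, t, f) MY_IF_##b(t, f)
#define MY_IF_ELSE_TARGET(query, t, f) MY_IF_DISPATCH(query, t, f)

// An up-to-date header defines the feature query as 0 or 1:
#define MY_HAS_FEATURE_SM_90a 1

int main()
{
  // Expands to the first branch because the query expands to 1.
  int r = MY_IF_ELSE_TARGET(MY_HAS_FEATURE_SM_90a, 1, 0);
  assert(r == 1);
  // Had MY_HAS_FEATURE_SM_90a been left undefined (a stale header), the
  // paste would produce the nonexistent identifier
  // MY_IF_MY_HAS_FEATURE_SM_90a -- the same failure mode as above.
  return 0;
}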
Driver version isn't the issue here. As @miscco said, there are likely mismatched versions of the nv/ and cuda/ headers, because when I try a simpler reproducer with the entire CCCL library, it works just fine: https://godbolt.org/z/sd9PehKWG
@lw CCCL components are not independent, so vendoring just the libcudacxx/include/cuda headers into your conda environment won't work.
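For illustration, assuming a CCCL checkout at ./cccl and the conda environment prefix in $CONDA_PREFIX (both paths are assumptions, not taken from this report), vendoring would have to copy both trees side by side:
cp -r cccl/libcudacxx/include/cuda "$CONDA_PREFIX/include/"
cp -r cccl/libcudacxx/include/nv "$CONDA_PREFIX/include/"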
@lw Can we close this bug?
I tried again, making sure to copy both the cuda and nv subdirectories, and everything seems to work now. Closing.