
[CUDA] __constant__ memory declaration doesn't work

Open SimeonEhrig opened this issue 3 years ago • 34 comments

Problem

The following code doesn't work in CUDA mode:

[cling]$ __constant__ int array[1024];

The following error is thrown:

input_line_3:2:2: error: __constant__ variables must be global
 __constant__ int array[1024];
 ^
/usr/local/cuda-8.0/include/host_defines.h:195:22: note: expanded from macro '__constant__'
        __location__(constant)

Possible cause

The bug is probably inside the Cling instance. Maybe the following source code is JITed internally:

void __cling_Un1Qu31(void* vpClingValue) {
 __constant__ int array[1024];
;
}
__constant__ int array[1024];

SimeonEhrig avatar Mar 10 '21 17:03 SimeonEhrig

We need to declare the __constant__ variable as a global variable. But it is declared in the __cling_Un1Qu31 function, which makes it a local variable.

void __cling_Un1Qu31(void* vpClingValue) { __constant__ int array[1024]; ; }

So simply declaring it outside the function will fix this issue:

__constant__ int array[1024];

void __cling_Un1Qu31(void* vpClingValue) {

}

This fixes #396.

rohit11544 avatar Mar 10 '21 23:03 rohit11544

@rohit11544 I believe you misunderstood something. What I declare as a user is a global variable:

// declaration in global space
// everything is in global C++ space until you open a scope
[cling]$ __constant__ int array[1024];
// declaration in local space, which is what I have not done
[cling]$ {
[cling]$ ?   __constant__ int array[1024];
[cling]$ }

Cling has a mechanism which wraps statements. This allows using statements in global space which are normally not allowed in C++. For example:

[cling]$ void foo(){}
// Function calls in global space are forbidden in C++. In Cling, this is allowed.
[cling]$ foo();

For variables, this mechanism needs to be more clever, so that you can use variables across different statements.

// this would not work if every statement were simply wrapped up
[cling]$ int i = 1;
[cling]$ ++i;

For this there is a mechanism that brings the variable definition from the wrapper back to the global space. My guess is that this process does not work cleanly and the local variable declaration is left behind.
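To illustrate, here is a sketch of what this extraction conceptually does for the int i example above (the wrapper name is a placeholder; the real generated names look like __cling_Un1Qu30):

// what naive wrapping would produce: i is local and invisible to the next statement
void __cling_wrapper(void* vpClingValue) {
 int i = 1;
;
}

// what the extraction turns it into: the declaration is hoisted, so later wrappers can use i
int i = 1;
void __cling_wrapper(void* vpClingValue) {
;
}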

SimeonEhrig avatar Mar 11 '21 08:03 SimeonEhrig

@SimeonEhrig Sorry, that was my mistake. I thought that __constant__ was inside a function like below:

void __cling_Un1Qu31(void* vpClingValue) {
// here the constant is inside this function, so I misunderstood it
__constant__ int array[1024];
;
}

Thank you for the information, Sir.

rohit11544 avatar Mar 11 '21 09:03 rohit11544

@SimeonEhrig Sorry, that was my mistake. I thought that __constant__ was inside a function like below:

void __cling_Un1Qu31(void* vpClingValue) {
// here the constant is inside this function, so I misunderstood it
__constant__ int array[1024];
;
}

Thank you for the information, Sir.

No problem. The function name __cling_Un1Qu31 is typical for the wrapper mechanism, which is why I didn't add a hint that it is generated by Cling. For new developers this can be confusing.

SimeonEhrig avatar Mar 11 '21 10:03 SimeonEhrig

I would like to work on this issue.

sudo-panda avatar Mar 23 '21 15:03 sudo-panda

@SimeonEhrig could you tell me how I get this output when running cling:

void __cling_Un1Qu31(void* vpClingValue) {
 __constant__ int array[1024];
;
}
__constant__ int array[1024];

sudo-panda avatar Mar 23 '21 19:03 sudo-panda

@SimeonEhrig could you tell me how I get this output when running cling:

void __cling_Un1Qu31(void* vpClingValue) {
 __constant__ int array[1024];
;
}
__constant__ int array[1024];

Unfortunately, it has been some time since I opened this issue (I copied it from my fork). Therefore I'm not sure, but I think this code is just an assumption of what is going on internally.

But I can give you an entry point for debugging. You can add llvm::outs() << input << "\n"; at the beginning of this function. It will print the transformed C++ code:

****************** CLING ******************
* Type C++ code and press enter to run it *
*             Type .q to exit             *
*******************************************
[cling]$ __constant__ int array[1024];
void __cling_Un1Qu30(void* vpClingValue) {
 __constant__ int array[1024];
;
}
//...

The key class is DeclExtractor, because it is supposed to convert the code internally into something like this:

 __constant__ int array[1024];
void __cling_Un1Qu30(void* vpClingValue) {

;
}

Attention: the DeclExtractor is an AST transformer, which means it operates on the AST and not on the source code. To get (partial) C++ code back during debugging, you need to use the dump or print functions.
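For example, a minimal sketch of those debugging helpers, assuming you have a clang::Decl pointer D at hand inside the transformer:

#include "clang/AST/Decl.h"
#include "llvm/Support/raw_ostream.h"

// dump the declaration's AST subtree to stderr, then pretty-print it
// back as (approximate) C++ source
void debugDecl(const clang::Decl *D) {
  D->dump();
  D->print(llvm::outs());
}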

SimeonEhrig avatar Mar 24 '21 17:03 SimeonEhrig

PR #402 only fixes the problem that the __constant__ attribute can be parsed. The following example still does not work:

#include <iostream>

__constant__ int cA[1];

__global__ void g(int *out){
  *out = cA[0];
}

int A[] = {42};
int *cA_ptr;
cudaGetSymbolAddress((void **)&cA_ptr, cA);
cudaMemcpy(cA_ptr, A, sizeof(int), cudaMemcpyHostToDevice);

int host = -1;
int *device;

cudaMalloc((void**)&device, sizeof(int));
g<<<1,1>>>(device);
cudaMemcpy(&host, device, sizeof(int), cudaMemcpyDeviceToHost);

// should print 42
std::cout << host << std::endl;

The result is 0 and not 42. I think the problem is that each instruction lives in a separate translation unit: classic CUDA C++ shows the same behavior when __constant__ memory is used in different TUs without separate compilation, as in this Stack Overflow post.

I spent some time on this and was able to reproduce the problem with nvcc when I did not enable separate compilation.
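For reference, a minimal sketch of that classic pitfall (file names are hypothetical): without relocatable device code, every TU that sees the definition gets its own private copy of the __constant__ variable.

// const.cuh - hypothetical shared header
__constant__ int cA[1]; // without -rdc=true, each including TU gets its own copy

// write.cu - fills this TU's copy of cA
#include "const.cuh"
void fill() {
  int A[] = {42};
  cudaMemcpyToSymbol(cA, A, sizeof(int));
}

// read.cu - the kernel reads this TU's copy, which was never written: 0, not 42
#include "const.cuh"
__global__ void g(int *out) { *out = cA[0]; }

With nvcc, compiling with --relocatable-device-code=true (-rdc=true) and marking the declaration extern __constant__ in all but the defining TU turns it into a single shared symbol.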

One solution to fix the problem could be to enable "separate compilation" in Cling. ~~Unfortunately I didn't find an option for it and I don't think the feature is implemented yet. There is an old LLVM post asking for the feature, unsuccessfully, and in CMake it is also not supported for Clang as the CUDA compiler.~~

Thanks to @psychocoderHPC, who found the separate compilation option in Clang: https://clang.llvm.org/docs/ClangCommandLineReference.html#cmdoption-clang-fgpu-rdc

Maybe there is another way to make the variable visible in another TU. @Axel-Naumann @vgvassilev Do you know a mechanism we could use to link the TUs? Maybe linking the LLVM modules?

This problem could be the same as in issue #395.
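For the LLVM-module idea, a minimal sketch of what the llvm::Linker API offers (just an illustration under the assumption that the TUs are available as llvm::Module instances, not a worked-out design):

#include <memory>
#include "llvm/IR/Module.h"
#include "llvm/Linker/Linker.h"

// merge the module of a newly parsed TU into a destination module;
// returns true on error, and symbols from src become visible in dest
bool mergeModules(llvm::Module &dest, std::unique_ptr<llvm::Module> src) {
  return llvm::Linker::linkModules(dest, std::move(src));
}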

SimeonEhrig avatar Apr 15 '21 19:04 SimeonEhrig

Sorry for the late reply. I will look into it.

sudo-panda avatar Apr 16 '21 19:04 sudo-panda

I tried passing -fgpu-rdc to the CUDA compiler, but it still fails.

sudo-panda avatar Apr 17 '21 19:04 sudo-panda

CC @ioanaif @hahnjo

vgvassilev avatar Apr 17 '21 19:04 vgvassilev

I tried passing -fgpu-rdc to the CUDA compiler, but it still fails.

I think the solution is more complicated than simply passing the flag. Right now I'm trying to compile a static application with Clang/LLVM to understand what the compiler actually does, but it still fails. Nevertheless, I have already gathered some interesting information.

SimeonEhrig avatar Apr 18 '21 09:04 SimeonEhrig

Hi there, I implemented -fcuda-rdc in 2018 (meanwhile renamed to -fgpu-rdc), here are the main two reviews: https://reviews.llvm.org/D42921 and https://reviews.llvm.org/D42922 For the second review that implements the registration, I actually uploaded the toy library that I tested, and it still seems to work (if you get clang to find and accept your installed CUDA version...)

The important point to notice is that you still need to link with nvcc because I didn't spend time figuring out how nvlink needs to be called (the tool nvlink, not the interconnect; who gets to choose names at Nvidia?!). So for integrating relocatable device code into Cling, I think you'll first need to reverse-engineer what magic happens at link time...

hahnjo avatar Apr 19 '21 09:04 hahnjo

Hi there, I implemented -fcuda-rdc in 2018 (meanwhile renamed to -fgpu-rdc), here are the main two reviews: https://reviews.llvm.org/D42921 and https://reviews.llvm.org/D42922 For the second review that implements the registration, I actually uploaded the toy library that I tested, and it still seems to work (if you get clang to find and accept your installed CUDA version...)

Thank you very much for the information. That's really helpful. I have one question: if I read the test cases correctly, external __constant__ var and external __device__ var are supported, or am I wrong? I ask because I'm having problems with code I compile with Clang. However, there is a chance that I am doing something wrong.

The important point to notice is that you still need to link with nvcc because I didn't spend time figuring out how nvlink needs to be called (the tool nvlink, not the interconnect; who gets to choose names at Nvidia?!). So for integrating relocatable device code into Cling, I think you'll first need to reverse-engineer what magic happens at link time...

As a first step, I think I will write the PTX code to a file and use nvlink via an external function call. Then let's see how I can solve the problem. Maybe I can find something in the CUDA headers, or I can use the driver API. If not, then I'll have to do the reverse engineering. And yes, 'nvlink' is a stupid name in this case. Someone at Nvidia must not know about search engines :thinking:
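For the driver API route, a minimal sketch using the driver's JIT linker (error handling omitted; an initialized CUDA context and the PTX buffers are assumed):

#include <cuda.h>

// link two PTX translation units at runtime and load the result
CUmodule linkPtx(const char *ptx1, size_t len1, const char *ptx2, size_t len2) {
  CUlinkState state;
  cuLinkCreate(0, nullptr, nullptr, &state);
  cuLinkAddData(state, CU_JIT_INPUT_PTX, const_cast<char *>(ptx1), len1,
                "tu1.ptx", 0, nullptr, nullptr);
  cuLinkAddData(state, CU_JIT_INPUT_PTX, const_cast<char *>(ptx2), len2,
                "tu2.ptx", 0, nullptr, nullptr);

  void *cubin;
  size_t cubinSize;
  cuLinkComplete(state, &cubin, &cubinSize); // cubin is owned by state

  CUmodule module;
  cuModuleLoadData(&module, cubin); // load before destroying the state
  cuLinkDestroy(state);
  return module;
}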

SimeonEhrig avatar Apr 19 '21 19:04 SimeonEhrig

@SimeonEhrig

If I read the test cases correctly, external __constant__ var and external __device__ var are supported, or am I wrong?

If you're referring to clang/test/CodeGenCUDA/device-stub.cu, I think this only tests that Clang thinks it's correctly registering. I'm not aware of tests that actually exercise this code from a runtime perspective, like really executing on the GPU.

However, it seems to work as far as I can tell. I put together a small toy application that defines a __constant__ in one TU and a __global__ kernel in a second, which is called from a third file. I'm attaching it here for reference: gpu-rdc-external.tar.gz This gives:

The answer is 42 :)
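The archive itself is not reproduced here, but based on the description and the build log further down, such a layout presumably looks roughly like this sketch (symbol and file names taken from that log):

// external.cu - defines the __constant__ variable in its own TU
__constant__ int Answer;

// kernel.cu - a second TU reads it from a kernel
extern __constant__ int Answer;
__global__ void kernel(int *out) { *out = Answer; }

// test-rdc.cu - a third TU fills the symbol and launches the kernel
#include <cstdio>
extern __constant__ int Answer;
__global__ void kernel(int *out);

int main() {
  int answer = 42;
  cudaMemcpyToSymbol(Answer, &answer, sizeof(answer));
  int *out_d, out = 0;
  cudaMalloc(&out_d, sizeof(int));
  kernel<<<1, 1>>>(out_d);
  cudaMemcpy(&out, out_d, sizeof(int), cudaMemcpyDeviceToHost);
  printf("The answer is %d :)\n", out);
  return 0;
}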

hahnjo avatar Apr 20 '21 10:04 hahnjo

@SimeonEhrig

If I read the test cases correctly, external __constant__ var and external __device__ var are supported, or am I wrong?

If you're referring to clang/test/CodeGenCUDA/device-stub.cu, I think this only tests that Clang thinks it's correctly registering. I'm not aware of tests that actually exercise this code from a runtime perspective, like really executing on the GPU.

However, it seems to work as far as I can tell. I put together a small toy application that defines a __constant__ in one TU and a __global__ kernel in a second, which is called from a third file. I'm attaching it here for reference: gpu-rdc-external.tar.gz This gives:

The answer is 42 :)

Thanks for the example. Unfortunately, it does not compile and fails with the same error as my code:

clang -fcuda-rdc --cuda-gpu-arch=sm_60 --no-cuda-version-check -Wno-unknown-cuda-version -c test-rdc.cu -o test-rdc.o
clang -fcuda-rdc --cuda-gpu-arch=sm_60 --no-cuda-version-check -Wno-unknown-cuda-version -c external.cu -o external.o
clang -fcuda-rdc --cuda-gpu-arch=sm_60 --no-cuda-version-check -Wno-unknown-cuda-version -c kernel.cu -o kernel.o
nvcc --gpu-architecture sm_60 test-rdc.o external.o kernel.o -o test-rdc
/opt/spack/opt/spack/linux-linuxmint20-zen2/gcc-9.3.0/binutils-2.35.1-v5wl526ejtqmfmsv73f6ovfwkv2tln4j/bin/ld: test-rdc.o: warning: relocation in read-only section `.text'
/opt/spack/opt/spack/linux-linuxmint20-zen2/gcc-9.3.0/binutils-2.35.1-v5wl526ejtqmfmsv73f6ovfwkv2tln4j/bin/ld: test-rdc.o: in function `main':
test-rdc.cu:(.text+0x20): undefined reference to `Answer'
/opt/spack/opt/spack/linux-linuxmint20-zen2/gcc-9.3.0/binutils-2.35.1-v5wl526ejtqmfmsv73f6ovfwkv2tln4j/bin/ld: warning: creating DT_TEXTREL in a PIE
collect2: error: ld returned 1 exit status
make: *** [Makefile:10: test-rdc] Error 1

I tested it on 3 different systems:

  • Clang 11 + CUDA 10.1
  • Clang 12 + CUDA 10.1 or CUDA 11.2
  • CUDA 11.2 docker container with Clang 11

SimeonEhrig avatar Apr 20 '21 14:04 SimeonEhrig

Sorry, can't help there; my compilation is on a standard CentOS 8. But it adds another reason why I don't like Spack...

hahnjo avatar Apr 20 '21 15:04 hahnjo

I tested it in a container to rule out Spack. All software in the container is installed via apt. But all my systems are Ubuntu based. I will try it again with a CentOS 8 container.

SimeonEhrig avatar Apr 20 '21 15:04 SimeonEhrig

I tried it with a CentOS 8 container, because I have no access to a bare metal CentOS 8 system, but I got the same error:

$ docker run --gpus=all -it nvidia/cuda:10.1-devel-centos8 bash
$ yum install llvm-toolset
# ...
$ make
clang -fcuda-rdc --cuda-gpu-arch=sm_60 --no-cuda-version-check -Wno-unknown-cuda-version -c test-rdc.cu -o test-rdc.o
clang -fcuda-rdc --cuda-gpu-arch=sm_60 --no-cuda-version-check -Wno-unknown-cuda-version -c external.cu -o external.o
clang -fcuda-rdc --cuda-gpu-arch=sm_60 --no-cuda-version-check -Wno-unknown-cuda-version -c kernel.cu -o kernel.o
nvcc --gpu-architecture sm_60 test-rdc.o external.o kernel.o -o test-rdc
test-rdc.o: In function `main':
test-rdc.cu:(.text+0x20): undefined reference to `Answer'
collect2: error: ld returned 1 exit status
make: *** [Makefile:10: test-rdc] Error 1

Used software:

  • CUDA 10.1.243
  • Clang 10.0.1

SimeonEhrig avatar Apr 20 '21 17:04 SimeonEhrig

Hm, that's complaining on the host, isn't it? I just double-checked the API reference; could you try

cudaMemcpyToSymbol(&Answer, &answer, sizeof(answer));

instead, taking the address of Answer? My CUDA 11.2 has a convenience function that takes a reference as the first argument and passes its address on to the real function; maybe that wasn't available in CUDA 10.1?

hahnjo avatar Apr 20 '21 20:04 hahnjo

I changed the line cudaMemcpyToSymbol(Answer, &answer, sizeof(answer)); to cudaMemcpyToSymbol(&Answer, &answer, sizeof(answer));, but the undefined reference error persists. I used CUDA 11.2 and Clang 12.0.0.

SimeonEhrig avatar Apr 26 '21 19:04 SimeonEhrig

@hahnjo Can you please send me the software versions you used, so that I can reproduce it exactly:

  • CUDA version
  • Clang version
  • libc++ or libstdc++ and which version
  • OS version

SimeonEhrig avatar Apr 29 '21 07:04 SimeonEhrig

@SimeonEhrig I'm on CentOS 8.3 with CUDA 11.2 installed from Nvidia's repository. I'm using clang-13 from a few weeks ago; the clang-10 from the package system doesn't know about that version of CUDA. I'm using the libstdc++ from the system GCC 8.3.1.

hahnjo avatar Apr 29 '21 07:04 hahnjo

@SimeonEhrig I'm on CentOS 8.3 with CUDA 11.2 installed from Nvidia's repository. I'm using clang-13 from a few weeks ago; the clang-10 from the package system doesn't know about that version of CUDA. I'm using the libstdc++ from the system GCC 8.3.1.

With your settings, it works. Now I have to find out why it works there and not on my bare metal system. I don't believe it has anything to do with CentOS 8. I believe something changed since Clang 12, or it has something to do with the libstdc++ (not related to this issue, but we saw a lot of problems with Clang, CUDA and the libstdc++ of GCC 10.3).

SimeonEhrig avatar May 03 '21 07:05 SimeonEhrig

Indeed, I can reproduce this with Clang 12.0 and CUDA 11.2. The winner is https://github.com/llvm/llvm-project/commit/b008ea304d438f0aa818918caceb3bd864412304#diff-4d89d3c6d6aa954d6d77426520d881efefc4eb631dbe25730c4599c0b5150ec6, after the refactoring in https://github.com/llvm/llvm-project/commit/0b2af1a2889423bb797856841ac81cf10d01c696. For this particular example with only -fgpu-rdc, commenting out this single line solves the issue:

--- clang/lib/CodeGen/CodeGenModule.cpp.orig    2021-05-03 10:46:31.107160353 +0200
+++ clang/lib/CodeGen/CodeGenModule.cpp 2021-05-03 10:46:35.094206320 +0200
@@ -4302,7 +4302,7 @@
       // be internal in order to prevent name conflicts with global
       // host variables with the same name in a different TUs.
       if (D->hasAttr<CUDADeviceAttr>() || D->hasAttr<CUDAConstantAttr>()) {
-        Linkage = llvm::GlobalValue::InternalLinkage;
+        // Linkage = llvm::GlobalValue::InternalLinkage;
         // Shadow variables and their properties must be registered with CUDA
         // runtime. Skip Extern global variables, which will be registered in
         // the TU where they are defined.

hahnjo avatar May 03 '21 08:05 hahnjo

Indeed, I can reproduce this with Clang 12.0 and CUDA 11.2. The winner is llvm/llvm-project@b008ea3#diff-4d89d3c6d6aa954d6d77426520d881efefc4eb631dbe25730c4599c0b5150ec6, after the refactoring in llvm/llvm-project@0b2af1a. For this particular example with only -fgpu-rdc, commenting out this single line solves the issue:

--- clang/lib/CodeGen/CodeGenModule.cpp.orig    2021-05-03 10:46:31.107160353 +0200
+++ clang/lib/CodeGen/CodeGenModule.cpp 2021-05-03 10:46:35.094206320 +0200
@@ -4302,7 +4302,7 @@
       // be internal in order to prevent name conflicts with global
       // host variables with the same name in a different TUs.
       if (D->hasAttr<CUDADeviceAttr>() || D->hasAttr<CUDAConstantAttr>()) {
-        Linkage = llvm::GlobalValue::InternalLinkage;
+        // Linkage = llvm::GlobalValue::InternalLinkage;
         // Shadow variables and their properties must be registered with CUDA
         // runtime. Skip Extern global variables, which will be registered in
         // the TU where they are defined.

Thanks for tracking down the bug. I had also considered this type of bug to be the most likely. Therefore, we need to backport the fix to Cling.

SimeonEhrig avatar May 03 '21 08:05 SimeonEhrig

@hahnjo, @SimeonEhrig, what would be a way forward? Is there something to be fixed/backported from clang mainline?

vgvassilev avatar May 03 '21 13:05 vgvassilev

IMHO we would have a very hard time backporting from current main to LLVM 9: that particular area of Clang has been developing very rapidly since AMD started adding their HIP support. As a particular example, take the refactoring that I linked to above...

hahnjo avatar May 03 '21 13:05 hahnjo

OK, so I guess the guidance is to wait for the next upgrade, to LLVM 12?

vgvassilev avatar May 03 '21 14:05 vgvassilev

LLVM 13. The fix is in the current development branch. @vgvassilev What do you think, which is the next version we want to upgrade to? If I understand it correctly, we strongly depend on the new OrcJIT v2 backend.

SimeonEhrig avatar May 03 '21 14:05 SimeonEhrig