Dynamic/static shared memory support
I have been evaluating this project recently and was curious about the status of dynamic and static shared memory support. Playing around with some test code, I can’t get the compiled PTX to emit the proper linkage for either.
I’d like to offer my help in getting these features implemented as I would very much like to use them in a project I’m working on.
I'm also interested in accessing shared memory. Here's what I have from a little digging around:
- Dereferencing SharedPointer!T variables emits the proper opcodes in the PTX file: ld.shared and st.shared.
- Dereferencing Shared!(uint[32]) variables or similar yields ld.local and st.local instructions in the PTX, so that's a no-go.
- Static shared variables are, reportedly, declared with .shared directives at the start of PTX files.
- Two PTX special registers, %total_smem_size and %dynamic_smem_size, report the amount of shared memory in play (256-byte granularity with recent compute capability).
I've got a couple of hacks to try, and I'll keep digging, but help is always appreciated.
Sorry I haven’t gotten back until now @bcarneal. Have you made any progress? The current main issue with shared memory support right now lies in LDC. I opened an issue covering what I’ve found over in that repo: https://github.com/ldc-developers/ldc/issues/3499
We need a way to make LDC emit the proper shared linkage outlined in the above issue when accessing either static or dynamic shared mem.
I personally believe having
Shared!(uint[32]) var;
should generate the static linkage, and perhaps something like
extern SharedPointer!T var;
to emit the dynamically linked code (this mimics CUDA).
The dcompute memory address structs like Shared and SharedPointer are special-cased internally in LDC. Last time I was hacking on it, I had trouble getting it to view these structs as their underlying pointer instead of as a value type. It’s been a while since I last looked into this, so my memory is probably a bit rusty.
Very little to add to my earlier post. Currently I'm using per-block scratch areas from appropriately aligned global memory for any cooperative work, so not much rush here.
For programmer-managed cache access I'd try to bring up a 3(?)-liner injected into the nvptx file post-compilation: extern C void* nvptxDynSharedMemBasePointer() or some such. Hopefully it's as simple as knowing how to return a value from the internal register.
The following works if the semantic checker allows string literals:
https://github.com/ldc-developers/ldc/blob/master/gen/semantic-dcompute.cpp#L152
import ldc.llvmasm : __irEx; // LDC inline-IR intrinsic

// Reserves a static shared-memory array of N elements of T under the symbol
// `uniqueName` and returns a SharedPointer!T to its first element.
SharedPointer!T sharedStaticReserve(T : T[N], string uniqueName, size_t N)(){
    void* address = __irEx!(`@`~uniqueName~` = addrspace(3) global [`~Itoa!N~` x `~llvmType!T~`] zeroinitializer, align 4 ;
        %Dummy = type { `~llvmType!T~` addrspace(3)* }
        `, `
        %sharedptr = getelementptr inbounds [`~Itoa!N~` x `~llvmType!T~`], [`~Itoa!N~` x `~llvmType!T~`] addrspace(3)* @`~uniqueName~`, i64 0, i64 0
        %.structliteral = alloca %Dummy, align 8
        %dumptr = getelementptr inbounds %Dummy, %Dummy* %.structliteral, i32 0, i32 0
        store `~llvmType!T~` addrspace(3)* %sharedptr, `~llvmType!T~` addrspace(3)** %dumptr
        %vptr = bitcast %Dummy* %.structliteral to i8*
        ret i8* %vptr
        `, ``, void*)();
    // GEP indices must be integers, so both are i64 (the first was llvmType!T).
    // The result is reinterpreted as SharedPointer!T (was hard-coded to uint).
    return *(cast(SharedPointer!(T)*)address);
}
package:
immutable(string) Digit(size_t n)()
{
    static if(n == 0)
        return 0.stringof;
    else static if(n == 1)
        return 1.stringof;
    else static if(n == 2)
        return 2.stringof;
    else static if(n == 3)
        return 3.stringof;
    else static if(n == 4)
        return 4.stringof;
    else static if(n == 5)
        return 5.stringof;
    else static if(n == 6)
        return 6.stringof;
    else static if(n == 7)
        return 7.stringof;
    else static if(n == 8)
        return 8.stringof;
    else static if(n == 9)
        return 9.stringof;
    else static assert(0);
}
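// Compile-time decimal string for an unsigned value. The parameter is size_t to
// match the size_t N passed in by sharedStaticReserve; the old `n < 0` branch
// could never be taken for an unsigned parameter and has been dropped.
immutable(string) Itoa(size_t n)()
{
    static if (n < 10){
        enum ret = Digit!(n);
        return ret;
    }
    else{
        enum ret = Itoa!(n / 10) ~ Digit!(n % 10);
        return ret;
    }
}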
immutable(string) llvmType(T)()
{
    static if(is(T == float))
        return "float";
    else static if(is(T == double))
        return "double";
    else static if(is(T == byte) || is(T == ubyte) || is(T == void))
        return "i8";
    else static if(is(T == short) || is(T == ushort))
        return "i16";
    else static if(is(T == int) || is(T == uint))
        return "i32";
    else static if(is(T == long) || is(T == ulong))
        return "i64";
    else
        static assert(0,
            "Can't determine llvm type for D type " ~ T.stringof);
}