
[QST]What is stack and heap in GPU for cute? How is "make_tensor" function used in cute?

Open ziyuhuang123 opened this issue 1 year ago • 6 comments

I saw this in a blog post and I am not sure whether it is correct; I am also very interested in detailed make_tensor usage. By the way, among shared memory, global memory, and registers, which is the heap and which is the stack?

Tensor Creation

Stack Object: Requires specifying both type and Layout, where Layout must have a static shape.
Tensor make_tensor<T>(Layout layout);

Heap Object: Requires specifying a pointer and Layout, where Layout can be either dynamic or static.
Tensor make_tensor(Pointer pointer, Layout layout);

Stack Object: The tensor's layout must be static.
Tensor make_tensor_like(Tensor tensor);

Stack Object: The tensor's layout must be static.
Tensor make_fragment_like(Tensor tensor);

Using make_tensor, one can conveniently create Tensors. There are two common ways to construct a Tensor. The first is as a stack object, as in the first form above. The second is as a heap object, created by specifying a pointer into a linear address range together with a Layout that describes, possibly hierarchically, how the data behind that pointer is arranged. The pointer can be produced with make_gmem_ptr or make_smem_ptr. Note that stack objects must have a static layout, while heap objects can have either a dynamic or a static layout; there is no such thing as a dynamic stack tensor. The overall concept is summarized in the table below:
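For concreteness, here is a minimal sketch of the two construction styles inside a device function. The layout shapes, the shared-memory array size, and the kernel name are illustrative choices of mine, not something from the quoted blog:

```cpp
#include <cute/tensor.hpp>
using namespace cute;

__global__ void make_tensor_examples(float* gmem_ptr, int m, int n)
{
  // "Stack"-style owning tensor: element type + fully static layout.
  // The Tensor owns its storage, so in device code it normally lives in registers.
  Tensor rA = make_tensor<float>(make_layout(make_shape(_4{}, _2{})));

  // "Heap"-style non-owning view over global memory: pointer + layout.
  // The layout may be dynamic (runtime m, n) or static.
  Tensor gA = make_tensor(make_gmem_ptr(gmem_ptr),
                          make_layout(make_shape(m, n)));

  // Non-owning view over shared memory via make_smem_ptr.
  __shared__ float smem[16 * 8];
  Tensor sA = make_tensor(make_smem_ptr(smem),
                          make_layout(make_shape(_16{}, _8{})));

  // Owning tensors shaped like an existing tensor; both require the
  // resulting layout/shape to be static.
  Tensor rB = make_tensor_like(rA);
  Tensor rC = make_fragment_like(sA);
}
```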

ziyuhuang123 avatar Dec 05 '23 08:12 ziyuhuang123

which blog is this from? I cannot find this in our docs anywhere.

I don't think the stack/heap concepts from the CPU world translate that directly to GPUs. "Stack" is usually local memory when you spill, and there is no good analog for the heap. Registers are the closest to a "stack" in usage model, since they are allocated for automatic storage. I guess gmem can be considered the heap, but this analogy does not really help much.

thakkarV avatar Dec 05 '23 15:12 thakkarV

Off topic: Just came across this issue (as a github-mancer). Based on your recent questions I assume you want to write GEMM from the ground up. No offense intended, but almost all blogs about CUDA GEMM optimization (and most papers that mention it) are just garbage and provide no value.

There are only 3 outliers:

  • https://github.com/NervanaSystems/maxas/wiki/SGEMM — but it is severely outdated.
  • https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/ — this is the one you should refer to as ground truth. It lacks some of the "why you should do the tiling", but the pipeline design is what you should follow.
  • https://zhuanlan.zhihu.com/p/441146275 — extremely optimized, but still missing the "why you should do the tiling" and only presenting the final result, so a newcomer gets very little out of it; it is too advanced for a complete newbie.

I also have some notes about it at https://cloudhan.github.io/20231104225246.html, but they are not polished at the moment. I have a local version that is a full rewrite with cute, but it is not public.

Back to your question: CUDA (or, more generally, GPU) programming generally doesn't involve the idea of a call stack (or sometimes even loops), because the code is heavily unrolled and inlined by the compiler. cute code can be compiled as CPU or kernel code. On the CPU, Tensor make_tensor<T>(Layout layout); is indeed a stack object. On the GPU, if the layout is a compile-time value, like make_layout(make_shape(_4{}, _1{})), then it will live in registers; the compiler will make this happen unless the registers are spilled to local memory. If the layout is a runtime value, I believe this will cause a compile error.
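To illustrate that last point, here is a small sketch of the static-versus-dynamic distinction; the variable and kernel names are mine, not from the thread. An owning tensor needs a compile-time shape, while a pointer-backed view accepts a runtime shape:

```cpp
#include <cute/tensor.hpp>
using namespace cute;

__global__ void static_vs_dynamic(float* gmem_ptr, int n)
{
  // Fully static layout: the owning tensor's size is known at compile time,
  // so the compiler can keep its elements in registers (unless register
  // pressure forces a spill to local memory).
  Tensor rA = make_tensor<float>(make_layout(make_shape(_4{}, _1{})));

  // A runtime-sized layout is fine for a non-owning view over a pointer...
  Tensor gA = make_tensor(make_gmem_ptr(gmem_ptr),
                          make_layout(make_shape(n)));

  // ...but an owning tensor with a runtime shape will not compile, since the
  // element count must be a compile-time constant:
  // Tensor rBad = make_tensor<float>(make_layout(make_shape(n)));  // error
}
```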

cloudhan avatar Dec 06 '23 04:12 cloudhan

Emmm, thank you, I have read all three blogs you mentioned, but they discuss CUDA cores... I am learning tensor cores, which is why I am reading CUTLASS. ?

ziyuhuang123 avatar Dec 06 '23 06:12 ziyuhuang123

Seems you are a GEMM expert! Thank you for your reply! Have you learned tensor core GEMM or cute recently? How did you learn it? Actually I also have a WeChat group to discuss cute, if you are interested.

ziyuhuang123 avatar Dec 07 '23 02:12 ziyuhuang123

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Jan 06 '24 03:01 github-actions[bot]

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] avatar Apr 05 '24 03:04 github-actions[bot]