how-to-optimize-gemm issues

How to overlap the share2register loads with the computation?

6

I have another question about MMult_cuda_12.cu. Honestly, I don't know how the share2register (shared memory to register) loads are overlapped with the computation. Is it the inline PTX (asm) that makes them run in parallel? The instructions are...

YijiaZhao

question

about ldg32_nc_0

3

https://github.com/tpoisonooo/how-to-optimize-gemm/blob/master/cuda/MMult_cuda_12.cu, lines 20 and 21: I'm a beginner in CUDA and PTX, and I want to know what these two PTX lines are for: "{.reg .pred p;\n" "mov.b32 %0, 0;\n" Is this useless code?

YijiaZhao

A question about measuring peak floating-point performance

2

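![image](https://user-images.githubusercontent.com/47274616/188069802-24f1f559-286b-44c4-8f25-bdcdeeb3f860.png) The chip I'm running on is reported as NVIDIA, ARMv8 Processor rev 0 (v8l). The Zhihu article says that when measuring peak floating-point performance, the number of FMA instructions to lay out = FMA issue width * FMA instruction latency. I couldn't find a manual for this chip, but I did look at the A57 manual, which lists the following: ![image](https://user-images.githubusercontent.com/47274616/188070445-e1f460f6-08ae-4913-ad7a-3cf041402565.png) The FMA latency is 10 and the throughput is 2. I'm not sure whether this throughput means the chip can issue two FMA instructions at the same time (is it the chip that issues them?), but I tested with both 10 FMA instructions (OP_FLOATS = 80) and 20 FMA instructions (OP_FLOATS = 160). With 10 instructions I get 16.095492 GFLOPS, and with 20 I get 18.759214 GFLOPS. What is the reason for this? I have two guesses: 1. 10 FMA instructions really is not the number needed to measure this chip's peak floating-point performance. 2. Perhaps the compiler automatically enabled multithreading? This seems more likely, because going from 4 to 10 instructions nearly doubled the performance, while going from 10 to 20 only increased it a little.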

jeezrick