fancyIX comments

Results: 27 comments of fancyIX

Is it a good idea to use GCN cross lane instruction for optimization?

It seems that even the sub_group functions are used only for Intel, not for AMD. I mean the ones described here: https://bashbaug.github.io/OpenCL-Docs/html/OpenCL_Ext.html

Is it a good idea to use GCN cross lane instruction for optimization?

> Is this related to the subgroup shuffling, which is already implemented for NVIDIA and Intel but not used for AMD?

Partially yes. Also there are lots of LDS reading/writing....

Is it a good idea to use GCN cross lane instruction for optimization?

Refer to: https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/

Is it a good idea to use GCN cross lane instruction for optimization?

@CNugteren if we only replace the LDS reads/writes without modifying the logic much, I'm not sure that will improve performance a lot. It seems like "invert" and "transpose" can be improved a lot....

Is it a good idea to use GCN cross lane instruction for optimization?

Can "shuffle" be applied to any OpenCL kernel? Is there a candidate kernel to investigate?

Is it a good idea to use GCN cross lane instruction for optimization?

I found this article interesting: https://cnugteren.github.io/tutorial/pages/page10.html I'm not sure what this shfl logic corresponds to in the current implementation. Maybe this: https://github.com/CNugteren/CLBlast/blob/bcd294a93ad0dffbace51103215b1346ec3956df/src/kernels/level3/xgemm_part3.opencl#L240 It seems like we can replace it with AMD OpenCL's extension for subgroup...

Is it a good idea to use GCN cross lane instruction for optimization?

@CNugteren while working on a PR that uses cross-lane instructions for subgroup shuffling, I have a question: https://github.com/CNugteren/CLBlast/blob/bcd294a93ad0dffbace51103215b1346ec3956df/src/kernels/level3/xgemm_part3.opencl#L47C3-L47C3 Here it seems like the instruction shfl.sync.idx.b32 only works...

Is it a good idea to use GCN cross lane instruction for optimization?

The current AMD PR doesn't work with 64-bit precision, where a double needs two registers. I will change the PR. But I still don't know whether the current NVIDIA implementation works....

Is it a good idea to use GCN cross lane instruction for optimization?

@tyler-utah what do you think? https://github.com/CNugteren/CLBlast/blob/bcd294a93ad0dffbace51103215b1346ec3956df/src/kernels/level3/xgemm_part3.opencl#L47C3-L47C3 It only uses one instruction with a 32-bit operand. How is that supposed to work with 64-bit precision or N greater than 2?
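To make the 64-bit question concrete, here is a minimal host-side sketch in plain C (not the CLBlast kernel code and not real GPU intrinsics). It emulates a 32-bit indexed shuffle over a small array of "lanes" and then moves a double by shuffling its low and high 32-bit halves separately, which is one common way to work around a 32-bit-only shuffle. The names `LANES`, `shuffle_b32` and `shuffle_f64` are hypothetical and exist only for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical subgroup width, kept small for illustration only. */
#define LANES 8

/* Emulates a 32-bit indexed shuffle: lane i receives the value held by lane src[i]. */
static void shuffle_b32(const uint32_t in[LANES], const unsigned src[LANES],
                        uint32_t out[LANES]) {
    for (int i = 0; i < LANES; ++i) {
        out[i] = in[src[i] % LANES];
    }
}

/* Shuffles doubles using only the 32-bit shuffle above:
   split each value into two 32-bit halves, shuffle both, then recombine. */
static void shuffle_f64(const double in[LANES], const unsigned src[LANES],
                        double out[LANES]) {
    uint32_t lo[LANES], hi[LANES], lo_out[LANES], hi_out[LANES];

    for (int i = 0; i < LANES; ++i) {
        uint64_t bits;
        memcpy(&bits, &in[i], sizeof bits);      /* reinterpret the double as raw bits */
        lo[i] = (uint32_t)(bits & 0xFFFFFFFFu);  /* first "register" */
        hi[i] = (uint32_t)(bits >> 32);          /* second "register" */
    }

    shuffle_b32(lo, src, lo_out);
    shuffle_b32(hi, src, hi_out);

    for (int i = 0; i < LANES; ++i) {
        uint64_t bits = ((uint64_t)hi_out[i] << 32) | lo_out[i];
        memcpy(&out[i], &bits, sizeof bits);     /* rebuild the double */
    }
}
```

A single 32-bit shuffle per value would move only half of each double, so two shuffles per 64-bit element are needed in this scheme. Whether the linked kernel does something equivalent is exactly the open question here.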

Navi card subgroup shuffle support for gemm

@CNugteren the code is ready for review. I have run the xgemm tuner on a 6900 XT. Performance improvement on round 3:

With change:
* Found best result 1.12 ms: 1913.8 GFLOPS
* Best...