fancyIX
It seems that even the sub_group functions are used only for Intel, not for AMD. I mean https://bashbaug.github.io/OpenCL-Docs/html/OpenCL_Ext.html
> Is this related to the subgroup shuffling, which is already implemented for NVIDIA and Intel but not used for AMD?

Partially, yes. Also, there is a lot of LDS reading/writing....
Refer to: https://gpuopen.com/learn/amd-gcn-assembly-cross-lane-operations/
@CNugteren if we only replace the LDS reads/writes without changing the logic much, I am not sure that will improve performance a lot. It seems that "invert" and "transpose" can be improved a lot....
Can "shuffle" be applied to any OpenCL kernel? Is there a candidate kernel to investigate?
Found this article interesting: https://cnugteren.github.io/tutorial/pages/page10.html I am not sure where the shfl logic is in the current implementation. Maybe here: https://github.com/CNugteren/CLBlast/blob/bcd294a93ad0dffbace51103215b1346ec3956df/src/kernels/level3/xgemm_part3.opencl#L240 It seems we can replace it with AMD OpenCL's extension for subgroup...
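To make the idea of shuffling concrete, here is a minimal host-side sketch in plain C. It is not CLBlast kernel code and not any vendor's API: `lane_shuffle`, `subgroup_row_times_matrix`, and the width `LANES` are hypothetical names chosen for illustration. It assumes each lane of a subgroup keeps one value of A in a register and reads the other lanes' values directly instead of going through local memory (LDS).

```c
#include <stddef.h>

#define LANES 8  /* assumed subgroup width, for illustration only */

/* Hypothetical model of a subgroup shuffle: return the register value
   held by lane src_lane. On a GPU this would be a cross-lane instruction. */
static float lane_shuffle(const float regs[LANES], size_t src_lane) {
  return regs[src_lane % LANES];
}

/* Lane `lane` owns a_regs[lane] and column `lane` of b. It accumulates
   sum_k a[k] * b[k][lane], fetching a[k] from lane k via the shuffle
   rather than from a shared scratch buffer. */
static void subgroup_row_times_matrix(const float a_regs[LANES],
                                      const float b[LANES][LANES],
                                      float c_out[LANES]) {
  for (size_t lane = 0; lane < LANES; ++lane) {
    float acc = 0.0f;
    for (size_t k = 0; k < LANES; ++k) {
      acc += lane_shuffle(a_regs, k) * b[k][lane];
    }
    c_out[lane] = acc;
  }
}
```

On real hardware the outer loop over lanes runs in parallel and the shuffle replaces a local-memory write, barrier, and read; whether that is faster depends on the device and the kernel.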
@CNugteren while working on a PR that uses cross-lane instructions for subgroup shuffling, I have a question: https://github.com/CNugteren/CLBlast/blob/bcd294a93ad0dffbace51103215b1346ec3956df/src/kernels/level3/xgemm_part3.opencl#L47C3-L47C3 Here it seems that the instruction shfl.sync.idx.b32 only works...
The current AMD PR doesn't work with 64-bit precision, where a double needs two registers. I will change the PR. But I still don't know if the current NVIDIA implementation works....
@tyler-utah what do you think? https://github.com/CNugteren/CLBlast/blob/bcd294a93ad0dffbace51103215b1346ec3956df/src/kernels/level3/xgemm_part3.opencl#L47C3-L47C3 It only uses one instruction with a 32-bit operand. How is that supposed to work with 64-bit precision or N greater than 2?
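One common way to move a 64-bit value with a 32-bit-only shuffle is to split it into two 32-bit halves, shuffle each half, and recombine. The sketch below simulates that on the host in plain C. It is an illustration under assumptions, not the CLBlast or PR implementation: `shuffle32` is a hypothetical stand-in for a 32-bit cross-lane shuffle such as `shfl.sync.idx.b32`, and `SUBGROUP_SIZE` is an assumed lane count.

```c
#include <stdint.h>
#include <string.h>

#define SUBGROUP_SIZE 32  /* assumed lane count, for illustration only */

/* Hypothetical stand-in for a 32-bit cross-lane shuffle: read the
   32-bit register value held by lane src_lane. */
static uint32_t shuffle32(const uint32_t lanes[SUBGROUP_SIZE],
                          unsigned src_lane) {
  return lanes[src_lane % SUBGROUP_SIZE];
}

/* Shuffle a double by moving its low and high 32-bit halves separately,
   i.e. two 32-bit shuffles per 64-bit value. */
static double shuffle64(const double lanes[SUBGROUP_SIZE],
                        unsigned src_lane) {
  uint32_t lo[SUBGROUP_SIZE];
  uint32_t hi[SUBGROUP_SIZE];
  for (int i = 0; i < SUBGROUP_SIZE; ++i) {
    uint64_t bits;
    memcpy(&bits, &lanes[i], sizeof bits);  /* reinterpret without UB */
    lo[i] = (uint32_t)(bits & 0xFFFFFFFFu);
    hi[i] = (uint32_t)(bits >> 32);
  }
  uint64_t out = ((uint64_t)shuffle32(hi, src_lane) << 32) |
                 (uint64_t)shuffle32(lo, src_lane);
  double result;
  memcpy(&result, &out, sizeof result);
  return result;
}
```

A single 32-bit shuffle would carry only one of the two halves, which is why a double (or a wider vector type) needs one shuffle per 32-bit register it occupies.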
@CNugteren the code is ready for review. I have run the xgemm tuner on a 6900 XT. Performance improvement in round 3: With change: * Found best result 1.12 ms: 1913.8 GFLOPS * Best...