ysh329
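## 6. Lessons on search strategies

The two heuristic algorithms, simulated annealing and particle swarm optimization (PSO), each have their own characteristics. Which one suits a given problem better has to be found by trying both.

**Table: hardware used in the authors' tuning experiments**

The authors' experiments also yield some practical lessons:

1. When the user-defined convolution filter is small, it can be placed in OpenCL constant memory;
2. In the 2D convolution experiment, the performance distribution over the full search space shows that only a very small number of settings perform well. My understanding is that, because the parameters are strongly correlated, good configurations are very sparse across the whole search space;
3. In the 2D convolution experiment, simulated annealing and PSO perform well on some hardware and poorly on others, probably because the search falls into a local optimum and cannot escape it afterwards;
4. In the matrix multiplication experiment, the table of best values for the 7 classes of parameters shows that they differ on essentially every device.

There are a few more findings of this kind, but they are device-specific and do not generalize. Overall, CLTune offers an approach for tuning templated OpenCL kernels for each individual hardware device, carrying forward the generality mindset of heterogeneous computing.

> Hand-writing the common operators and tuning them is indeed cheap. In the long run, though, long-tail operators and operator fusion are too expensive to implement this way. The tuning strategy still needs to be combined with codegen.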
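To make the simulated-annealing side of this concrete, here is a minimal Python sketch of annealing over a small discrete tuning space. It is not CLTune's implementation: the parameter names (`WG_X`, `WG_Y`, `UNROLL`, `VECTOR`), their values, the cooling schedule, and the `synthetic_runtime_ms` cost function are all assumptions made up for illustration. In a real tuner the cost would come from compiling and timing the kernel.

```python
import math
import random

# Hypothetical tuning space: names and values are illustrative only.
SPACE = {
    "WG_X": [8, 16, 32, 64],
    "WG_Y": [1, 2, 4, 8],
    "UNROLL": [1, 2, 4],
    "VECTOR": [1, 2, 4, 8],
}

def synthetic_runtime_ms(cfg):
    """Stand-in for compiling and timing a kernel; NOT a real measurement."""
    # Toy cost: grows with the number of parameters that differ from an
    # arbitrarily chosen "best" setting.
    best = {"WG_X": 32, "WG_Y": 4, "UNROLL": 2, "VECTOR": 4}
    mismatches = sum(cfg[k] != best[k] for k in cfg)
    return 1.0 + 3.0 * mismatches

def neighbour(cfg, rng):
    """Change one randomly chosen parameter to a different legal value."""
    new = dict(cfg)
    key = rng.choice(sorted(SPACE))
    new[key] = rng.choice([v for v in SPACE[key] if v != cfg[key]])
    return new

def simulated_annealing(cost, steps=500, t_start=4.0, t_end=0.05, seed=0):
    rng = random.Random(seed)
    current = {k: rng.choice(v) for k, v in sorted(SPACE.items())}
    current_cost = cost(current)
    best, best_cost = dict(current), current_cost
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        cand = neighbour(current, rng)
        cand_cost = cost(cand)
        delta = cand_cost - current_cost
        # Always accept improvements; accept regressions with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current, current_cost = cand, cand_cost
        if current_cost < best_cost:
            best, best_cost = dict(current), current_cost
    return best, best_cost

if __name__ == "__main__":
    cfg, ms = simulated_annealing(synthetic_runtime_ms)
    print(cfg, ms)
```

The occasional acceptance of a worse configuration is what lets the search leave a local optimum early on. Once the temperature is low the walk becomes almost greedy, which fits the observation above that a search can get stuck on some devices.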
# aticonfig

I tried the PowerXpress options, but the result was disappointing.

```shell
gpu@gpu-FP4:~/yuanshuai/code/CLBlast/build$ sudo aticonfig --px-list-active-gpu
PowerXpress: Discrete GPU is active (High-Performance mode).
gpu@gpu-FP4:~/yuanshuai/code/CLBlast/build$ sudo aticonfig --pxl
PowerXpress: Discrete GPU is...
```
I also found a tool named AGT via this link: [Manage your GPU HW · amd/OpenCL-caffe Wiki](https://github.com/amd/OpenCL-caffe/wiki/Manage-your-GPU-HW). However, it seems to be a Windows tool ([Download AMD GPU Clock Tool | TechPowerUp](https://www.techpowerup.com/download/amd-gpu-clock-tool/)...
## Adreno GPU SDK - FAQs - Qualcomm Developer Network

https://developer.qualcomm.com/software/adreno-gpu-sdk/faq

### What is included in the Adreno SDK for OpenCL?

This SDK includes usage examples for Qualcomm Technologies extensions...
OpenCL Tips · yszheda/wiki Wiki https://github.com/yszheda/wiki/wiki/OpenCL-Tips
Sub-optimal performance on Qualcomm Adreno GPUs · Issue #228 · CNugteren/CLBlast https://github.com/CNugteren/CLBlast/issues/228
Float16 GEMM on Adreno 330 · Issue #181 · CNugteren/CLBlast https://github.com/CNugteren/CLBlast/issues/181

This issue does not give a definite result for float16.
Local work size and work-group size

> ## Opencl global work size vs local work size
> In both cases the global size is 1024. In case 1, the local size...
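As a rough illustration of how the two sizes relate, the number of work-groups along one dimension is the global size divided by the local size. The Python sketch below uses hypothetical helper names (`work_group_count`, `round_up_global`), and the local sizes 64 and 256 are examples only, not the values from the quoted answer. It assumes the OpenCL 1.x rule that the global size must be a multiple of the local size.

```python
def work_group_count(global_size, local_size):
    """Number of work-groups along one dimension of an NDRange launch."""
    if local_size <= 0 or global_size <= 0 or global_size % local_size != 0:
        # Assumes OpenCL 1.x: global size must be a multiple of local size.
        raise ValueError("global size must be a positive multiple of local size")
    return global_size // local_size

def round_up_global(n_items, local_size):
    """Pad the global size up to the next multiple of the local size."""
    return ((n_items + local_size - 1) // local_size) * local_size

if __name__ == "__main__":
    for local in (64, 256):  # illustrative local sizes
        print(local, work_group_count(1024, local))
```

When the problem size is not a multiple of the chosen local size, a common pattern is to pad the global size with something like `round_up_global` and guard the extra work-items with a bounds check inside the kernel.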