Winograd-OpenCL
Are the results of the Winograd conv functions wrong?
Your work is a good effort for CNN developers. But when I tested your code, I found that the Winograd results are not equal to those of the traditional convolution. Why?
Sorry for the late reply, it was a holiday in Korea :(
It is an unpolished piece of code; you can see lots of commented-out lines in validate(...) and main(...) in main.c, so it's not functional as-is.
If you want to test, you should uncomment convolution_cpu(...) so that you can compare the results from the CPU and the GPU.
@TaihuLight @csehydrogen I made some changes to the code and the results are now equal to the traditional convolution. But I found that Winograd with 3x3 filters is slower than the traditional algorithm on the GPU in OpenCL. The paper "Fast Algorithms for Convolutional Neural Networks" showed Winograd performing better than the traditional approach with cuDNN. Looking forward to your answer.
I use simple OpenCL code ( https://github.com/mz24cn/clnet/blob/master/src/kernels.cl#L508 ) and achieve more than half the performance of other DL software based on CUDA/cuDNN when running LeNet-5. I am considering adopting the Winograd algorithm, and I want to know whether I can gain a substantial improvement from it. The code is very complex.
@mz24cn For the same device, the performance of an OpenCL app is lower than that of a CUDA implementation. This is attributable to OpenCL itself rather than to any particular OpenCL implementation. @fzuwill Could you share your modified code?
OpenCL being slower than CUDA is expected; the same holds for BLAS. On the same NVIDIA device, the best cuBLAS GEMM is roughly twice as fast as the best OpenCL library. The reason is supposedly that NVIDIA's official cuBLAS/cuDNN libraries include assembly-level optimizations. AMD's Vega GPUs also support assembly-level instruction optimization within OpenCL, and can reportedly reach performance similar to NVIDIA's flagship cards; see their MIOpen.
My question is mainly about comparing against plain OpenCL itself: how much faster can Winograd be? If the speedup is very limited (say, under 1-2x), it is not worth pursuing, because the code is very complex (like GEMM tiling, it brings complicated synchronization problems) and also introduces stability issues. The clBLAST/clBLAS libraries are like this: my tests show that in most cases their SGEMM is only about twice as fast as naive code, in some cases it is slower, and on certain hardware it even errors out.
Compared with CUDA, reaching 50% of its performance is acceptable at this stage; even if it weren't, there is nothing to be done. Only the chip vendors themselves have a chance of winning at this kind of thing. We all work at a higher layer and can't do much about it.
@mz24cn https://github.com/CNugteren/CLBlast/issues/95 I get over a two-times speed increase with the Strided function in Winograd convolution (https://github.com/CNugteren/CLBlast/issues/237). GemmBatched gave 433 nodes/s and GemmStridedBatched 1047 nodes/s. https://github.com/gcp/leela-zero/pull/523
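The difference between the two batched routines is in how per-batch operands are addressed: GemmBatched takes a list of offsets/pointers (one per batch), while GemmStridedBatched derives each batch's matrices from a base pointer plus a fixed stride, avoiding the pointer-array setup and transfer. A minimal CPU sketch of the strided addressing pattern (row-major, names illustrative, not the CLBlast API):

```c
/* Strided-batched matrix multiply: C[b] = A[b] * B[b] for b in [0, batch).
 * Each batch's matrices start at base + b*stride, as in *StridedBatched APIs. */
static void gemm_strided_batched(const float *A, const float *B, float *C,
                                 int M, int N, int K, int batch,
                                 long strideA, long strideB, long strideC) {
    for (int b = 0; b < batch; b++) {
        const float *a  = A + b * strideA;   /* batch b's A matrix */
        const float *bm = B + b * strideB;   /* batch b's B matrix */
        float       *c  = C + b * strideC;   /* batch b's C matrix */
        for (int i = 0; i < M; i++)
            for (int j = 0; j < N; j++) {
                float acc = 0.0f;
                for (int k = 0; k < K; k++)
                    acc += a[i*K + k] * bm[k*N + j];
                c[i*N + j] = acc;
            }
    }
}
```

Winograd convolution maps naturally onto this shape: each of the 16 (for F(2x2,3x3)) transform positions is one batch entry at a fixed stride, so the strided variant removes per-batch launch and indexing overhead.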
But I have not run it successfully! You can try it!
I suggest that your repo https://github.com/mz24cn/clnet would be better implemented in pure C + OpenCL rather than C++, like Darknet!
Why do you think pure C is better? For what reasons? In clnet I mix references and pointers, and on this point I am still hesitating; I am considering giving up references, since heavy use of references is inconvenient, especially when putting objects into containers. I think the Java style works well. Looking at Darknet's code, it sometimes passes struct pointers and sometimes passes structs by value, which is also messy, and passing structs by value performs worse than passing by reference.
In other respects C++ is quite good; I rarely use templates, so the code doesn't end up looking fragmented. You wouldn't be leaning toward C just because your hardware has no C++ compiler, would you? Haha, just kidding.
C++ is too complex (classes and so on), and I don't like it. My device and OS support g++ 7.2. Pure C is more suitable for learning and development for most programmers.
"@TaihuLight @csehydrogen I made some changes to the code and the results are now equal to the traditional convolution. But I found that Winograd with 3x3 filters is slower than the traditional algorithm on the GPU in OpenCL. The paper "Fast Algorithms for Convolutional Neural Networks" showed Winograd performing better than the traditional approach with cuDNN. Looking forward to your answer."
So @fzuwill, could you please open a pull request with your changes and paste your results?
I believe this kernel code is wrong!
One work-group (== CUDA's thread block) launches 256 threads: the first 128 do the image transform and the second 128 do the filter transform.
```c
if (tid < 128) { // image transform
    ...
    barrier(CLK_LOCAL_MEM_FENCE);
    ...
} else { // filter transform
    ...
}
```
However, there is a work-group-wide barrier (which must be reached by all 256 threads) inside the if branch of divergent control flow, which will cause a deadlock!
@csehydrogen That is why @TaihuLight observed results that are not equal.
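In OpenCL, `barrier(CLK_LOCAL_MEM_FENCE)` must be executed by every work-item in the work-group, or behavior is undefined (typically a hang or corrupted results). One common fix is to hoist the barrier out of the divergent branch into uniform control flow. A sketch of the pattern (identifiers and phase split are illustrative, not the repo's actual kernel):

```c
// Broken: barrier inside a divergent branch; only the first 128 of 256
// work-items ever reach it, so the group can never synchronize.
//
// if (tid < 128) { /* image transform */ barrier(CLK_LOCAL_MEM_FENCE); ... }
// else           { /* filter transform */ }

// Fixed: split the work around the barrier so that ALL 256 work-items
// execute the barrier in uniform control flow.
if (tid < 128) {
    // image transform, phase 1: write intermediate values to local memory
} else {
    // filter transform (no barrier needed inside)
}
barrier(CLK_LOCAL_MEM_FENCE);   // reached by the whole work-group
if (tid < 128) {
    // image transform, phase 2: safe to read what phase 1 wrote
}
```

The same restructuring applies to every barrier that currently sits inside the `if (tid < 128)` branch.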