KernelAbstractions.jl
Doc: what does CPU() do?
Thank you for this beautiful library!
In contrast with another recent issue, I'm finding a rather large speedup for the CPU kernel that I have implemented. The docs don't say anything that might explain it. Is it using oneAPI.jl or some other smart Julia rewrite?
In my application (some form of dynamic optimization) I have a large loop that is conceptually SIMD, even though I'm using higher-level constructs which apparently prevent it from being auto-vectorized (even with the @simd macro). It does not allocate.
Using Julia's Threads I get a 6x speedup over the single-core version (I have 8 cores). Using a KA/CPU kernel I get a 12x speedup.
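Schematically, the two versions look something like this hypothetical sketch (not my actual code; `f` stands in for the per-element work):

```julia
using KernelAbstractions

# Placeholder for the real per-element work; the actual code uses
# higher-level constructs that defeat auto-vectorization.
f(x) = muladd(x, x, one(x))

# Plain threaded version (the one giving ~6x on 8 cores):
function update_threads!(out, x)
    Threads.@threads for i in eachindex(out, x)
        @inbounds out[i] = f(x[i])
    end
    return out
end

# KA version (the one giving ~12x):
@kernel function update!(out, @Const(x))
    i = @index(Global)
    @inbounds out[i] = f(x[i])
end

x = rand(Float32, 2^20); out = similar(x)
update_threads!(out, x)

# KA v0.9-style launch; older KA versions used events and wait().
backend = CPU()
update!(backend)(out, x; ndrange = length(x))
KernelAbstractions.synchronize(backend)
```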
The KA/CPU version seems rather efficient, and the 12x (versus 6x with plain threads) suggests some degree of vectorization, but I can't find a way to confirm my suspicion. Is that what happens?
Sorry for not providing a minimal working example. I could work one out with some effort, but I'd first like to know how suspicious I should be of nicer-than-expected results.
You can use @macroexpand on the @kernel to see the code KA generates for the CPU.
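For example, on a toy doubling kernel (the kernel name is just for illustration):

```julia
using KernelAbstractions

# @macroexpand (from Base) shows the lowered code that the @kernel
# macro generates, including the CPU-specific lowering of @index:
@macroexpand @kernel function mul2!(A)
    I = @index(Global)
    @inbounds A[I] = 2 * A[I]
end
```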
Thank you, @vchuravy!
I did what you suggested and looked at the generated code, but there is no useful information there, as most of the logic seems to be implemented in the KA library itself. I've tried to dig into the source and noticed plenty of inference barriers and what look like compiler directives, but I must admit I'm out of my depth as to what exactly causes the performance gains.
I found no mention of oneAPI, so I assume it has nothing to do with it.
In the code there is a comment that says "# Vectorization, 4x unrolling, minimal grain size", but I have no idea where that is implemented. I also checked the output of the macros @ka_code_typed and @ka_code_llvm, but didn't find any hint of explicit vectorization.
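For comparison, here is the kind of check I do know how to do on a plain function, using Base tooling (a sketch with a made-up axpy-style loop); vector types such as `<8 x float>` in the printed IR indicate auto-vectorization:

```julia
using InteractiveUtils  # provides code_llvm

function axpy!(y, a, x)
    @inbounds @simd for i in eachindex(y, x)
        y[i] = muladd(a, x[i], y[i])
    end
    return y
end

# Vector types like `<8 x float>` and instructions such as
# `fmul <8 x float>` in the printed IR show that LLVM
# auto-vectorized the loop.
code_llvm(axpy!, Tuple{Vector{Float32}, Float32, Vector{Float32}})
```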
So my working assumption is that, for each core of the CPU, KA generates LLVM code that is suitable for auto-vectorization? Is that true?