libCEED
Clang JIT CPU Backend
Clang 16 now supports JIT. An interesting small project could be to create a /cpu/self/clang-jit backend that provides JITed tensor contraction kernels. If we see performance that is in the neighborhood of AVX or libXSMM, this could be a way to ship a faster CPU backend with fewer dependencies.
See Serac for reference:
- https://github.com/LLNL/serac/blob/prototype/adjoints_with_internal_variables/tests/jit/basic_jit.cpp
- https://github.com/LLNL/serac/blob/prototype/adjoints_with_internal_variables/include/JIT.hpp
(This repo comes from a member of Jamie Smith's team.)
Certainly interesting, but do note that we have a limited number of tensor-contraction size combinations, so this is more a fallback for the case where we find that compile-time constant sizes are a huge benefit but we can't pare the combinatorial space down enough to do ahead-of-time specialization.
A different use might be to use JIT to build single-precision versions of select kernels.
Right, I'd expect that if we enumerated a bunch of kernels ahead of time across combinations of `p`, `q`, `num_comp`, and blocked/serial, we'd see the same performance, but that approach is intractable.
WRT performance, I just mean that my gut expects such a backend to land between AVX and libXSMM, but without the need for a user to build libXSMM, so we might get a little better performance in our upcoming Ratel + Enzyme container.
I agree that single-precision kernels would be an interesting avenue to explore too, since it makes it easier to get mixed-precision capabilities.
It's a low-effort test to see if specializing one particular size has much benefit. Like just drop in some integer literals and run a benchmark using matching sizes. If it's a lot faster, we can see if specializing all the values is important or, say, just one matters. If it's about the same, we don't need to pursue the idea (at least until we learn more).
That's a good point. It's an easy test to check if someone finds time. I don't see this as a particular priority - 50% of why I created this issue was so we don't lose track of this as an option.
Reviving this - @YohannDudouit has had a lot of good success with putting a `#pragma omp for` before the element loop and seeing 100x or better speedup on the CPU. Our current design for CPU backends doesn't make this approach tractable, but if we created a /cpu/self/clang-jit backend that works like /gpu/cuda/gen, we could bypass the issues (largely that each thread needs independent scratch space for the E-vecs and Q-vecs, and memory has to be passed through CeedVectors on each thread to call through the API to apply CeedRestriction, CeedBasis, and CeedQFunction). This should be pretty straightforward to set up, but rather time consuming to do the work and debug.
> Reviving this - @YohannDudouit has had a lot of good success with putting a `#pragma omp for` before the element loop and seeing 100x or better speedup on the CPU.
I'm assuming that's comparing a single thread to OMP multi-threaded? 100x operator performance improvement seems wild assuming it's not on a 100 core machine.
Do we have any idea if MPI+OMP would be better than just straight MPI? My understanding is that MPI+OMP has better memory utilization, but doesn't necessarily reduce communication, since the blocking dependencies have been moved rather than eliminated.
@jrwrigh It does not even necessarily have better memory utilization, because it interferes with vectorization (as pointed out by Jed in his talk here: https://cse.buffalo.edu/~knepley/relacs.html#workshop-siam-pp-2016). Those talks show that MPI+OMP is almost always worse than MPI alone. See also https://figshare.com/articles/journal_contribution/Exascale_Computing_without_Threads-Barry_Smith_pdf/5824950?file=10305396
I don't recall the exact details since I did not write the code, but I did watch the code go 100xish faster when compiled with OMP than without on the same problems on some LLNL resources.
> > Reviving this - @YohannDudouit has had a lot of good success with putting a `#pragma omp for` before the element loop and seeing 100x or better speedup on the CPU.
>
> I'm assuming that's comparing a single thread to OMP multi-threaded? 100x operator performance improvement seems wild assuming it's not on a 100 core machine.
>
> Do we have any idea if MPI+OMP would be better than just straight MPI? My understanding is that MPI+OMP has better memory utilization, but doesn't necessarily reduce communication, since the blocking dependencies have been moved rather than eliminated.
@jrwrigh It was on a 112-core machine: https://hpc.llnl.gov/hardware/compute-platforms/dane