
Clang JIT CPU Backend

Open jeremylt opened this issue 2 years ago • 4 comments

Clang 16 now supports JIT. An interesting small project could be to create a /cpu/self/clang-jit backend that provides JITed tensor contraction kernels. If we see performance that is in the neighborhood of AVX or libXSMM, this could be a way to ship a faster CPU backend with fewer dependencies.
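To make the shape of the idea concrete, here is a minimal sketch of the "fewer dependencies" flavor of JIT: generate size-specialized kernel source at runtime, compile it with clang into a shared object, and dlopen the result. Everything here (file paths, the kernel signature, the flags) is made up for illustration and is not the proposed backend API; a real backend could instead drive the Clang/LLVM JIT libraries directly, as in the Serac prototype linked below.

```c
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

typedef void (*ContractFn)(int num_elem, const double *B, const double *u, double *v);

// Emit a contraction kernel with P and Q baked in as integer literals,
// compile it to a shared object with clang, and load the symbol.
static ContractFn jit_contraction(int P, int Q) {
  FILE *src = fopen("/tmp/ceed_kernel.c", "w");
  if (!src) return NULL;
  fprintf(src,
          "void kernel(int num_elem, const double *B, const double *u, double *v) {\n"
          "  for (int e = 0; e < num_elem; e++)\n"
          "    for (int q = 0; q < %d; q++) {\n"
          "      double sum = 0.0;\n"
          "      for (int p = 0; p < %d; p++) sum += B[q*%d + p] * u[e*%d + p];\n"
          "      v[e*%d + q] = sum;\n"
          "    }\n"
          "}\n",
          Q, P, P, P, Q);
  fclose(src);

  if (system("clang -O3 -march=native -shared -fPIC /tmp/ceed_kernel.c -o /tmp/ceed_kernel.so")) return NULL;
  void *handle = dlopen("/tmp/ceed_kernel.so", RTLD_NOW);
  return handle ? (ContractFn)dlsym(handle, "kernel") : NULL;
}
```

A backend built this way would cache one such function pointer per size combination it actually encounters, so the combinatorial space never has to be enumerated ahead of time.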

See Serac for reference: https://github.com/LLNL/serac/blob/prototype/adjoints_with_internal_variables/tests/jit/basic_jit.cpp https://github.com/LLNL/serac/blob/prototype/adjoints_with_internal_variables/include/JIT.hpp

(This repo comes from a member of Jamie Smith's team.)

jeremylt avatar Jun 21 '23 20:06 jeremylt

Certainly interesting, but do note that we have a limited number of size combinations in our tensor contractions, so this is really a solution for the case where we find that compile-time constant sizes are a huge benefit and that we can't pare down that combinatorial space enough to do ahead-of-time specialization.

A different use might be to use JIT to build single-precision versions of select kernels.

jedbrown avatar Jun 21 '23 20:06 jedbrown

Right, I'd expect that if we enumerated a bunch of kernels ahead of time across combos of p, q, num_comp, and blocked/serial we'd see the same performance, but that approach is intractable.

WRT performance, I just mean that my gut expects such a backend to land between AVX and libXSMM, but without the need for a user to build libXSMM, so we might get a little better performance in our upcoming Ratel + Enzyme container.

I agree that single-precision kernels would be an interesting avenue to explore too, so it's easier to get mixed-precision capabilities.

jeremylt avatar Jun 21 '23 21:06 jeremylt

It's a low-effort test to see whether specializing one particular size has much benefit: just drop in some integer literals and run a benchmark using matching sizes. If it's a lot faster, we can see whether specializing all the values is important or, say, just one matters. If it's about the same, we don't need to pursue the idea (at least until we learn more).
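As a concrete shape for that test (the row-major layout and the literal sizes below are just for illustration, not libCEED's internal layout), compare a contraction that takes its sizes at runtime with the identical loop nest using integer literals:

```c
// Runtime-size contraction: P and Q arrive as arguments, so the compiler
// cannot fully unroll or specialize the inner loops.
static void contract_runtime(int P, int Q, int num_elem, const double *B,
                             const double *u, double *v) {
  for (int e = 0; e < num_elem; e++)
    for (int q = 0; q < Q; q++) {
      double sum = 0.0;
      for (int p = 0; p < P; p++) sum += B[q * P + p] * u[e * P + p];
      v[e * Q + q] = sum;
    }
}

// Specialized contraction: same loop nest with the sizes dropped in as
// integer literals (here P = 4, Q = 6), letting the compiler unroll and
// vectorize for exactly those sizes.
enum { P_FIX = 4, Q_FIX = 6 };
static void contract_fixed(int num_elem, const double *B, const double *u, double *v) {
  for (int e = 0; e < num_elem; e++)
    for (int q = 0; q < Q_FIX; q++) {
      double sum = 0.0;
      for (int p = 0; p < P_FIX; p++) sum += B[q * P_FIX + p] * u[e * P_FIX + p];
      v[e * Q_FIX + q] = sum;
    }
}
```

Timing both on the same data at matching sizes answers the "is it a lot faster?" question before any JIT machinery gets built.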

jedbrown avatar Jun 21 '23 21:06 jedbrown

That's a good point. It's an easy test to check if someone finds time. I don't see this as a particular priority - 50% of why I created this issue was so we don't lose track of this as an option.

jeremylt avatar Jun 21 '23 21:06 jeremylt

Reviving this - @YohannDudouit has had a lot of good success with putting a #pragma omp for before the element loop and seeing 100x or better speedup on the CPU. Our current design for CPU backends doesn't make this approach tractable, but if we created a /cpu/self/clang-jit backend that works like /gpu/cuda/gen, then we could bypass the issues (largely around each thread needing independent scratch space for the E-vecs and Q-vecs, and around needing to pass memory through CeedVectors on each thread to call through the API to apply CeedElemRestriction, CeedBasis, and CeedQFunction). This should be pretty straightforward to set up, but rather time-consuming to do the work and debug.
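A rough sketch of the fused loop shape this describes, with thread-private scratch in place of shared work vectors. None of this is the actual backend code: the inlined restriction/basis/q-function below stand in for CeedElemRestriction, CeedBasis, and CeedQFunction, and a real implementation would use element coloring or ownership partitioning rather than atomics for the scatter-add.

```c
#include <stdlib.h>

// Hypothetical fused, OpenMP-threaded operator apply for a single-component
// mass-like operator: gather, apply B (Q x P), pointwise q-function, apply
// B^T, scatter-add. All names and the data layout are illustrative.
void apply_operator(int num_elem, int P, int Q, const int *elem_dofs /* num_elem x P */,
                    const double *B /* Q x P */, const double *qweight /* Q */,
                    const double *u, double *v) {
  #pragma omp parallel
  {
    // Each thread owns its E-vector and Q-vector scratch, so no shared
    // work buffers are needed inside the element loop.
    double *ue = malloc((size_t)P * sizeof(double));
    double *uq = malloc((size_t)Q * sizeof(double));

    #pragma omp for
    for (int e = 0; e < num_elem; e++) {
      for (int p = 0; p < P; p++) ue[p] = u[elem_dofs[e * P + p]]; // restriction (gather)
      for (int q = 0; q < Q; q++) {                                // basis interp + q-function
        double s = 0.0;
        for (int p = 0; p < P; p++) s += B[q * P + p] * ue[p];
        uq[q] = qweight[q] * s;
      }
      for (int p = 0; p < P; p++) {                                // basis transpose + scatter-add
        double s = 0.0;
        for (int q = 0; q < Q; q++) s += B[q * P + p] * uq[q];
        #pragma omp atomic
        v[elem_dofs[e * P + p]] += s;
      }
    }
    free(ue);
    free(uq);
  }
}
```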

jeremylt avatar Aug 12 '25 16:08 jeremylt

> Reviving this - @YohannDudouit has had a lot of good success with putting a #pragma omp for before the element loop and seeing 100x or better speedup on the CPU.

I'm assuming that's comparing a single thread to OMP multi-threaded? 100x operator performance improvement seems wild assuming it's not on a 100-core machine.

Do we have any idea whether MPI+OMP would be better than just straight MPI? My understanding is that MPI+OMP has better memory utilization, but doesn't necessarily reduce communication, since the blocking dependencies have been moved rather than eliminated.

jrwrigh avatar Aug 12 '25 18:08 jrwrigh

@jrwrigh It does not even necessarily have better memory utilization, because it interferes with vectorization (as pointed out by Jed in his talk here: https://cse.buffalo.edu/~knepley/relacs.html#workshop-siam-pp-2016). Those talks show that MPI+OMP is almost always worse than MPI alone. See also https://figshare.com/articles/journal_contribution/Exascale_Computing_without_Threads-Barry_Smith_pdf/5824950?file=10305396

knepley avatar Aug 12 '25 18:08 knepley

I don't recall the exact details since I did not write the code, but I did watch the code run roughly 100x faster when compiled with OMP than without, on the same problems, on some LLNL resources.

jeremylt avatar Aug 12 '25 18:08 jeremylt

> > Reviving this - @YohannDudouit has had a lot of good success with putting a #pragma omp for before the element loop and seeing 100x or better speedup on the CPU.

> I'm assuming that's comparing a single thread to OMP multi-threaded? 100x operator performance improvement seems wild assuming it's not on a 100-core machine.

> Do we have any idea whether MPI+OMP would be better than just straight MPI? My understanding is that MPI+OMP has better memory utilization, but doesn't necessarily reduce communication, since the blocking dependencies have been moved rather than eliminated.

@jrwrigh It was on a 112-core machine: https://hpc.llnl.gov/hardware/compute-platforms/dane

YohannDudouit avatar Aug 12 '25 21:08 YohannDudouit