libCEED
CUDA/HIP Backend Refactor
Because of how they were designed, there is considerable code duplication across the CUDA backends, and the HIP backends, which were derived from the CUDA backends, inherited that same duplication.
We should make a PR, or series of PRs, specifically designed to refactor these backends and reduce the code duplication across them.
- What code could/should be combined?
- Where and how do we need to allow for differences between the backends and platforms?
- Where do we want to test to prevent regression from aggressive or incorrect amalgamation between CUDA and HIP?
- What needs to be done to allow the code generation backends (`gpu/cuda/gen` and `gpu/hip/gen`) to share kernels from the other backends?
@tcew, @jedbrown, and anyone else who I'm missing but is interested, please feel free to jump into this issue or the discussion with thoughts I'm overlooking.
I would unify `hip/kernels/*` and `cuda/kernels/*` into `gpu_common_kernels/*` (or something similar); currently the code is duplicated. In the future it could diverge, and HIP and CUDA could have different code, but I think the design would be better if we abstracted away the fact that we target HIP or CUDA architectures. This would allow us to test new implementations in a more modular way. We can have different implementations of the same algorithms living under `common_gpu_kernels`, and which one we use is chosen when loading the source. The implementations loaded can differ between HIP and CUDA, the purpose being to try different implementations for different scenarios. We could imagine loading different implementations based not only on HIP/CUDA, but also on the number of quadrature points and degrees of freedom. We already know that there is no single best-performing approach for all cases and that different implementations work better for different cases; this design could handle that as well.
Also, there are implementations that result in less register pressure. In certain applications, register pressure can become an issue when the QFunction gets big; this design would also allow changing the parallelization strategy to accommodate this kind of issue.
This design would potentially allow fusing the `magma` backend kernels too.
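To make the selection idea concrete, here is a minimal sketch of choosing kernel source at load time, assuming a hypothetical `SelectBasisKernelSource` helper and illustrative file names under `gpu_common_kernels/`; none of this is existing libCEED API, and the size thresholds are placeholders:

```cpp
typedef enum { GPU_ARCH_CUDA, GPU_ARCH_HIP } GpuArch;

// Pick which common kernel source to JiT-compile based on the target
// architecture and the problem size.
static const char *SelectBasisKernelSource(GpuArch arch, int num_dofs_1d, int num_qpts_1d) {
  // Architecture-specific overrides would go here if HIP and CUDA ever need
  // different code; today the same source should serve both.
  (void)arch;
  // Small tensor sizes: a fully unrolled, register-resident variant tends to win.
  if (num_dofs_1d <= 4 && num_qpts_1d <= 5)
    return "gpu_common_kernels/basis-tensor-unrolled.h";
  // Larger sizes or big QFunctions: a shared-memory variant with lower register pressure.
  return "gpu_common_kernels/basis-tensor-shared.h";
}
```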
On the topic of implementations that would be specific to HIP or CUDA, I am not aware of any such cases.
Generalizing code to target either HIP or CUDA is relatively trivial; the architecture-specific keywords can easily be abstracted behind macros (`CEED_DEVICE`, `CEED_HOST`, `CEED_HOST_DEVICE`, `CEED_MEM_SHARED`, etc.).
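As a rough illustration, a macro layer along these lines could live in a shared header; the `CEED_GPU_HIP` guard and the `Scale` example below are assumptions for the sketch, not the definitions libCEED actually uses:

```cpp
// Architecture-abstraction header shared by the GPU backends.
// HIP mirrors the CUDA keywords, so the macros map to the same qualifiers either
// way; any future divergence (e.g., warp size) would be captured here.
#if defined(CEED_GPU_HIP)
#  include <hip/hip_runtime.h>
#else  // CUDA
#  include <cuda_runtime.h>
#endif

#define CEED_DEVICE      __device__
#define CEED_HOST        __host__
#define CEED_HOST_DEVICE __host__ __device__
#define CEED_MEM_SHARED  __shared__

// Kernel code written against the macros compiles unchanged for either target.
CEED_DEVICE void Scale(double *u, double alpha) { *u *= alpha; }
```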
I think a smaller first step could be refactoring the code generation backends to share the kernels that other backends use. Currently there are some minor differences, but I don't know why those differences were added.
This is a good point; if we gather the code in one place, then we also have to document the reasons for any differences there.
My proposal above is not a "first step" but a goal. I guess the different tasks would be:

- Gather common code between hip/cuda
- Generalize code that differs between hip and cuda to be architecture agnostic
- Generalize the `gen` interface to make sense with simplices (is it just the same as 1D tensor?)
- Unify the interface of the `ref` and `shared` implementations with the `gen` interface; this would allow generating `gen` kernels from `ref` and `shared` functions, but also from any other implementation idea that might come (see the sketch after this list)
- Add a mechanism to pick and generate a specific implementation
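To illustrate the unification idea, here is a minimal sketch assuming a hypothetical templated device routine (`InterpTensor1d`) and a local `CeedScalar` typedef; it is not the current libCEED kernel interface:

```cpp
typedef double CeedScalar;  // libCEED scalar type (double by default)

// A single templated device routine that a ref-style backend can launch through a
// thin wrapper kernel and a gen-style backend can inline into its fused operator kernel.
template <int P_1D, int Q_1D>
__device__ void InterpTensor1d(const CeedScalar *__restrict__ interp_1d,
                               const CeedScalar *__restrict__ u,
                               CeedScalar *__restrict__ v) {
  // Contract the 1D interpolation matrix (Q_1D x P_1D) with the element DOFs.
  for (int q = threadIdx.x; q < Q_1D; q += blockDim.x) {
    CeedScalar sum = 0.0;
    for (int p = 0; p < P_1D; p++) sum += interp_1d[q * P_1D + p] * u[p];
    v[q] = sum;
  }
}

// ref-style standalone launch; a gen-style generated kernel would instead call
// InterpTensor1d<P_1D, Q_1D>() directly next to the QFunction in one fused kernel.
template <int P_1D, int Q_1D>
__global__ void Interp(const CeedScalar *interp_1d, const CeedScalar *u, CeedScalar *v) {
  InterpTensor1d<P_1D, Q_1D>(interp_1d, u, v);
}
```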
For the long term health of these backends, I think we should do a cleanup and refactor in the near term. Combining kernels across the CUDA and HIP backends should come after this near term refactor. I don't know enough about performance studies between CUDA and HIP to attempt combining pieces of these two backend 'families' myself, but I do know enough to refactor the backend design into something cleaner.
Proposed near term refactor roadmap:

- [x] Merge #841

PR 2

- [x] Tidy mechanism by which `*-shared` and `*-gen` reach into `*-ref` for JiT, `ceed` backend data, etc

PR 3

- [x] Pull kernel source strings into header files
- [x] Refactor `*-ref` and `*-shared` kernels to use templates for compatibility with `*-gen`

PR 4

- [ ] Refactor `*-gen` to break `Ceed*GenOperatorBuild` into smaller pieces (a rough decomposition sketch follows this list)
- [ ] Add simplex support to `*-gen`
- [x] Add collocated gradient support to `*-gen`
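For the remaining `Ceed*GenOperatorBuild` item, here is a rough sketch of the decomposition idea: per-stage helpers that each append one piece of the generated operator kernel source. The helper names and arguments are illustrative assumptions, not existing libCEED functions:

```cpp
#include <sstream>
#include <string>

// Each helper appends the source for one stage of the fused operator kernel,
// instead of one monolithic routine assembling the whole string.
static void BuildRestriction(std::ostringstream &code, int field) {
  code << "  // load field " << field << " through its element restriction\n";
}

static void BuildBasisApply(std::ostringstream &code, int field, int P_1d, int Q_1d) {
  code << "  // apply basis for field " << field << " (P_1d=" << P_1d << ", Q_1d=" << Q_1d << ")\n";
}

static void BuildQFunctionCall(std::ostringstream &code, const std::string &qf_name) {
  code << "  " << qf_name << "(ctx, Q_1d, in, out);\n";
}

// Top-level driver stays short and readable.
static std::string BuildOperatorKernel(const std::string &qf_name, int num_fields) {
  std::ostringstream code;
  code << "extern \"C\" __global__ void CeedGenOperatorKernel(...) {\n";
  for (int f = 0; f < num_fields; f++) {
    BuildRestriction(code, f);
    BuildBasisApply(code, f, /*P_1d=*/3, /*Q_1d=*/4);
  }
  BuildQFunctionCall(code, qf_name);
  code << "}\n";
  return code.str();
}
```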
I stalled out and focused on some Ratel work before wrapping up the final stage of this issue. @jedbrown, I think this last stage of the GPU `/gen` backend refactor would let us most easily incorporate the new basis (including particles) work into these backends. Depending upon prioritization, I think this would be a good thing to try to make time for in the spring.