KernelAbstractions support
The KernelAbstractions branch now compiles, so I thought I would put forward a quick draft PR while I figure out all the runtime bugs.
Notes:
- This builds off of #99 and should replace it entirely
- `CUDA` has been removed and replaced with `KernelAbstractions` and `GPUArrays`. As an important note here, GPUArrays is not strictly necessary except to replicate the behavior of the boolean GPU flag (i.e. `isa(a, AbstractGPUArray)`).
- If this is merged, other GPU types (Metal, AMD, Intel) will also be supported, but I can only test on AMD (and maybe Metal if I can get someone to try it with a Mac).
- I need to add in the changes from #133. If there is something we are missing on the KernelAbstractions side, I can try to add it in, but I think we are good to go.
Ah, I guess while I'm here, I'll briefly explain the syntactic differences from CUDA:
- Indexing is easier: `@index(Global / Group / Local, Linear / NTuple / CartesianIndex)` vs `(blockIdx().x - 1) * blockDim().x + threadIdx().x` for CUDA (see the sketch after this list)
- Kernels run off of an `ndrange` for the range of elements (OpenCL-inspired syntax)
- Launching kernels requires configuration with a backend, see: https://github.com/leios/Molly.jl/blob/KA_support/src/kernels.jl#L21
- Certain functions now execute on the backend: `CUDA.zeros(...)` -> `zeros(backend, args...)`
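To make the comparison concrete, here is a minimal, hedged sketch (not code from this PR; `cuda_scale!`, `ka_scale!`, and the workgroup size are made up for illustration, and `CUDABackend()` assumes a recent CUDA.jl) of the same element-wise kernel in both styles:

```julia
using KernelAbstractions, CUDA

# CUDA.jl style: the global index is computed by hand from block/thread IDs.
function cuda_scale!(a, s)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(a)
        @inbounds a[i] *= s
    end
    return nothing
end

# KernelAbstractions style: @index replaces the index arithmetic.
@kernel function ka_scale!(a, s)
    i = @index(Global, Linear)
    @inbounds a[i] *= s
end

backend = CUDABackend()                                # or CPU(), ROCBackend(), ...
a = KernelAbstractions.zeros(backend, Float32, 1024)   # allocation goes through the backend
a .= 1.0f0
ka_scale!(backend, 256)(a, 2.0f0; ndrange=length(a))   # workgroup size 256, ndrange = number of elements
KernelAbstractions.synchronize(backend)
```

The CUDA version would still be launched with `@cuda threads=... blocks=...`; it is shown only for the index arithmetic that `@index` hides.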
The tricky thing about this PR was removing the CUDA dependency outside of the kernels. There is still one call in zygote.jl that I need to figure out: https://github.com/leios/Molly.jl/blob/KA_support/src/zygote.jl#L698
Great work so far. Making the code compatible with generic array types is a nice change, and having the kernels work on different devices would be a selling point of the package.
I would be interested to see the performance of the kernels compared to the CUDA versions. Also whether it plays nicely with Enzyme. Good luck with the runtime errors.
I think I can finish this up today or else early next week (emphasis on think), but to quickly answer the questions:
- KA essentially just generates vendor-specific code (i.e. CUDA) from the generic kernel input, so if we don't see identical performance to CUDA, that's a bug. I'll do performance testing similar to https://github.com/JuliaMolSim/Molly.jl/pull/133 once the code is cleaned up.
- Enzyme should also not be an issue; however, there are some reports of error handling being an issue: https://github.com/EnzymeAD/Enzyme.jl/issues/365
Great. Not urgent, but how well does KernelAbstractions.jl deal with warp-level code, e.g. warpsize() and sync_warp()?
That's a good question. We can probably expose the APIs available from CUDA, but I am not sure how AMDGPU deals with these. We would also just need to figure out what that corresponds to on parallel CPU.
I think these are the tools we need: https://rocm.docs.amd.com/projects/rocPRIM/en/latest/warp_ops/index.html So they are available; it's just a matter of exposing them in KA and figuring out what they correspond to on different backends.
Ah, as an important note (that I somehow failed to mention before), an advantage of KA is that it also provides a parallel CPU implementation, so the kernels can be written once and executed everywhere. I didn't do that in this PR because that brings up design questions related to Molly internals.
> I didn't do that in this PR because that brings up design questions related to Molly internals.
Yeah we can discuss that after this PR. I would be okay with switching if there was no performance hit.
Judging from the discussion on the linked PR, there is not currently warp support in KA. It may be necessary to leave that CUDA kernel in and have a separate KA kernel for other backends until warp support comes to KA.
Ok, so a couple of quick notes here:
- There are a few host calls that are not yet supported by AMDGPU (such as `findall`). My understanding was that such calls would eventually be ported to GPUArrays, but I don't think that has happened yet. Note that some of the stalling here is because we are waiting to get KA into GPUArrays (https://github.com/JuliaGPU/GPUArrays.jl/pull/451). At least for `findall`, the kernel is not that complex: https://github.com/JuliaGPU/CUDA.jl/blob/master/src/indexing.jl#L23, so we could put it into AMDGPU or something for now; however, we are stuck on an older version of AMDGPU due to some package conflicts. The quick fix would be to do it the old-fashioned way and just stick the necessary kernels in Molly under a file like `kernel_hacks.jl` or something. Such issues were also what stalled #99.
- #133 seems to only use `warpsize` and `warp_sync` for warp-level semantics. The KA kernel would probably get the warp size on the host and then pass it in as a parameter (a rough sketch of this follows the list). `warp_sync` is a bit more interesting because, well, at least in the old days warps didn't need any synchronizing. It seems that things changed in Volta and most people missed the memo. Because of this, the easiest thing to do would be to keep the CUDA dependency for that one kernel. We could also add warp-level semantics to KA, but that would take some time to propagate to all the independent GPU APIs and, as mentioned in the first point, we are kinda stuck on older versions of AMDGPU and CUDA because of compatibility with other packages.
- I am realizing that there is a greater conflict with this PR. Namely, I don't know if I have the bandwidth to do any sort of maintenance on Molly after this PR is in. I don't know if it's fair to ask you to merge 1000 lines of code with a new API and then leave. On the other hand, getting this to work on AMD would be great and really useful. Let me think on that.
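As a rough sketch of that host-side approach (illustrative only: the kernel name is made up, and `CUDA.warpsize(device())` is the host-side query I believe CUDA.jl provides), the warp size would be read once on the host and handed to the kernel as an ordinary argument:

```julia
using KernelAbstractions, CUDA

# Illustrative kernel that tiles work into warp-sized chunks using a warp size
# passed in from the host, rather than calling warpsize() in device code.
@kernel function chunk_ids!(ids, warp_size)
    i = @index(Global, Linear)
    @inbounds ids[i] = (i - 1) ÷ warp_size + 1   # which warp-sized chunk this element belongs to
end

backend = CUDABackend()
ws = CUDA.warpsize(CUDA.device())                 # 32 on NVIDIA; AMD wavefronts are 32 or 64
ids = KernelAbstractions.zeros(backend, Int32, 1024)
chunk_ids!(backend, 128)(ids, Int32(ws); ndrange=length(ids))
KernelAbstractions.synchronize(backend)
```

`warp_sync` has no counterpart in this scheme, which is why keeping the CUDA dependency for that one kernel looks like the pragmatic option.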
> Because of this, the easiest thing to do would be to keep the CUDA dependency for that one kernel.
That is okay.
> I don't know if it's fair to ask you to merge 1000 lines of code with a new API and then leave.
I wouldn't worry about this. Currently I only merge stuff that I am able to maintain, or where I think I can skill up to the point of maintaining it. The changes here seem reasonable and worth merging once any errors and performance regressions are fixed. There is a wider question about whether KernelAbstractions.jl will continue to be maintained compared to CUDA.jl, but it seems to have decent traction now.
Yeah, the plan is for KA to be used even within GPUArrays, so it's not going anywhere anytime soon. Speaking of which, the "correct" course of action for KA in Molly would be to get the KA in GPUArrays first and then use that to implement any missing features on the GPUArrays level.
Would it be better for me to separate this PR then? Maybe one doing the generic Array stuff and then another with the KA support?
I would try and get this PR working as is. Only if that becomes difficult would it be worth splitting out and merging the generic array support.
If KA is here for the long haul then there is a benefit to switching the kernels even if only CUDA works currently. Because then when changes happen elsewhere, AMDGPU will work without any changes required in Molly.
Codecov Report
Attention: Patch coverage is 12.79621% with 368 lines in your changes missing coverage. Please review.
Project coverage is 69.45%. Comparing base (`49d92d9`) to head (`1868f3f`).
| Files | Patch % | Lines |
|---|---|---|
| src/kernels.jl | 1.07% | 184 Missing |
| src/chain_rules.jl | 0.99% | 100 Missing |
| src/interactions/implicit_solvent.jl | 5.26% | 36 Missing |
| ext/MollyCUDAExt.jl | 0.00% | 12 Missing |
| src/neighbors.jl | 28.57% | 10 Missing |
| src/spatial.jl | 33.33% | 8 Missing |
| src/zygote.jl | 0.00% | 8 Missing |
| ext/MollyPythonCallExt.jl | 0.00% | 2 Missing |
| src/energy.jl | 33.33% | 2 Missing |
| src/force.jl | 33.33% | 2 Missing |
| ... and 3 more | | |
Additional details and impacted files
@@ Coverage Diff @@
## master #147 +/- ##
==========================================
- Coverage 71.58% 69.45% -2.13%
==========================================
Files 37 38 +1
Lines 5546 5723 +177
==========================================
+ Hits 3970 3975 +5
- Misses 1576 1748 +172