
GPU compatibility

Open SebastianM-C opened this issue 5 years ago • 4 comments

I think it would be useful to discuss here the design changes required for working on GPUs.

@jgreener64, could you give a bit of detail on what's needed for working on GPUs? I was wondering if KernelAbstractions.jl could help.

SebastianM-C avatar Aug 17 '20 14:08 SebastianM-C

Yeah it's a good time to discuss it, and in the context of the package more broadly. We want an API where users can plug and play their own components, yet one that is:

  • fast on the CPU
  • parallel on the CPU
  • fast on the GPU
  • differentiable.

This is clearly a big ask, and under the hood we will likely have to have multiple implementations. I think the DiffEq ecosystem has two versions, in-place and out-of-place, of some algorithms.

Personally, I am willing to sacrifice some speed to achieve the above aims. I doubt we will ever be able to match the likes of GROMACS for speed, but if we are in the same ballpark and can offer a powerful and flexible API then we have a useful contribution.

My current thinking is to have two implementations, which would be used like this:

|              | Standard | Differentiable |
|--------------|----------|----------------|
| CPU          | A        | B              |
| CPU parallel | A        | -              |
| GPU          | B        | B              |

A is the current implementation on master, which is already 2 implementations in the sense that it branches based on parallel, but we'll ignore that. B would be a broadcasted, out-of-place version that allocates more memory but works with Zygote and makes use of GPU broadcasting without custom kernels. Of course we could write GPU kernels for everything, but that would be another implementation and if we can get decent performance with broadcasting I'd prefer to do that.
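To make the A/B split concrete, here is a rough sketch of the two styles for a pairwise force. All names here (`lj_force`, `forces_inplace!`, `forces_broadcast`) are illustrative, not Molly's actual API: A mutates a preallocated accumulator, which is fast on the CPU but breaks Zygote; B computes the pair forces with an out-of-place broadcast, which allocates more but is AD- and GPU-friendly.

```julia
using StaticArrays

# Illustrative Lennard-Jones force on atom i from atom j, dr = coords[i] - coords[j]
function lj_force(dr, σ=1.0, ϵ=1.0)
    r2 = sum(abs2, dr)
    f = 24ϵ * (2 * (σ^2 / r2)^6 - (σ^2 / r2)^3) / r2
    return f .* dr
end

# Implementation A: in-place double loop, mutates a preallocated force array
function forces_inplace!(fs, coords)
    n = length(coords)
    for i in 1:n, j in (i + 1):n
        f = lj_force(coords[i] - coords[j])
        fs[i] += f
        fs[j] -= f
    end
    return fs
end

# Implementation B: out-of-place broadcast over precomputed pair indices is/js
# (the per-atom accumulation below is a plain loop for clarity; a GPU version
# would use a segmented reduction or scatter instead)
function forces_broadcast(coords, is, js)
    fpairs = lj_force.(coords[is] .- coords[js])
    fs = [zero(first(coords)) for _ in eachindex(coords)]
    for (k, (i, j)) in enumerate(zip(is, js))
        fs[i] += fpairs[k]
        fs[j] -= fpairs[k]
    end
    return fs
end
```

The broadcast in B maps directly onto GPU array broadcasting, at the cost of materialising the per-pair force array.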

I have some rough work on the gpu branch that runs but is slower than the main CPU implementation, currently by a factor of 5x on CPU and 2.5x on GPU. Profiling suggests the time is spent in broadcasting, but I'll have to dig further and probably ask some people who know more about Julia on GPUs. It is nearly differentiable; I think there is one adjoint missing. Consequently, this could be the foundation for implementation B.

There are some changes to the force functions to make broadcasting work. We would probably have to make the corresponding changes to implementation A to make the APIs match, but I think it's possible without sacrificing performance.

KernelAbstractions.jl is a possibility though I don't know too much about it, or how it works with autodiff.

Let me know what you think of the above thoughts.

jgreener64 avatar Aug 17 '20 19:08 jgreener64

I am in favor of the in-place and out-of-place APIs. I think this would go well with a future DiffEq integration (i.e. replacing the simulators with ODEProblems).

I'm still not sure why the out-of-place version would allocate more on the CPU if we use StaticArrays. I will take a closer look at that branch.

Regarding differentiability, I think it would be simpler to do it after the DiffEq integration (I'll use a separate issue for that discussion).

SebastianM-C avatar Aug 19 '20 00:08 SebastianM-C

Regarding performance and broadcasting, one of the optimizations in the energy function PR was about specializing the dr computation for SVectors by manually unrolling the broadcast. I'm not sure how this would work in the GPU context, but that computation is one of the most performance sensitive parts of the code (as almost anything needs that distance).
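For reference, this is roughly the kind of specialization meant here (function names are illustrative, not the actual PR's): a generic broadcast versus a version manually unrolled for 3D `SVector`s, where the compiler sees plain scalar arithmetic with no broadcast machinery at all.

```julia
using StaticArrays

# Generic broadcast: works for any container, relies on SVector's fused broadcast
dr_broadcast(c1, c2) = c1 .- c2

# Manually unrolled for 3D SVectors: every component access is static
dr_unrolled(c1::SVector{3}, c2::SVector{3}) =
    SVector(c1[1] - c2[1], c1[2] - c2[2], c1[3] - c2[3])
```

On the GPU the situation differs: the hot broadcast is over an *array* of `SVector`s, so the unrolling would have to live inside the broadcast kernel rather than replace it.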

Also, regarding benchmarks, I would be very curious to see how we compare to GROMACS. Do you have a test problem for that?

SebastianM-C avatar Aug 19 '20 00:08 SebastianM-C

Sounds sensible. I have made progress on fixing the broadcasting issues, which turned out to be due to GPU array views in a new broadcast I introduced to loop over ij pairs. The GPU implementation now looks faster than the CPU implementation by >10x. It's still rough and ready, and calculates each force twice, but I'll have some time to work on it in the next week.
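A minimal sketch of the pattern (assumed, not the exact branch code): precompute integer index vectors for the ij pairs, then index the coordinate array with them directly, since plain integer indexing of a GPU array stays on a fast path where views can fall back to slow element-wise access.

```julia
using StaticArrays

n = 4
# Each interaction appears twice (ij and ji), matching "calculates each force twice"
prs = [(i, j) for i in 1:n for j in 1:n if i != j]
is = first.(prs)
js = last.(prs)

coords = rand(SVector{3, Float64}, n)  # would be e.g. a CuArray of SVectors on GPU
drs = coords[is] .- coords[js]          # one broadcast over all pairs, no views
```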

> Also, regarding benchmarks, I would be very curious to see how we compare to GROMACS. Do you have a test problem for that?

I used to have GROMACS timings for the peptide test case, which is nice in that it combines bonded and non-bonded forces, but I think they are at work and I am now working from home. It shouldn't be too hard to run a short simulation from the top file in data/5XER to get a rough time per timestep; I'll try and do that at some point. We are a way off in terms of implemented algorithms though: we would need to implement cell-based neighbour lists, Ewald summation and SHAKE-like hydrogen constraint algorithms for that case to be a fair comparison.
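As a reference point for the first of those, here is the naive O(N²) neighbour list that a cell-based version would have to reproduce (a sketch, not Molly's API): a cell list bins atoms into boxes of side ≥ cutoff and only checks neighbouring boxes, but returns the same pairs.

```julia
using StaticArrays

# All-pairs neighbour list with a distance cutoff; squared distances avoid sqrt
function neighbour_list(coords, cutoff)
    pairs = Tuple{Int, Int}[]
    for i in eachindex(coords), j in (i + 1):length(coords)
        if sum(abs2, coords[i] - coords[j]) <= cutoff^2
            push!(pairs, (i, j))
        end
    end
    return pairs
end
```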

jgreener64 avatar Aug 20 '20 13:08 jgreener64

The GPU path now uses CUDA.jl kernels and works with AD.

jgreener64 avatar Apr 19 '23 17:04 jgreener64