                        GPU compatibility
I think it would be useful to discuss here the design changes required for working on GPUs.
@jgreener64, could you go into a bit of detail on what's needed for working on GPUs? I was wondering if KernelAbstractions.jl could help.
Yeah it's a good time to discuss it, and in the context of the package more broadly. We want an API where users can plug and play their own components, yet which is:
- fast on CPU
- able to run in parallel on CPU
- fast on GPU
- differentiable.
 
This is clearly a big ask, and under the hood we will likely have to have multiple implementations. I think the DiffEq ecosystem has two versions, in-place and out-of-place, of some algorithms.
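Roughly, the two styles look like this. This is a sketch with a toy pair force, not code from the package:

```julia
using StaticArrays

# Toy pair force for illustration only: a soft repulsion, f(dr) = dr / |dr|^4.
pairforce(ci, cj) = (dr = cj .- ci; dr ./ sum(abs2, dr)^2)

# In-place style: mutate a preallocated buffer. Fast and allocation-free,
# but mutation rules out Zygote and complicates GPU broadcasting.
function forces!(fs, coords)
    fill!(fs, zero(eltype(fs)))
    for i in 1:length(coords), j in (i + 1):length(coords)
        f = pairforce(coords[i], coords[j])
        fs[i] -= f
        fs[j] += f
    end
    return fs
end

# Out-of-place style: build and return a new array. Allocates more, but is
# non-mutating, so it composes with Zygote and with GPU array broadcasting.
forces(coords) = [sum(pairforce(coords[j], coords[i]) for j in eachindex(coords) if j != i)
                  for i in eachindex(coords)]

coords = [rand(SVector{3, Float64}) for _ in 1:10]
forces!(similar(coords), coords) ≈ forces(coords)  # same result either way
```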
Personally, I am willing to sacrifice some speed to achieve the above aims. I doubt we will ever be able to match the likes of GROMACS for speed, but if we are in the same ballpark and can offer a powerful and flexible API then we have a useful contribution.
My current thinking is to have two implementations, which would be used like this:
| | Standard | Differentiable |
|---|---|---|
| CPU | A | B |
| CPU parallel | A | - |
| GPU | B | B |
A is the current implementation on master, which is really two implementations already in the sense that it branches based on the parallel flag, but we'll ignore that. B would be a broadcasted, out-of-place version that allocates more memory but works with Zygote and makes use of GPU broadcasting without custom kernels. Of course we could write GPU kernels for everything, but that would be yet another implementation, and if we can get decent performance with broadcasting I'd prefer to do that.
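As a sketch of the broadcasting idea (again with a toy force, not the gpu branch itself): `permutedims` turns the coordinate vector into a 1×N row, so a single broadcast produces the N×N matrix of pair forces with no loops or scalar indexing:

```julia
using StaticArrays

# Toy pair force, as in the sketch above.
pairforce(ci, cj) = (dr = cj .- ci; dr ./ sum(abs2, dr)^2)
# Guard the i == j diagonal so the broadcast stays NaN-free.
pairforce_safe(ci, cj) = ci == cj ? zero(ci) : pairforce(ci, cj)

# Out-of-place, broadcast-only force computation: fmat[i, j] is the force on
# atom i from atom j; summing over j gives the total force on each atom.
# In principle the same code runs on a CuArray and differentiates with
# Zygote, at the cost of O(N^2) memory for the intermediate matrix.
forces_bc(coords) = vec(sum(pairforce_safe.(permutedims(coords), coords); dims=2))
```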
I have some rough work on the gpu branch that runs but is slower than the main CPU implementation: currently about 5x slower when run on the CPU and 2.5x slower on the GPU. Profiling suggests the time is spent in broadcasting, but I'll have to dig further and probably ask some people who know more about Julia on GPUs. It is nearly differentiable; I think there is one adjoint missing. This could be the foundation for implementation B.
There are some changes to the force functions to make broadcasting work. We would probably have to make the corresponding changes to implementation A to make the APIs match, but I think it's possible without sacrificing performance.
KernelAbstractions.jl is a possibility though I don't know too much about it, or how it works with autodiff.
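For reference, the KernelAbstractions.jl model is to write one kernel and instantiate it for whichever backend the arrays live on. A minimal sketch in the current style (the launch API has changed between versions, so treat the details loosely):

```julia
using KernelAbstractions

# Scale x into y elementwise; the same kernel runs on CPU or GPU backends.
@kernel function scale_kernel!(y, a, @Const(x))
    i = @index(Global)
    @inbounds y[i] = a * x[i]
end

x = rand(Float32, 1024)
y = similar(x)
backend = get_backend(x)  # CPU() here; a CuArray would give a CUDA backend
scale_kernel!(backend)(y, 2f0, x; ndrange=length(x))
KernelAbstractions.synchronize(backend)
```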
Let me know what you think of the above thoughts.
I am in favor of the in-place and out-of-place APIs. I think this would go well with a future DiffEq integration (i.e. replacing the simulators with ODEProblems).
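A sketch of what that integration might look like, using a placeholder force field and OrdinaryDiffEq's second-order problem type (none of this is existing Molly code):

```julia
using OrdinaryDiffEq, StaticArrays

# Placeholder acceleration function: a harmonic well pulling atoms to the origin.
accel(v, u, p, t) = -u

u0 = [rand(SVector{3, Float64}) for _ in 1:10]   # positions
v0 = [zero(SVector{3, Float64}) for _ in 1:10]   # velocities
prob = SecondOrderODEProblem(accel, v0, u0, (0.0, 1.0))
sol = solve(prob, VelocityVerlet(); dt=0.002)
```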
I'm still not sure why the out-of-place version would allocate more on the CPU if we use StaticArrays. I will take a closer look at that branch.
Regarding differentiability, I think it would be simpler to tackle it after the DiffEq integration (I'll open a separate issue for that discussion).
Regarding performance and broadcasting, one of the optimizations in the energy function PR was specializing the dr computation for SVectors by manually unrolling the broadcast. I'm not sure how this would work in the GPU context, but that computation is one of the most performance-sensitive parts of the code (as almost everything needs that distance).
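The shape of that optimization, as I understand it (illustrative only, not the PR code):

```julia
using StaticArrays

# Generic broadcast version of the displacement between two coordinates.
dr_broadcast(c1, c2) = c2 .- c1

# Specialization for 3D SVectors: write the subtraction out by component so
# no broadcast machinery is involved in the hottest inner loop.
dr_unrolled(c1::SVector{3}, c2::SVector{3}) =
    SVector(c2[1] - c1[1], c2[2] - c1[2], c2[3] - c1[3])
```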
Also, regarding benchmarks, I would be very curious to see how we compare to GROMACS. Do you have a test problem for that?
Sounds sensible. I have made progress on the broadcasting issues, which were due to GPU array views in a new broadcast I introduced to loop over ij pairs. The GPU implementation now looks more than 10x faster than the CPU implementation. It's still rough and ready, and calculates each force twice, but I'll have some time to work on it in the next week.
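In the spirit of that change, a sketch of broadcasting over precomputed ij pair index vectors rather than an N×N matrix. Indexing with an index vector is a gather, which CuArray and Zygote both support, and listing each pair in both orderings (calculating each force twice) avoids a scatter when accumulating:

```julia
using StaticArrays

# Toy pair force, as in the sketches above.
pairforce(ci, cj) = (dr = cj .- ci; dr ./ sum(abs2, dr)^2)

# is/js list both orderings of every interacting pair, so fs[k] is the force
# on atom is[k] and each atom's total is a plain sum over its own entries.
function forces_pairs(coords, is, js, n)
    fs = pairforce.(coords[js], coords[is])  # gather + broadcast, GPU-friendly
    # The accumulation is a scalar loop here for clarity; on the GPU it
    # would be a per-atom or segmented reduction instead.
    out = [zero(first(coords)) for _ in 1:n]
    for (k, i) in enumerate(is)
        out[i] += fs[k]
    end
    return out
end
```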
> Also, regarding benchmarks, I would be very curious to see how we compare to GROMACS. Do you have a test problem for that?
I used to have GROMACS timings for the peptide test case, which is nice in that it combines bonded and non-bonded forces, but I think they are at work and I am now working from home. It shouldn't be too hard to run a short simulation from the top file in data/5XER to get a rough time per timestep; I'll try to do that at some point. We are a way off in terms of implemented algorithms though: we would need to implement cell-based neighbour lists, Ewald summation and SHAKE-like hydrogen constraint algorithms for that case to be a fair comparison.
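For reference, the cell-list idea in its simplest form (a hypothetical helper, not Molly API; it ignores periodic wrapping and omits the final cutoff distance check on candidate pairs):

```julia
using StaticArrays

# Bin atoms into cells of side ≥ cutoff; only atoms in the same or adjacent
# cells can be within the cutoff, giving O(N) neighbour finding.
function cell_neighbours(coords, box::SVector{3}, cutoff)
    ncell = max.(1, Int.(fld.(box, cutoff)))  # cells per dimension
    side = box ./ ncell
    cellof(c) = CartesianIndex(Tuple(clamp.(1 .+ Int.(fld.(c, side)), 1, ncell)))
    cells = Dict{CartesianIndex{3}, Vector{Int}}()
    for (i, c) in enumerate(coords)
        push!(get!(cells, cellof(c), Int[]), i)
    end
    pairs = Tuple{Int, Int}[]
    for (cell, is) in cells, di in CartesianIndices((-1:1, -1:1, -1:1))
        js = get(cells, cell + di, nothing)
        js === nothing && continue
        for i in is, j in js
            i < j && push!(pairs, (i, j))  # each candidate pair once
        end
    end
    return pairs
end

box = SVector(10.0, 10.0, 10.0)
coords = [box .* rand(SVector{3, Float64}) for _ in 1:100]
pairs = cell_neighbours(coords, box, 1.0)
```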
The GPU path now uses CUDA.jl kernels and works with AD.
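A minimal CUDA.jl kernel in that spirit (a toy all-pairs force, not the Molly kernels): one thread per atom accumulates its own total, so no atomic operations are needed:

```julia
using CUDA, StaticArrays

function pair_forces_kernel!(fs, coords, n)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= n
        f = zero(SVector{3, Float32})
        ci = coords[i]
        for j in 1:n
            if j != i
                dr = ci - coords[j]
                f += dr / sum(abs2, dr)^2  # toy repulsion, as in the sketches above
            end
        end
        @inbounds fs[i] = f
    end
    return nothing
end

n = 1024
coords = CuArray(rand(SVector{3, Float32}, n))
fs = CUDA.fill(zero(SVector{3, Float32}), n)
@cuda threads=256 blocks=cld(n, 256) pair_forces_kernel!(fs, coords, n)
```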