Torch GPU Solver
Hi @mancellin
I have working code using Torch in a way that doesn't require any additional hard dependencies (on import it either reports the GPU in use, or prints a message asking to install torch/CUDA).
I expect this will need some modification for optimization & coverage at your discretion.
I think we can do more with this: as you mentioned in #443, if we put the core matrix building into a torch-specific pattern, we would likely get a big speed-up given the memory bottleneck I'm currently experiencing. However, I don't have any experience with that.
As far as library utility goes, I don't think any other BEM solver has GPU support yet, so this would really put Capytaine on the map :)
A note on performance: I am running what used to be a day-long job in about 1 hour on my Nvidia 4070.
In my experience with the GPU linear solve I used from jax.numpy, the data transfer alone overwhelmed the performance gains; only above 20k panels did it show a noticeable improvement.
Would you be able to run some experiments varying the number of panels and measuring the time?
@KapilKhanal I think this really depends on your memory speed, so mileage may vary. New hardware has a memory speed of 6500 whereas the previous generation was about half of that.
Do you know of any benchmark code that exists?
Thank you @SoundsSerious !
I am a bit reluctant to merge this PR into the core of the library. Mostly because I don't have the experience nor the hardware to test and benchmark the feature. Also because dependencies and packaging are giving me enough headaches without adding PyTorch and CUDA on top, even as optional dependencies.
However, the goal of Capytaine is to be a library that advanced users can easily customize to their needs. I'd be happy to add right now to the cookbook in the documentation an example of a resolution with GPU. How far can you get just by passing your solve_gpu function to the solver initialization, as suggested in #443?
Also note that the function you updated in the PR (solve_direct) is not the current default used by Capytaine.
Indeed, it is often more efficient to compute the LU decomposition instead and reuse it several times (e.g. for each dof at a given frequency). I saw that torch.linalg has lu_factor and lu_solve, so maybe you can optimize your GPU code with those.
A note on performance: I am running what used to be a day-long job in about 1 hour on my Nvidia 4070.
Could you share some details about the size of the problem? (Number of panels, number of frequencies, number of dofs, number of wave directions, and how parallel the CPU computation was.)
@mancellin
Of course, I understand; this is outside the scope of Capytaine. However, if you check the commit implementation, I think this could be considered fail-safe, as a bad import is handled by not installing the gpu_solver method, along with a warning on import.
Another way to handle the import would be a dummy solver function that raises a helpful error on execution, telling the user how to fall back to the default solver or install the appropriate software. This would allow a more "batteries included" design, à la Python's design philosophy, while preventing downstream failures.
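A minimal sketch of that fallback idea (the name gpu_solve and the error message are illustrative only, not Capytaine's actual API): if the optional dependency is missing, a stub is installed that raises a helpful error only when it is actually called, rather than failing at import time.

```python
import numpy as np

try:
    import torch  # optional GPU dependency

    def gpu_solve(A, b):
        # Move the system to the GPU if one is available, solve, copy back.
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        A_t = torch.from_numpy(np.asarray(A)).cfloat().to(device)
        b_t = torch.from_numpy(np.asarray(b)).cfloat().to(device)
        return torch.linalg.solve(A_t, b_t).cpu().numpy()

except ImportError:
    def gpu_solve(A, b):
        # Stub installed instead of the real solver: the error is raised
        # only if the user actually tries to use the GPU path.
        raise RuntimeError(
            "GPU solver unavailable: install torch with CUDA support, "
            "or keep using the default CPU solver."
        )
```

Either way, downstream code can always reference gpu_solve, and the failure mode is an explicit, actionable error rather than an AttributeError.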
To your point, this is also something that could be included in a cookbook for people to use and would provide the same advantages.
I do think the LU factor methods would be a good idea. In my testing I am finding that the gains are limited near the GPU memory limit, so any way to further improve the performance would be great.
To your point, this is also something that could be included in a cookbook for people to use and would provide the same advantages.
The main difference is about managing users' expectations. I'd be very happy to send the message that "Capytaine allows advanced users to leverage their GPU knowledge while using Capytaine". I'm more reluctant about "Capytaine includes a seamless GPU solver", at least for now.
Here is a code sample for LU factorisation with a cache (only tested on CPU, since I don't have a recent enough Nvidia GPU). The main missing optimization compared to default Capytaine is that it does not fully take advantage of vertical symmetry planes.
import torch
import numpy as np
import xarray as xr
import capytaine as cpt

## Custom linear solver setup
latest_A = None
latest_LU_decomp = None

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def lu_linear_solver_with_cache_on_GPU(A, b):
    global latest_A, latest_LU_decomp
    if A is not latest_A:
        latest_A = A
        A_on_GPU = torch.from_numpy(np.array(A)).cfloat().to(device)
        latest_LU_decomp = torch.linalg.lu_factor(A_on_GPU)
        print("Computing LU decomposition with Pytorch")
    else:
        print("Reusing the same LU decomposition with Pytorch")
    b_on_GPU = torch.from_numpy(b.reshape(-1, 1)).cfloat().to(device)
    res = torch.linalg.lu_solve(*latest_LU_decomp, b_on_GPU)
    return res.cpu().numpy()

solver = cpt.BEMSolver(engine=cpt.BasicMatrixEngine(linear_solver=lu_linear_solver_with_cache_on_GPU))

## Your test case
mesh = cpt.mesh_sphere().immersed_part()
body = cpt.FloatingBody(mesh, dofs=cpt.rigid_body_dofs())
test_matrix = xr.Dataset(coords={
    "omega": np.linspace(0.1, 4.0, 10),
    "radiating_dof": list(body.dofs),
})
ds = solver.fill_dataset(test_matrix, body)
The main difference is about managing users' expectations. I'd be very happy to send the message that "Capytaine allows advanced users to leverage their GPU knowledge while using Capytaine". I'm more reluctant about "Capytaine includes a seamless GPU solver", at least for now.
Case in point:
After getting access to an old Nvidia GPU, I spent a few hours today figuring out the last version of PyTorch supporting that GPU, updating your code to be compatible with that version, then getting (as expected) worse performance than the CPU for small meshes and some unexpected CUDA errors for larger meshes.
As the one on the front line of Capytaine's users support, I don't want the responsibility to guide them through that.
As the one on the front line of Capytaine's users support, I don't want the responsibility to guide them through that.
Of course, I completely sympathize with this; I salute you for supporting this library so well. It's a great tool!
I completely sympathize with the CUDA install issues; it was a real pain to get working on Windows Subsystem for Linux. Your comments about adding to the cookbook make the most sense, I think. Where in the docs would you want me to move this code to support that?
Regarding GPU performance, I found that my card with 12GB of memory was significantly faster up to a certain mesh size; beyond that, performance gains were on par with or slightly worse than the CPU. How much memory did your card have? I could see a card maybe 10 years old with only 1GB of memory having serious issues even at small mesh sizes. Today's top-of-the-line GPUs are thousands of times faster than those of just a few years ago, so for those who have invested in that tech, navigating this complication could yield a major benefit.
Here is the solver performance for various dense matrices, in average time elapsed per case, using different solvers.
Where in the docs would you want me to move this code to support that?
I guess you can add an example in the "advanced" section of https://github.com/capytaine/capytaine/blob/f5f0b9fb439d9d865adc1c91255b87537b9bbed1/docs/user_manual/cookbook.rst. See the other examples for reference.
Here is the solver performance of various dense matrices in average time elapsed per case using different solvers.
Thank you for this impressive result.
Can you confirm that this is only the time to solve the linear system, without the time to build the matrix?
Also, I guess dim is the number of panels in the mesh. Then the jump at 40000 panels is indeed probably a memory issue, as storing the matrix of double-precision complex numbers would itself require more than your 12GB of memory.
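As a sanity check on that estimate (assuming one double-precision complex number, i.e. 16 bytes, per matrix entry):

```python
# Memory to store a dense N x N matrix of complex128 entries (16 bytes each).
def dense_matrix_gib(n, bytes_per_entry=16):
    return n * n * bytes_per_entry / 2**30

for n in (10000, 20000, 40000):
    print(f"{n:>6} x {n:<6}: {dense_matrix_gib(n):6.1f} GiB")
```

At dim = 40000 the matrix alone needs roughly 24 GiB, double a 12GB card, while dim = 20000 (about 6 GiB) still fits alongside the LU workspace.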
Awesome, I'll put together something hopefully around the end of April (big deadline, etc.).
This is the benchmark time of solving the system, which also includes loading the data onto GPU memory. You're also correct that dim is the size of the matrix.
The analysis was done with several iterations of solving a dense dim x dim matrix of randomly generated coefficients (not sparse). I would expect a sparse system could see further speed-up, although I think BEM matrices are generally non-sparse unless you make special provision for that? I am not sure.
I would expect a sparse system could see further speed-up, although I think BEM matrices are generally non-sparse unless you make special provision for that? I am not sure.
BEM matrices are never sparse (in the sense of having a lot of zeros in the matrix). They can be data-sparse (in the sense that storing all the coefficients is not necessary to retain most of the information), but this is still experimental in Capytaine and I'm not sure how it would behave on a GPU.
First of all, great work - really exciting to see someone running Capytaine on the GPU and exactly the kind of innovative community contribution we were hoping to see when we decided to support this code! 😁
I also trust Matthieu's judgement implicitly and I wouldn't wish to create additional burdens for him.
Would it be possible to plot the blue line divided by the green line? I would really like to see the speed-up factor as a function of mesh size. Generally most of my models have meshes with nPanels < 2000; it's difficult to see what the impact of the GPU is in that region of the plot.
Cheers! Dave
@dav-og
Happy to help out and give back to the community. I was thrilled to find a BEM solver in Python after coming back into the ocean energy space last year. It's been a game changer, since it mixes easily with our other dependencies.
The code I used to benchmark the solvers is nothing special, but it compares each of the true solver methods in an apples-to-apples way. The results show roughly an 8x speed-up between 1000 and 20000 cells on my hardware. I think this will likely be very machine-dependent given advances in memory speed (I'm using a 12GB Nvidia 4070 with FSB @ 3600MHz).
import timeit
import numpy
import torch
from matplotlib.pylab import subplots

def solve_numpy_cpu(a, b):
    numpy.linalg.solve(a, b)

def solve_torch_cpu(a, b):
    a = torch.from_numpy(a).cfloat()
    b = torch.from_numpy(b).cfloat()
    torch.linalg.solve(a, b)

def solve_torch_gpu(a, b):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    a = torch.from_numpy(a).cfloat().to(device)
    b = torch.from_numpy(b).cfloat().to(device)
    torch.linalg.solve(a, b)
    if device.type == "cuda":
        # CUDA kernels run asynchronously; wait for the solve to finish
        # so that the measured time includes the actual computation.
        torch.cuda.synchronize()

results = {}
dims = [5, 10, 50, 100, 500, 1000, 5000, 10000, 20000, 30000, 40000]
funcs = (solve_numpy_cpu, solve_torch_cpu, solve_torch_gpu)
func_dict = {f.__name__: f for f in funcs}
times = 5

def main():
    for dim in dims:
        a = numpy.random.rand(dim, dim)
        b = numpy.random.rand(dim)
        for fname, f in func_dict.items():
            time = timeit.timeit(lambda: f(a, b), number=times)
            if fname in results:
                results[fname].append(time)
            else:
                results[fname] = [time]
            print(f"{fname}: {dim:<10}|{time:f}")

main()

fig, ax = subplots()
for fname, res in results.items():
    ax.plot(dims[:len(res)], res, label=fname)
ax.grid()
ax.legend()
ax.set(yscale='log', xscale='log', xlabel='Dimension', ylabel='Time (s)')
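Responding to the question about the speed-up factor: the ratio of CPU to GPU times can be derived directly from the results dict above. A sketch with placeholder numbers (not measured data; in practice substitute results["solve_numpy_cpu"] and results["solve_torch_gpu"]):

```python
import numpy as np
import matplotlib.pyplot as plt

# Speed-up factor = CPU solve time / GPU solve time, per matrix dimension.
# The values below are placeholders for illustration, not measurements.
dims = np.array([1000, 5000, 10000, 20000])
cpu_times = np.array([0.05, 2.0, 12.0, 90.0])   # placeholder
gpu_times = np.array([0.04, 0.5, 1.6, 11.0])    # placeholder

speedup = cpu_times / gpu_times

fig, ax = plt.subplots()
ax.plot(dims, speedup, marker="o")
ax.axhline(1.0, linestyle="--", color="gray")  # break-even line
ax.set(xscale="log", xlabel="Dimension", ylabel="Speed-up (CPU / GPU)")
ax.grid()
```

A linear y-axis with the break-even line at 1.0 makes it easy to read off where the GPU starts to pay for the transfer overhead, including in the nPanels < 2000 region.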
@mancellin I'm hoping to document this feature in the location you specified, earlier than previously stated, as my client's NSF deadline got extended, so hopefully we can close this soon.
This chart, I think, is a bit misleading :)
