
AMGCL hybrid mode using CPU(s)+GPU(s)

klausbu opened this issue 4 years ago • 5 comments

Hi Denis,

I'd like to run OpenFOAM in a hybrid mode using CPU(s)+GPU(s), splitting the workload between CPU(s) and GPU(s) on a single node/workstation and using the GPU(s) as booster(s). This way, GPU(s) can be leveraged when available, taking into account the typically limited GPU memory, compute performance, and PCI-E bandwidth. A custom PCG implementation showed significant performance gains, but it wasn't published and didn't use a preconditioner.

Is there a way to link the amg or better amg_mpi backend with the VexCL or cuda backend? Or maybe there's a better approach?

Klaus

klausbu avatar Dec 09 '19 15:12 klausbu

I did once write an OpenFOAM interface to amgcl (https://github.com/mattijsjanssens/mattijs-extensions/tree/master/applications/test/amgcl/amgclSolver), but that used AMG-preconditioned CG and wasn't faster on the case I tried it on. I'd be interested to see what you have found.

Mattijs

mattijsjanssens avatar Dec 09 '19 16:12 mattijsjanssens

I know it - good stuff! As it is, it's not faster, I know (OpenFOAM is highly optimized), but the hybrid approach should make a difference by using the GPU(s) as "additional high-performance core(s)".

Klaus

klausbu avatar Dec 09 '19 17:12 klausbu

The only backend in amgcl that would allow using both CPU and GPU at the same time is VexCL over OpenCL:

./examples/solver_vexcl_cl -n 128
1. Intel(R) Core(TM) i7 CPU         920  @ 2.67GHz (Intel(R) CPU Runtime for OpenCL(TM) Applications)
2. Tesla K40c (NVIDIA CUDA)

Solver
======
Type:             BiCGStab
Unknowns:         2097152
Memory footprint: 112.00 M

Preconditioner
==============
Number of levels:    4
Operator complexity: 1.62
Grid complexity:     1.13
Memory footprint:    744.74 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0      2097152       14581760    553.22 M (61.61%)
    1       263552        7918340    168.88 M (33.46%)
    2        16128        1114704     20.01 M ( 4.71%)
    3          789          53055      2.62 M ( 0.22%)

Iterations: 10
Error:      2.50965e-09

[Profile:          6.153 s] (100.00%)
[ self:            0.539 s] (  8.75%)
[  assembling:     0.378 s] (  6.14%)
[  setup:          3.124 s] ( 50.78%)
[  solve:          2.112 s] ( 34.32%)

VexCL will divide each matrix and vector between the compute devices. Each backend operation, such as spmv or inner_product, will be performed by all devices in parallel:

https://speakerdeck.com/ddemidov/vexcl-gpgpu-without-the-agonizing-pain?slide=27
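To illustrate the multi-device behavior, here is a minimal sketch (assuming VexCL is installed and at least one OpenCL platform with double-precision support is available; the vector size is illustrative) of how a single VexCL context picks up all matching devices and partitions a vector across them:

```cpp
#include <iostream>
#include <vexcl/vexcl.hpp>

int main() {
    // Select every OpenCL device with double-precision support.
    // With both an Intel CPU runtime and an NVIDIA GPU installed,
    // the context will contain both devices (as in the listing above).
    vex::Context ctx(vex::Filter::DoublePrecision);
    std::cout << ctx << std::endl; // prints the selected devices

    // The vector is automatically partitioned between the devices;
    // each device owns and operates on its own part.
    vex::vector<double> x(ctx, 1 << 21);
    x = 2.0;

    // Reductions (e.g. inner products) run on all devices in parallel,
    // and the partial results are combined on the host.
    vex::Reductor<double, vex::SUM> sum(ctx);
    std::cout << sum(x) << std::endl;
}
```

By default the vector is split proportionally to a simple device bandwidth benchmark, which is exactly where the CPU/GPU balancing difficulty mentioned below comes from.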

But, although it is technically possible to use the CPU and GPU at the same time, it is very hard to balance the workload between the devices so that the solution is actually faster than on each compute device separately. I think it would be easier to use the MPI version, where each MPI process uses a single compute device. Balancing the workload between such MPI processes should be more straightforward.

> Is there a way to link the amg or better amg_mpi backend with the VexCL or cuda backend? Or maybe there's a better approach?

You can use any backend with serial or MPI version of amgcl just by specifying the correct backend class.
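For example, switching between backends is just a change of one typedef in amgcl's template interface (a sketch; the coarsening, relaxation, and iterative solver choices here are illustrative):

```cpp
#include <amgcl/backend/vexcl.hpp>
#include <amgcl/make_solver.hpp>
#include <amgcl/amg.hpp>
#include <amgcl/coarsening/smoothed_aggregation.hpp>
#include <amgcl/relaxation/spai0.hpp>
#include <amgcl/solver/bicgstab.hpp>

// Pick the backend; the rest of the solver definition stays the same.
typedef amgcl::backend::vexcl<double> Backend;
// typedef amgcl::backend::builtin<double> Backend; // serial/OpenMP CPU

typedef amgcl::make_solver<
    amgcl::amg<
        Backend,
        amgcl::coarsening::smoothed_aggregation,
        amgcl::relaxation::spai0
        >,
    amgcl::solver::bicgstab<Backend>
    > Solver;

// The VexCL backend also needs to know which command queues to use,
// passed through the backend parameters:
//
//   vex::Context ctx(vex::Filter::DoublePrecision);
//   Backend::params bprm;
//   bprm.q = ctx.queue();
//   Solver solve(A, Solver::params(), bprm);
```

The distributed solvers in amgcl::mpi are parameterized by the backend in the same way, so each MPI process can be bound to its own compute device.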

ddemidov avatar Dec 09 '19 18:12 ddemidov

@mattijsjanssens , I think that was quite some time ago, maybe it is time to reevaluate? I could help if you are interested :).

ddemidov avatar Dec 09 '19 18:12 ddemidov

FYI: I referred to the thesis "Design and Optimization of OpenFOAM-based CFD Applications for Modern Hybrid and Heterogeneous HPC Platforms" by Ms. Amani AlOnazi, 2013

klausbu avatar Dec 09 '19 22:12 klausbu