AMGCL hybrid mode using CPU(s)+GPU(s)
Hi Denis,
I'd like to run OpenFOAM in a hybrid mode using CPU(s)+GPU(s), splitting the workload between CPU(s) and GPU(s) on a single node/workstation and using the GPU(s) as booster(s). This way the GPU(s) can be leveraged when available, taking into account the usually limited GPU memory, compute performance, and PCI-E bandwidth. A custom PCG implementation showed significant performance gains, but it wasn't published and didn't use a preconditioner.
Is there a way to link the amg or, better, the amg_mpi backend with the VexCL or CUDA backend? Or maybe there's a better approach?
Klaus
I did once write an OpenFOAM interface to amgcl (https://github.com/mattijsjanssens/mattijs-extensions/tree/master/applications/test/amgcl/amgclSolver), but that used AMG-preconditioned CG and wasn't faster on the array I tried it on. I'd be interested to see what you have found.
Mattijs
I know it - good stuff! It's not faster as it is, I know (OpenFOAM is highly optimized), but the hybrid approach should make a difference by using the GPU(s) as "additional high-performance core(s)".
Klaus
The only backend in amgcl that would allow using both CPU and GPU at the same time is VexCL over OpenCL:
./examples/solver_vexcl_cl -n 128
1. Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz (Intel(R) CPU Runtime for OpenCL(TM) Applications)
2. Tesla K40c (NVIDIA CUDA)
Solver
======
Type:             BiCGStab
Unknowns:         2097152
Memory footprint: 112.00 M

Preconditioner
==============
Number of levels:    4
Operator complexity: 1.62
Grid complexity:     1.13
Memory footprint:    744.74 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0      2097152       14581760    553.22 M (61.61%)
    1       263552        7918340    168.88 M (33.46%)
    2        16128        1114704     20.01 M ( 4.71%)
    3          789          53055      2.62 M ( 0.22%)

Iterations: 10
Error:      2.50965e-09

[Profile:      6.153 s] (100.00%)
[ self:        0.539 s] (  8.75%)
[ assembling:  0.378 s] (  6.14%)
[ setup:       3.124 s] ( 50.78%)
[ solve:       2.112 s] ( 34.32%)
VexCL will divide each matrix and vector between the compute devices. Each backend operation, such as spmv or inner_product, will be performed by all devices in parallel:
https://speakerdeck.com/ddemidov/vexcl-gpgpu-without-the-agonizing-pain?slide=27
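To make that concrete, here is a minimal stand-alone VexCL sketch (not from this thread; the device filter is an assumption) showing a single context that spans every double-precision OpenCL device, with vectors partitioned across all of them and each device computing its share of expressions and reductions:

#include <iostream>
#include <vexcl/vexcl.hpp>

int main() {
    // Select every OpenCL device that supports double precision; with an
    // OpenCL CPU runtime installed this typically includes both the CPU
    // and the GPU(s) of the machine.
    vex::Context ctx(vex::Filter::DoublePrecision);
    std::cout << ctx << std::endl;

    const size_t n = 1 << 21;

    // Vectors are automatically partitioned between the devices in ctx.
    vex::vector<double> x(ctx, n), y(ctx, n);
    x = 1.0;
    y = 2.0;

    // Each device works on its own part of the expression and of the
    // reduction; the partial sums are combined on the host.
    vex::Reductor<double, vex::SUM> sum(ctx);
    std::cout << "inner product = " << sum(x * y) << std::endl;
}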
But, although it is technically possible to use the CPU and the GPU at the same time, it is very hard to balance the workload between the devices so that the solution is actually faster than on each compute device separately. I think it would be easier to use the MPI version, where each MPI process would use a single compute device. Balancing the workload between such MPI processes should be more straightforward; a sketch of the per-process device selection follows.
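A minimal sketch of that approach, assuming the VexCL backend (the filter choices are assumptions; the amgcl MPI solver setup itself is omitted):

#include <iostream>
#include <mpi.h>
#include <vexcl/vexcl.hpp>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Each MPI process grabs exactly one double-precision device that has
    // not already been claimed by another process on the same node.
    vex::Context ctx(vex::Filter::Exclusive(
                vex::Filter::DoublePrecision && vex::Filter::Count(1)));

    std::cout << "rank " << rank << ": " << ctx << std::endl;

    // ... assemble the local part of the system and set up the amgcl MPI
    //     solver with the vexcl backend here, passing ctx through the
    //     backend parameters ...

    MPI_Finalize();
}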
Is there a way to link the amg or, better, the amg_mpi backend with the VexCL or CUDA backend? Or maybe there's a better approach?
You can use any backend with the serial or the MPI version of amgcl just by specifying the correct backend class.
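As a rough sketch of what "specifying the backend class" looks like (not from this thread: the toy 1D Poisson matrix and the smoothed_aggregation/spai0/bicgstab choices are just placeholders), switching from the built-in OpenMP backend to the VexCL backend only changes the backend typedef, the backend parameters, and the vector types:

#include <iostream>
#include <vector>
#include <tuple>

#include <vexcl/vexcl.hpp>

#include <amgcl/backend/vexcl.hpp>
#include <amgcl/adapter/crs_tuple.hpp>
#include <amgcl/make_solver.hpp>
#include <amgcl/amg.hpp>
#include <amgcl/coarsening/smoothed_aggregation.hpp>
#include <amgcl/relaxation/spai0.hpp>
#include <amgcl/solver/bicgstab.hpp>

int main() {
    // Toy 1D Poisson matrix in CRS format, standing in for the real system.
    ptrdiff_t n = 10000;
    std::vector<ptrdiff_t> ptr, col;
    std::vector<double> val, rhs(n, 1.0);

    ptr.push_back(0);
    for (ptrdiff_t i = 0; i < n; ++i) {
        if (i > 0)     { col.push_back(i - 1); val.push_back(-1.0); }
                         col.push_back(i);     val.push_back( 2.0);
        if (i < n - 1) { col.push_back(i + 1); val.push_back(-1.0); }
        ptr.push_back(col.size());
    }

    // The backend class is where the compute device is chosen;
    // amgcl::backend::builtin<double> would give the CPU (OpenMP) version
    // with std::vector<double> instead of vex::vector<double>.
    typedef amgcl::backend::vexcl<double> Backend;

    typedef amgcl::make_solver<
        amgcl::amg<
            Backend,
            amgcl::coarsening::smoothed_aggregation,
            amgcl::relaxation::spai0
            >,
        amgcl::solver::bicgstab<Backend>
        > Solver;

    // The VexCL backend needs the command queues of the selected devices.
    vex::Context ctx(vex::Filter::DoublePrecision);
    Backend::params bprm;
    bprm.q = ctx;

    Solver solve(std::tie(n, ptr, col, val), Solver::params(), bprm);

    // Right-hand side and solution live on the compute device(s).
    vex::vector<double> f(ctx, rhs), x(ctx, n);
    x = 0.0;

    size_t iters;
    double error;
    std::tie(iters, error) = solve(f, x);

    std::cout << "iterations: " << iters << ", error: " << error << std::endl;
}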
@mattijsjanssens, I think that was quite some time ago; maybe it is time to reevaluate? I could help if you are interested :).
FYI: I referred to the thesis "Design and Optimization of OpenFOAM-based CFD Applications for Modern Hybrid and Heterogeneous HPC Platforms" by Ms. Amani AlOnazi, 2013