ginkgo
When can we expect the tutorials - I'd like to follow along with a CFD implementation
Hello,
I am interested in "Tutorial: Building a Poisson Solver" and I'd like to follow along with an implementation in a CFD code.
The following aspects would be of particular importance:
- What's the fastest way to import (deep copy, etc.) the linear system into Ginkgo? Assume I have a CSR matrix (columns, row_ptr, values; three C++ vectors) and two C++ vectors (x and rhs) which are imported at each timestep (x = 0 applies only to the first of up to 10,000 timesteps!); x and the final residual get copied back to the CFD application at the end of each timestep.
- To manage the overall simulation (coupled linear systems), I'll also need the following BLAS-like routines: ABS, reduction, axpby.
- I'd like to apply mixed precision iterative refinement with fp64/fp32 on the CPU, or better fp64 (CPU) / fp32 (GPU), to leverage modern engineering workstations.
- I'm missing an IC or AMG preconditioner, which are usually most effective, in combination with a CG solver, for the symmetric matrix of the pressure equation. I'd like to use both mixed precision CG (for symmetric matrices) and BiCGStab (for asymmetric matrices). GMRES could work for both, but it's usually "rough" and less efficient for this application: the simulation needs more time and the results are not as good, since the convergence of the coupled system is not as smooth.
klausbu
Hello @klausbu,
thank you for your interest in Ginkgo.
- The fastest way to import a linear system, or a matrix in general, into Ginkgo is with `Array<value_type>::view`. You can see how it works in the three-pt-stencil-solver example, line 192 onward. You basically create a matrix from a view of your existing vectors, avoiding any copies. However, you should make sure the data is on the same executor as your solver; otherwise, Ginkgo copies it before calling kernels. The actual data in the vectors will be updated after each kernel call (so also after each iteration) regardless of whether the executor of the view matches the solver's (if they match, no data is copied; if they mismatch, data is copied after each kernel call).
- We are on our way to implementing an ABS function. With reduction, do you mean a norm or a dot product? We do have `compute_norm2`, `compute_dot` and `add_scaled`.
- I don't quite understand what you mean by "fp64(CPU)/fp32(GPU)". Doing some operations on the CPU and others on the GPU is pretty much always slower than doing everything on the CPU as long as you are memory bound, because copying to the GPU has lower bandwidth than operating on the CPU. We do have an example for mixed precision IR here.
- An AMG preconditioner is on its way (see PR #528 as a first step). An IC does already exist, but it can only be used through the `Ilu` interface. You can create an ICT preconditioner factory with:

```cpp
auto fact = std::shared_ptr<gko::LinOpFactory>(
    gko::factorization::ParIct<ValueType>::build()
        .with_iterations(parilu_iterations)
        .with_approximate_select(parilut_approx_select)
        .with_fill_in_limit(parilut_limit)
        .on(exec));
auto precond_factory = gko::preconditioner::Ilu<ValueType>::build()
    .with_factorization_factory(fact)
    .on(exec);
```
If I missed something or you have more questions, feel free to ask.
About other matrix operations needed:
- ABS() means, in this context, converting negative vector elements into positive ones
- reduction() means adding up all vector elements
- I'll also need an SpMV multiplication
This is the entire calculation to scale the residual, which is used for the overall management of the convergence of a simulation:

For the matrix equation Ax = b, the residual is defined as res = b - Ax.

double x_average = reduction(x) / x_size
create a vector v_x_average with all values set to x_average
vector wA = A * x
vector pA = A * v_x_average
double normFactor = reduction(ABS(wA - pA) + ABS(b - pA)) + SMALL, where SMALL is a very small value to avoid division by zero
The scaled residual is then:
double scaled_residual = reduction(ABS(b - wA)) / normFactor
> ABS() means in this context to convert negative vector elements into positive ones
We are currently working on a `get_absolute()` implementation that computes the absolute value of each vector element. That covers both complex values (which will be converted to a real type) and real values, for which it will act as you described. But that is not yet merged.
> reduction() means adding up all vector elements
This type of reduction can currently be implemented with `compute_dot` by creating a vector of ones. Alternatively, you can also sum it up yourself (if the data is on the GPU, you need to copy it to the CPU first).
> I'll also need an SpMV multiplication
We do have SpMV multiplication for every matrix format we have (`b = Ax` is computed with `A->apply(x, b)`). That also works with dense matrices.
Currently, you would need to write the filling of the `x_average` vector, the `ABS()` and the `reduction()` functions yourself (which should be fairly easy, since you can access dense matrix values with `at(x, y)`). The `ABS()` function is currently being implemented, but the others will unfortunately take more time.
If you want, I can write an example-code that does what you want, and you can use that as a starting point.
In general, we are currently discussing how to improve our tutorials and wiki page, so newcomers have an easy-to-follow path for learning about the Ginkgo interface and capabilities.
@Thomas: Maybe just list the available matrix solvers, LA functions and work-in-progress (WIP) features somewhere. The main obstacle is usually providing/loading the raw data (matrix in COO or CSR format, vectors and process(or) boundaries) efficiently, so a practical example would be helpful. I assume here an MPI-based multi-GPU implementation.
For a CFD implementation (using Ginkgo as an external matrix solver library for a "deep" on-GPU implementation, i.e. flow + matrix solver), the following features would be great: MPI parallelism; PCG with a parallel on-GPU IC (multi-colored or tiled approach compatible with a Scotch domain decomposition); an AMG solver and preconditioner; and, for some industries, an Orthomin solver, all for pressure / symmetric matrices. For asymmetric matrices (velocity etc.), PBiCGStab, or better PBiCGStab(L) since the latter tends to converge even more smoothly, would be great (IDR, GMRES, PBiCG etc. are usually less stable and less smooth, so one can lose the last bit of lift and drag calculation accuracy).
@klausbu All currently usable solvers can be found here: https://ginkgo-project.github.io/ginkgo/doc/develop/group__solvers.html
We do not have an MPI implementation of Ginkgo; we focus on node-level parallelism (using the CPU and GPU efficiently). We do have preconditioned CG, but we don't have Incomplete Cholesky (from what I know, adding it would not be too difficult). An AMG solver is on its way (see #528), but we don't have an Orthomin solver. We also do have preconditioned BiCGStab (but not BiCGStab(L)).
For feature requests, I will create an Issue, so we can keep track of it.
Please add multi-GPU support to the feature request too.
In my experience, OpenMP/shared-memory implementations on the CPU are a lot slower than MPI/distributed-memory implementations, but you're the experts; high node-level performance with multiple GPUs is what I am looking for.
I just added that (issue #639). @tcojean is actually currently exploring this possibility. However, it is possible that it does not improve performance over a single GPU, because pretty much all of our operations are memory-bound. Copying data from one GPU to another has significantly less bandwidth than the memory bandwidth within one GPU, so getting good performance out of it is not trivial.
That's why distributed-memory implementations using MPI or GASPI are commonly used: a single GPU is just not powerful enough and doesn't offer sufficient memory to solve real-world problems, not to mention when using mixed precision iterative refinement.