Wrong behavior of MPI vexcl backend with OpenCL driver update

Open davidherreroperez opened this issue 5 years ago • 5 comments

Hi,

I am using several AMD Radeon VII GPUs with the MPI vexcl backend, compiled with Boost 1.72, and I have observed different results (number of iterations) with different OpenCL drivers:

With Radeon™ Software for Linux® 20.10 (Ubuntu 18.04), I always obtain the same results with the following execution:

$ mpirun -n 1 -x OMP_NUM_THREADS=1 ./mpi_amg_vexcl_cl
World size: 1
1. gfx906 (AMD Accelerated Parallel Processing)
2. gfx906 (AMD Accelerated Parallel Processing)
3. gfx906 (AMD Accelerated Parallel Processing)
4. gfx906 (AMD Accelerated Parallel Processing)

Type:             BiCGStab
Unknowns:         2097152
Memory footprint: 112.00 M

Number of levels:    4
Operator complexity: 1.62
Grid complexity:     1.13

level     unknowns       nonzeros
---------------------------------
    0      2097152       14581760 (61.61%) [1]
    1       263552        7918340 (33.46%) [1]
    2        16128        1114704 ( 4.71%) [1]
    3          789          53055 ( 0.22%) [1]

Iterations: 10
Error:      2.50965e-09

[Profile:        5.991 s] (100.00%)
[ self:          0.757 s] ( 12.63%)
[  assemble:     0.176 s] (  2.93%)
[  setup:        3.643 s] ( 60.81%)
[  solve:        1.416 s] ( 23.63%)

With Radeon™ Software for Linux® 20.20 (Ubuntu 18.04), the number of iterations changes randomly between runs:

$ mpirun -n 1 -x OMP_NUM_THREADS=1 ./mpi_amg_vexcl_cl 
World size: 1
1. gfx906 (AMD Accelerated Parallel Processing)
2. gfx906 (AMD Accelerated Parallel Processing)
3. gfx906 (AMD Accelerated Parallel Processing)
4. gfx906 (AMD Accelerated Parallel Processing)

Type:             BiCGStab
Unknowns:         2097152
Memory footprint: 112.00 M

Number of levels:    4
Operator complexity: 1.62
Grid complexity:     1.13

level     unknowns       nonzeros
---------------------------------
    0      2097152       14581760 (61.61%) [1]
    1       263552        7918340 (33.46%) [1]
    2        16128        1114704 ( 4.71%) [1]
    3          789          53055 ( 0.22%) [1]

Iterations: 21
Error:      4.22518e-09

[Profile:        6.120 s] (100.00%)
[ self:          0.746 s] ( 12.20%)
[  assemble:     0.183 s] (  2.99%)
[  setup:        3.639 s] ( 59.46%)
[  solve:        1.552 s] ( 25.35%)

$ mpirun -n 1 -x OMP_NUM_THREADS=1 ./mpi_amg_vexcl_cl 
World size: 1
1. gfx906 (AMD Accelerated Parallel Processing)
2. gfx906 (AMD Accelerated Parallel Processing)
3. gfx906 (AMD Accelerated Parallel Processing)
4. gfx906 (AMD Accelerated Parallel Processing)

Type:             BiCGStab
Unknowns:         2097152
Memory footprint: 112.00 M

Number of levels:    4
Operator complexity: 1.62
Grid complexity:     1.13

level     unknowns       nonzeros
---------------------------------
    0      2097152       14581760 (61.61%) [1]
    1       263552        7918340 (33.46%) [1]
    2        16128        1114704 ( 4.71%) [1]
    3          789          53055 ( 0.22%) [1]

Iterations: 39
Error:      9.88154e-09

[Profile:        6.567 s] (100.00%)
[ self:          0.739 s] ( 11.25%)
[  assemble:     0.184 s] (  2.80%)
[  setup:        3.695 s] ( 56.26%)
[  solve:        1.950 s] ( 29.69%)

...

Surprisingly, I have checked solver_vexcl_cl and mixed_precision_vexcl_cl and obtain the right results with both drivers. I also have to mention that I am getting MPI hangs with the MPI vexcl backend on large problems (after several hours of running many solves), but I have not been able to reproduce the problem systematically yet. This does not happen with the builtin backend implementation. I would appreciate any comments on this.

davidherreroperez avatar Jul 28 '20 22:07 davidherreroperez

Looks like a synchronization bug somewhere. Unfortunately, these are usually hard to pinpoint. A few things I would try:

  • You are using 4 GPUs with a single MPI process. What happens if you restrict each MPI process to a single GPU (by exporting OCL_MAX_DEVICES=1)?
  • This could be a reincarnation of an old AMD bug (https://community.amd.com/message/1295492) where the compute queue seemingly overflowed and one had to sprinkle the computations with clFinish() calls. You could check for this by adding
q.finish();

here (see the sketch after this list).

  • By the way, two lines above that spot there is another workaround for an AMD bug, described in ddemidov/vexcl#254. You can check if it works for you by defining VEXCL_AMD_SI_WORKAROUND.
  • If nothing above helps, I would start printing the values of the scalar variables (rho1, rho2, alpha, omega, res) in the BiCGStab solver in the hope of catching the exact moment the good and the bad solutions diverge. Once that happens, it should be easier to find the exact misbehaving operation.
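
For reference, a minimal standalone sketch of the clFinish() idea referred to in the list above (not the actual amgcl/vexcl backend code, just an illustration of draining every command queue after the kernels enqueued so far):

#include <iostream>
#include <vexcl/vexcl.hpp>

// Drain every command queue of the context so the enqueued kernels
// actually complete before the host continues.
void sync_all(vex::Context &ctx) {
    for(size_t d = 0; d < ctx.size(); ++d)
        ctx.queue(d).finish();
}

int main() {
    vex::Context ctx(vex::Filter::GPU);
    std::cout << ctx << std::endl;   // list the selected devices

    vex::vector<double> x(ctx, 1 << 20);
    x = 1.0;        // enqueues a fill kernel on every device
    sync_all(ctx);  // block until those kernels have finished
}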

I also have to mention that I am getting MPI hangs with the MPI vexcl backend on large problems (after several hours of running many solves), but I have not been able to reproduce the problem systematically yet.

This may indicate a bug in the vexcl backend. If you are able to provide a reproducible example, that would indeed be very helpful.

ddemidov avatar Jul 29 '20 06:07 ddemidov

Looks like a synchronization bug somewhere. Unfortunately, these are usually hard to pinpoint. A few things I would try:

  • This could be a reincarnation of an old AMD bug, where the compute queue seemingly overflowed and one had to sprinkle the computations with clFinish() calls. You could check for this by adding
q.finish();

here.

This solved the problem.

I also have to mention that I am getting MPI hangs with the MPI vexcl backend on large problems (after several hours of running many solves), but I have not been able to reproduce the problem systematically yet.

This may indicate a bug in the vexcl backend. If you are able to provide a reproducible example, that would indeed be very helpful.

I will check whether the MPI hangs still occur with the solution you mention above and will let you know. Many thanks for the solution.

davidherreroperez avatar Jul 29 '20 07:07 davidherreroperez

This solved the problem.

How did the runtimes change after this? The "solution" is not very efficient. Can you also enable the VEXCL_AMD_SI_WORKAROUND preprocessor macro and see if it works for you (without the above solution)? If that works as well, I wonder which is more efficient.
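
In case it helps with testing: since vexcl is header-only, the macro just needs to be defined before any vexcl header is included, either with -DVEXCL_AMD_SI_WORKAROUND on the compiler command line or at the top of the translation unit. A minimal sketch (nothing here beyond the macro name mentioned above is specific to the workaround):

// Enable the workaround for all vexcl code compiled in this translation unit.
#define VEXCL_AMD_SI_WORKAROUND
#include <vexcl/vexcl.hpp>

int main() {
    vex::Context ctx(vex::Filter::GPU);
    vex::vector<double> x(ctx, 1 << 20);
    x = 1.0; // vexcl operations below are compiled with the workaround enabled
}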

ddemidov avatar Jul 29 '20 08:07 ddemidov

This solved the problem.

How did the runtimes change after this? The "solution" is not very efficient.

The solve is slower (2.123 s compared to 1.416 s with the 20.10 driver and without q.finish()).

$ mpirun -n 1 -x OMP_NUM_THREADS=1 ./mpi_amg_vexcl_cl
World size: 1
1. gfx906 (AMD Accelerated Parallel Processing)
2. gfx906 (AMD Accelerated Parallel Processing)
3. gfx906 (AMD Accelerated Parallel Processing)
4. gfx906 (AMD Accelerated Parallel Processing)

Type:             BiCGStab
Unknowns:         2097152
Memory footprint: 112.00 M

Number of levels:    4
Operator complexity: 1.62
Grid complexity:     1.13

level     unknowns       nonzeros
---------------------------------
    0      2097152       14581760 (61.61%) [1]
    1       263552        7918340 (33.46%) [1]
    2        16128        1114704 ( 4.71%) [1]
    3          789          53055 ( 0.22%) [1]

Iterations: 10
Error:      2.50965e-09

[Profile:        6.766 s] (100.00%)
[ self:          0.796 s] ( 11.77%)
[  assemble:     0.186 s] (  2.75%)
[  setup:        3.661 s] ( 54.10%)
[  solve:        2.123 s] ( 31.37%)

Can you also enable the VEXCL_AMD_SI_WORKAROUND preprocessor macro and see if it works for you (without the above solution)? If that works as well, I wonder which is more efficient.

Enabling the VEXCL_AMD_SI_WORKAROUND macro does not solve the problem on my system. The number of iterations still changes randomly.

davidherreroperez avatar Jul 29 '20 08:07 davidherreroperez

I would check if adding something like

for(auto &q : x.queue_list()) q.finish();

at the end of the loop here would still work but be more efficient. This is a dirty hack that would only compile for the vexcl backend, but it may help you until the problem is solved. A more generic solution would be to add such calls at the end of some of the amgcl primitive implementations in the vexcl backend code, so that clFinish() would be called from time to time rather than after every vexcl kernel. For example, here.
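
For illustration, here is a self-contained sketch of such a loop with a trivial stand-in for the per-iteration work (this is not the amgcl BiCGStab code, only an outline of where the per-iteration finish() calls would go):

#include <vexcl/vexcl.hpp>

int main() {
    vex::Context ctx(vex::Filter::GPU);
    vex::vector<double> x(ctx, 1 << 20);
    x = 0.0;

    for(int iter = 0; iter < 10; ++iter) {
        // placeholder for the kernels a real BiCGStab iteration enqueues:
        x += 1.0;

        // synchronize once per iteration instead of after every kernel:
        for(auto &q : x.queue_list()) q.finish();
    }
}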

ddemidov avatar Jul 29 '20 08:07 ddemidov