Fix crashes and load imbalance for multi-GPU
I ran the H2O-64.inp benchmark using 4 MPI ranks, 1 OpenMP thread per rank, and 4 P100 GPUs on a dual-socket EPYC node.
Full timing report:

```
-------------------------------------------------------------------------------
-                                                                             -
-                                T I M I N G                                  -
-                                                                             -
-------------------------------------------------------------------------------
SUBROUTINE                       CALLS  ASD         SELF TIME        TOTAL TIME
                               MAXIMUM       AVERAGE  MAXIMUM  AVERAGE  MAXIMUM
CP2K 1 1.0 0.011 0.015 165.420 165.421
qs_mol_dyn_low 1 2.0 0.003 0.003 165.238 165.242
qs_forces 11 3.9 0.001 0.001 165.200 165.201
qs_energies 11 4.9 0.001 0.001 154.107 156.315
scf_env_do_scf 11 5.9 0.001 0.001 123.157 123.159
scf_env_do_scf_inner_loop 108 6.5 0.004 0.009 115.704 115.704
dbcsr_multiply_generic 2286 12.5 0.088 0.096 72.655 102.077
qs_scf_new_mos 108 7.5 0.001 0.001 50.200 82.821
qs_scf_loop_do_ot 108 8.5 0.001 0.001 50.199 82.820
multiply_cannon 2286 13.5 0.166 0.172 52.603 78.922
multiply_cannon_loop 2286 14.5 0.110 0.116 51.071 78.250
multiply_cannon_multrec 4572 15.5 1.934 2.069 40.776 68.453
velocity_verlet 10 3.0 0.001 0.001 65.750 65.757
mp_waitall_1 76074 16.8 33.594 55.828 33.594 55.828
qs_rho_update_rho 119 7.7 0.001 0.001 33.241 53.626
calculate_rho_elec 119 8.7 0.392 0.427 33.241 53.625
density_rs2pw 119 9.7 0.005 0.005 30.334 50.881
rs_pw_transfer 974 11.9 0.012 0.013 25.881 46.537
multiply_cannon_multrec_finali 2286 16.5 0.003 0.004 19.203 45.887
dbcsr_mm_multrec_finalize 2286 17.5 0.041 0.045 19.200 45.884
dbcsr_mm_sched_finalize 2286 18.5 19.155 45.835 19.158 45.839
calculate_dm_sparse 119 9.5 0.001 0.001 14.939 45.494
rs_pw_transfer_RS2PW_140 130 11.5 0.806 0.968 22.705 43.336
mp_bcast_b 1787 13.7 10.363 40.997 10.363 40.997
external_control 118 7.2 0.000 0.001 10.357 40.982
ot_scf_mini 108 9.5 0.002 0.003 36.672 38.747
rebuild_ks_matrix 119 8.3 0.001 0.001 27.630 31.251
qs_ks_build_kohn_sham_matrix 119 9.3 0.018 0.020 27.629 31.251
qs_ks_update_qs_env 119 7.6 0.001 0.002 25.257 28.887
ot_mini 108 10.5 0.001 0.001 26.234 28.306
init_scf_run 11 5.9 0.000 0.001 24.028 24.028
scf_env_initial_rho_setup 11 6.9 0.000 0.001 24.027 24.028
qs_ot_get_derivative 108 11.5 0.001 0.001 19.232 21.317
cp_dbcsr_sm_fm_multiply 37 9.5 0.001 0.001 19.641 19.646
cp_dbcsr_sm_fm_multiply_core 37 10.5 0.000 0.000 9.190 19.402
dbcsr_mm_hostdrv_process 7970 16.0 17.677 18.471 17.677 18.471
calculate_first_density_matrix 1 7.0 0.000 0.000 18.370 18.378
mp_sum_l 7127 13.1 12.874 15.110 12.874 15.110
mp_alltoall_i22 627 13.8 10.914 15.091 10.914 15.091
copy_dbcsr_to_fm 153 11.3 0.002 0.003 11.012 14.925
pw_transfer 1439 11.6 0.085 0.087 14.678 14.776
fft_wrap_pw1pw2 1201 12.6 0.009 0.010 14.410 14.468
dbcsr_desymmetrize_deep 153 12.3 0.056 0.064 10.399 14.294
fft_wrap_pw1pw2_140 487 13.2 0.627 0.662 12.874 13.028
sum_up_and_integrate 119 10.3 0.059 0.061 9.628 10.557
integrate_v_rspace 119 11.3 0.249 0.282 9.569 10.498
fft3d_ps 1201 14.6 6.610 6.727 10.009 10.115
qs_ot_get_derivative_taylor 59 13.0 0.001 0.001 6.975 8.543
qs_ot_get_derivative_diag 49 12.0 0.001 0.001 7.140 7.481
init_scf_loop 11 6.9 0.000 0.000 7.422 7.422
apply_preconditioner_dbcsr 119 12.6 0.000 0.000 7.255 7.349
apply_single 119 13.6 0.000 0.000 7.255 7.349
make_m2s 4572 13.5 0.032 0.034 6.350 7.222
make_images 4572 14.5 0.464 0.468 6.247 7.117
ot_diis_step 108 11.5 0.004 0.004 6.980 6.980
multiply_cannon_metrocomm1 4572 15.5 0.008 0.010 3.795 6.336
potential_pw2rs 119 12.3 0.009 0.010 6.290 6.302
build_core_hamiltonian_matrix_ 11 4.9 0.001 0.001 6.019 6.238
multiply_cannon_metrocomm3 4572 15.5 0.007 0.008 3.040 6.068
wfi_extrapolate 11 7.9 0.001 0.001 5.488 5.488
qs_energies_init_hamiltonians 11 5.9 0.000 0.000 5.023 5.023
make_images_data 4572 15.5 0.031 0.036 3.809 4.797
cp_dbcsr_plus_fm_fm_t_native 22 8.9 0.000 0.001 2.535 4.749
hybrid_alltoall_any 4725 16.4 0.725 1.956 3.609 4.533
prepare_preconditioner 11 7.9 0.000 0.000 4.449 4.454
make_preconditioner 11 8.9 0.000 0.000 4.449 4.454
acc_transpose_blocks 4572 15.5 0.014 0.015 3.324 4.436
acc_transpose_blocks_kernels 4572 16.5 0.041 0.045 3.248 4.356
jit_kernel_transpose 5 15.6 3.207 4.311 3.207 4.311
mp_sum_d 4117 12.0 3.095 4.281 3.095 4.281
make_full_inverse_cholesky 11 9.9 0.000 0.000 3.594 4.144
qs_energies_compute_matrix_w 11 5.9 0.000 0.000 1.845 4.050
calculate_w_matrix_ot 11 6.9 0.001 0.001 1.845 4.050
grid_integrate_task_list 119 12.3 3.029 3.960 3.029 3.960
build_core_ppl_forces 11 5.9 3.476 3.679 3.476 3.679
calculate_ecore_overlap 22 5.9 0.001 0.001 2.475 3.371
-------------------------------------------------------------------------------
```
What is striking is that:

- `dbcsr_multiply_generic` consumes over 60% of the time, while it's usually only around 25%.
- `dbcsr_mm_sched_finalize` and `mp_waitall_1` experience significant load imbalance.

So, we might be suffering from a bad assignment of GPUs to MPI ranks.
@alazzaro started to look into it, but we have not yet switched to his new `mp_get_node_global_rank()` routine.
I found another (if not bigger) problem. I tested on a dual-socket system with one V100 per socket, i.e., multi-GPU. I got the H2O-64 test case to crash whenever I did not use `multirun.sh`. I thought CP2K and DBCSR could now handle multiple GPUs correctly, i.e., without any need to manipulate `CUDA_VISIBLE_DEVICES`? My experiments below are based on CP2K/master and DBCSR/develop.
I tried the following with two GPUs (`multirun.sh 2 ...`), which works:

```
mpirun -host localhost -genvall -np 4 \
  -genv I_MPI_PIN_DOMAIN auto \
  -genv I_MPI_PIN_ORDER bunch \
  -genv OMP_PROC_BIND TRUE \
  /path/to/multirun.sh 2 /path/to/cp2k.psmp H2O-64.inp
```
I also tried the following with two GPUs (no `multirun.sh`), which does not work:

```
mpirun -host localhost -genvall -np 4 \
  -genv I_MPI_PIN_DOMAIN auto \
  -genv I_MPI_PIN_ORDER bunch \
  -genv OMP_PROC_BIND TRUE \
  /path/to/cp2k.psmp H2O-64.inp
```
Even the case with two ranks (and two GPUs) crashes:

```
mpirun -host localhost -genvall -np 2 \
  -genv I_MPI_PIN_DOMAIN auto \
  -genv I_MPI_PIN_ORDER bunch \
  -genv OMP_PROC_BIND TRUE \
  /path/to/cp2k.psmp H2O-64.inp
```
Specifically, I get a bunch of `CUDA error: invalid argument` errors, and subsequently `acc_devmem_setzero` usually crashes (in `multiply_cannon`).
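For debugging, here is a small diagnostic sketch (plain CUDA C, not DBCSR code; `checked_setzero` is a hypothetical helper). It checks, right before the kind of memset that `acc_devmem_setzero` performs, whether the pointer's owning device matches the thread's currently active device; if the crash really is a device-assignment problem, such a mismatch is the sort of thing that would surface as "invalid argument".

```c
#include <cuda_runtime.h>
#include <stdio.h>

/* Hypothetical helper, not part of DBCSR: zero a device buffer, but first
 * report whether the buffer's owning device differs from the active one. */
static void checked_setzero(void *devptr, size_t nbytes) {
  struct cudaPointerAttributes attr;
  int cur = -1;

  cudaGetDevice(&cur);
  if (cudaPointerGetAttributes(&attr, devptr) == cudaSuccess && attr.device != cur) {
    fprintf(stderr, "setzero: pointer lives on device %d, but device %d is active\n",
            attr.device, cur);
  }

  cudaError_t err = cudaMemset(devptr, 0, nbytes);
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaMemset failed: %s\n", cudaGetErrorString(err));
  }
}

int main(void) {
  void *buf = NULL;
  if (cudaMalloc(&buf, 1024) == cudaSuccess) {
    checked_setzero(buf, 1024); /* with a single device this simply zeroes the buffer */
    cudaFree(buf);
  }
  return 0;
}
```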
> we have not yet switched to his new `mp_get_node_global_rank()` routine.
I believe this work must be about the node-local rank number. Mapping by the global rank, i.e., `GLOBAL_RANK_ID % NDEVICES`, is what CP2K currently does, right?
(As an addendum, the above-mentioned `multirun.sh` also uses just the global rank number, i.e., it has no awareness of the socket/PCIe slot.)
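To make the distinction concrete, here is a minimal sketch (plain MPI + CUDA C, not CP2K code; I have not checked what `mp_get_node_global_rank()` actually does) of deriving a node-local rank with MPI-3 and using it for the GPU assignment instead of `GLOBAL_RANK_ID % NDEVICES`:

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  /* Ranks that share a node end up in the same communicator. */
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank,
                      MPI_INFO_NULL, &node_comm);

  int local_rank;
  MPI_Comm_rank(node_comm, &local_rank);

  int ndev = 0;
  cudaGetDeviceCount(&ndev);

  /* Node-local mapping: with a pinning order that keeps consecutive local
   * ranks on the same socket, each rank lands on the GPU attached to its
   * socket; world_rank % ndev gives no such guarantee across nodes. */
  int mydev = (ndev > 0) ? local_rank % ndev : 0;
  cudaSetDevice(mydev);

  printf("world rank %d -> node-local rank %d -> GPU %d of %d\n",
         world_rank, local_rank, mydev, ndev);

  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}
```

On a single node the two schemes coincide; with round-robin rank placement across nodes, however, the global-rank scheme can pile several ranks onto one GPU while others stay idle.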
> I got the H2O-64 test case to crash....

I can confirm this. On my dual-GPU system, CP2K crashes when run with more than one MPI rank and more than two threads.
The problem reaches back to at least 6d12638defcb1e6d69d826d4015702f810dba719, which makes me wonder why nobody has noticed this before.
Running `tests/dbcsr_perf` from DBCSR itself with multiple ranks/threads does not seem to reproduce the "multi-GPU crashes". Therefore, CP2K seems to "drive" DBCSR differently than DBCSR's own reproducer/test does. Further, I cannot see significant imbalance when running `tests/dbcsr_perf` on multiple GPUs.
I just wanted to note that some of the "CUDA quirks" are about the thread-local active device. A number of CUDA functions implicitly refer to an "active device", similar to a global variable (but at least thread-local). If CP2K activates a device as per the latest policy for handling (multiple) GPUs, it has to do so for every thread.
I also cannot tell what happens if the active device changes on a per-thread basis due to a different policy/scheme.
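To illustrate the per-thread behaviour, here is a standalone CUDA/OpenMP sketch (not the actual CP2K/DBCSR device-activation code): the device selected on the master thread is not inherited by the OpenMP workers, so every thread has to call `cudaSetDevice` itself.

```c
#include <cuda_runtime.h>
#include <omp.h>
#include <stdio.h>

int main(void) {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);

  int wanted = (ndev > 1) ? 1 : 0; /* pretend this rank was assigned GPU 1 */
  cudaSetDevice(wanted);           /* affects only the calling (master) thread */

  #pragma omp parallel
  {
    int dev_before = -1, dev_after = -1;

    /* A worker thread that has not called cudaSetDevice is still on device 0. */
    cudaGetDevice(&dev_before);

    /* The per-thread activation the note above refers to. */
    cudaSetDevice(wanted);
    cudaGetDevice(&dev_after);

    printf("thread %d: active device before %d, after %d\n",
           omp_get_thread_num(), dev_before, dev_after);
  }
  return 0;
}
```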