Fix crashes and load imbalance for multi-GPU
I ran the H2O-64.inp benchmark using 4 MPI ranks, 1 OpenMP thread per rank, and 4 P100 GPUs on a dual-socket EPYC node.
Full timing report:

```
-------------------------------------------------------------------------------
-                                                                             -
-                                T I M I N G                                  -
-                                                                             -
-------------------------------------------------------------------------------
SUBROUTINE                       CALLS  ASD         SELF TIME        TOTAL TIME
                               MAXIMUM       AVERAGE  MAXIMUM  AVERAGE  MAXIMUM
CP2K 1 1.0 0.011 0.015 165.420 165.421
qs_mol_dyn_low 1 2.0 0.003 0.003 165.238 165.242
qs_forces 11 3.9 0.001 0.001 165.200 165.201
qs_energies 11 4.9 0.001 0.001 154.107 156.315
scf_env_do_scf 11 5.9 0.001 0.001 123.157 123.159
scf_env_do_scf_inner_loop 108 6.5 0.004 0.009 115.704 115.704
dbcsr_multiply_generic 2286 12.5 0.088 0.096 72.655 102.077
qs_scf_new_mos 108 7.5 0.001 0.001 50.200 82.821
qs_scf_loop_do_ot 108 8.5 0.001 0.001 50.199 82.820
multiply_cannon 2286 13.5 0.166 0.172 52.603 78.922
multiply_cannon_loop 2286 14.5 0.110 0.116 51.071 78.250
multiply_cannon_multrec 4572 15.5 1.934 2.069 40.776 68.453
velocity_verlet 10 3.0 0.001 0.001 65.750 65.757
mp_waitall_1 76074 16.8 33.594 55.828 33.594 55.828
qs_rho_update_rho 119 7.7 0.001 0.001 33.241 53.626
calculate_rho_elec 119 8.7 0.392 0.427 33.241 53.625
density_rs2pw 119 9.7 0.005 0.005 30.334 50.881
rs_pw_transfer 974 11.9 0.012 0.013 25.881 46.537
multiply_cannon_multrec_finali 2286 16.5 0.003 0.004 19.203 45.887
dbcsr_mm_multrec_finalize 2286 17.5 0.041 0.045 19.200 45.884
dbcsr_mm_sched_finalize 2286 18.5 19.155 45.835 19.158 45.839
calculate_dm_sparse 119 9.5 0.001 0.001 14.939 45.494
rs_pw_transfer_RS2PW_140 130 11.5 0.806 0.968 22.705 43.336
mp_bcast_b 1787 13.7 10.363 40.997 10.363 40.997
external_control 118 7.2 0.000 0.001 10.357 40.982
ot_scf_mini 108 9.5 0.002 0.003 36.672 38.747
rebuild_ks_matrix 119 8.3 0.001 0.001 27.630 31.251
qs_ks_build_kohn_sham_matrix 119 9.3 0.018 0.020 27.629 31.251
qs_ks_update_qs_env 119 7.6 0.001 0.002 25.257 28.887
ot_mini 108 10.5 0.001 0.001 26.234 28.306
init_scf_run 11 5.9 0.000 0.001 24.028 24.028
scf_env_initial_rho_setup 11 6.9 0.000 0.001 24.027 24.028
qs_ot_get_derivative 108 11.5 0.001 0.001 19.232 21.317
cp_dbcsr_sm_fm_multiply 37 9.5 0.001 0.001 19.641 19.646
cp_dbcsr_sm_fm_multiply_core 37 10.5 0.000 0.000 9.190 19.402
dbcsr_mm_hostdrv_process 7970 16.0 17.677 18.471 17.677 18.471
calculate_first_density_matrix 1 7.0 0.000 0.000 18.370 18.378
mp_sum_l 7127 13.1 12.874 15.110 12.874 15.110
mp_alltoall_i22 627 13.8 10.914 15.091 10.914 15.091
copy_dbcsr_to_fm 153 11.3 0.002 0.003 11.012 14.925
pw_transfer 1439 11.6 0.085 0.087 14.678 14.776
fft_wrap_pw1pw2 1201 12.6 0.009 0.010 14.410 14.468
dbcsr_desymmetrize_deep 153 12.3 0.056 0.064 10.399 14.294
fft_wrap_pw1pw2_140 487 13.2 0.627 0.662 12.874 13.028
sum_up_and_integrate 119 10.3 0.059 0.061 9.628 10.557
integrate_v_rspace 119 11.3 0.249 0.282 9.569 10.498
fft3d_ps 1201 14.6 6.610 6.727 10.009 10.115
qs_ot_get_derivative_taylor 59 13.0 0.001 0.001 6.975 8.543
qs_ot_get_derivative_diag 49 12.0 0.001 0.001 7.140 7.481
init_scf_loop 11 6.9 0.000 0.000 7.422 7.422
apply_preconditioner_dbcsr 119 12.6 0.000 0.000 7.255 7.349
apply_single 119 13.6 0.000 0.000 7.255 7.349
make_m2s 4572 13.5 0.032 0.034 6.350 7.222
make_images 4572 14.5 0.464 0.468 6.247 7.117
ot_diis_step 108 11.5 0.004 0.004 6.980 6.980
multiply_cannon_metrocomm1 4572 15.5 0.008 0.010 3.795 6.336
potential_pw2rs 119 12.3 0.009 0.010 6.290 6.302
build_core_hamiltonian_matrix_ 11 4.9 0.001 0.001 6.019 6.238
multiply_cannon_metrocomm3 4572 15.5 0.007 0.008 3.040 6.068
wfi_extrapolate 11 7.9 0.001 0.001 5.488 5.488
qs_energies_init_hamiltonians 11 5.9 0.000 0.000 5.023 5.023
make_images_data 4572 15.5 0.031 0.036 3.809 4.797
cp_dbcsr_plus_fm_fm_t_native 22 8.9 0.000 0.001 2.535 4.749
hybrid_alltoall_any 4725 16.4 0.725 1.956 3.609 4.533
prepare_preconditioner 11 7.9 0.000 0.000 4.449 4.454
make_preconditioner 11 8.9 0.000 0.000 4.449 4.454
acc_transpose_blocks 4572 15.5 0.014 0.015 3.324 4.436
acc_transpose_blocks_kernels 4572 16.5 0.041 0.045 3.248 4.356
jit_kernel_transpose 5 15.6 3.207 4.311 3.207 4.311
mp_sum_d 4117 12.0 3.095 4.281 3.095 4.281
make_full_inverse_cholesky 11 9.9 0.000 0.000 3.594 4.144
qs_energies_compute_matrix_w 11 5.9 0.000 0.000 1.845 4.050
calculate_w_matrix_ot 11 6.9 0.001 0.001 1.845 4.050
grid_integrate_task_list 119 12.3 3.029 3.960 3.029 3.960
build_core_ppl_forces 11 5.9 3.476 3.679 3.476 3.679
calculate_ecore_overlap 22 5.9 0.001 0.001 2.475 3.371
-------------------------------------------------------------------------------
```
What is striking is that:

- `dbcsr_multiply_generic` consumes over 60% of the time, while it's usually only around 25%.
- `dbcsr_mm_sched_finalize` and `mp_waitall_1` experience significant load imbalance.

So, we might be suffering from a bad assignment of GPUs to MPI ranks.
@alazzaro started to look into it, but we have not yet switched to his new `mp_get_node_global_rank()` routine.
I found another (if not bigger) problem. I tested on a dual-socket system with one V100 per socket, i.e., multi-GPU. I got the H2O-64 test case to crash whenever I did not use `multirun.sh`. I thought CP2K and DBCSR could now handle multiple GPUs correctly, i.e., without any need to manipulate `CUDA_VISIBLE_DEVICES`? My experiments below are based on CP2K/master and DBCSR/develop.
I tried the following with two GPUs (`multirun.sh 2 ...`), which works:

```
mpirun -host localhost -genvall -np 4 \
  -genv I_MPI_PIN_DOMAIN auto \
  -genv I_MPI_PIN_ORDER bunch \
  -genv OMP_PROC_BIND TRUE \
  /path/to/multirun.sh 2 /path/to/cp2k.psmp H2O-64.inp
```
I also tried the following with two GPUs (no `multirun.sh`), which does not work:

```
mpirun -host localhost -genvall -np 4 \
  -genv I_MPI_PIN_DOMAIN auto \
  -genv I_MPI_PIN_ORDER bunch \
  -genv OMP_PROC_BIND TRUE \
  /path/to/cp2k.psmp H2O-64.inp
```
Even the case with two ranks (and two GPUs) crashes:

```
mpirun -host localhost -genvall -np 2 \
  -genv I_MPI_PIN_DOMAIN auto \
  -genv I_MPI_PIN_ORDER bunch \
  -genv OMP_PROC_BIND TRUE \
  /path/to/cp2k.psmp H2O-64.inp
```
Specifically, I get a bunch of `CUDA error: invalid argument` errors, and subsequently `acc_devmem_setzero` usually crashes (in `multiply_cannon`).
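For debugging, here is a small diagnostic sketch (plain CUDA C, not DBCSR code; `checked_setzero` is a hypothetical helper). It checks, right before the kind of memset that `acc_devmem_setzero` performs, whether the pointer's owning device matches the thread's currently active device; if the crash really is a device-assignment problem, such a mismatch is the sort of thing that would surface as "invalid argument".

```c
#include <cuda_runtime.h>
#include <stdio.h>

/* Hypothetical helper, not part of DBCSR: zero a device buffer, but first
 * report whether the buffer's owning device differs from the active one. */
static void checked_setzero(void *devptr, size_t nbytes) {
  struct cudaPointerAttributes attr;
  int cur = -1;

  cudaGetDevice(&cur);
  if (cudaPointerGetAttributes(&attr, devptr) == cudaSuccess && attr.device != cur) {
    fprintf(stderr, "setzero: pointer lives on device %d, but device %d is active\n",
            attr.device, cur);
  }

  cudaError_t err = cudaMemset(devptr, 0, nbytes);
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaMemset failed: %s\n", cudaGetErrorString(err));
  }
}

int main(void) {
  void *buf = NULL;
  if (cudaMalloc(&buf, 1024) == cudaSuccess) {
    checked_setzero(buf, 1024); /* with a single device this simply zeroes the buffer */
    cudaFree(buf);
  }
  return 0;
}
```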
> we have not yet switched to his new `mp_get_node_global_rank()` routine.
I believe this work must be about the node-local rank number. Mapping by the global rank, i.e., `GLOBAL_RANK_ID % NDEVICES`, is what CP2K currently does, right?
(As an addendum, the above-mentioned `multirun.sh` also uses just the global rank number, i.e., it has no awareness of the socket/PCIe slot.)
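To make the distinction concrete, here is a minimal sketch (plain MPI + CUDA C, not CP2K code; I have not checked what `mp_get_node_global_rank()` actually does) of deriving a node-local rank with MPI-3 and using it for the GPU assignment instead of `GLOBAL_RANK_ID % NDEVICES`:

```c
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

  /* Ranks that share a node end up in the same communicator. */
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank,
                      MPI_INFO_NULL, &node_comm);

  int local_rank;
  MPI_Comm_rank(node_comm, &local_rank);

  int ndev = 0;
  cudaGetDeviceCount(&ndev);

  /* Node-local mapping: with a pinning order that keeps consecutive local
   * ranks on the same socket, each rank lands on the GPU attached to its
   * socket; world_rank % ndev gives no such guarantee across nodes. */
  int mydev = (ndev > 0) ? local_rank % ndev : 0;
  cudaSetDevice(mydev);

  printf("world rank %d -> node-local rank %d -> GPU %d of %d\n",
         world_rank, local_rank, mydev, ndev);

  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}
```

On a single node the two schemes coincide; with round-robin rank placement across nodes, however, the global-rank scheme can pile several ranks onto one GPU while others stay idle.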
> I got the H2O-64 test case to crash....

I can confirm this. On my dual-GPU system, CP2K crashes when run with more than one MPI rank and more than two threads.
The problem reaches back to at least 6d12638defcb1e6d69d826d4015702f810dba719, which makes me wonder why nobody has noticed this before.
Running `tests/dbcsr_perf` from DBCSR itself with multiple ranks/threads does not seem to reproduce the "multi-GPU crashes". Therefore, CP2K seems to "drive" DBCSR differently than DBCSR's own reproducer/test does. Further, I cannot see significant imbalance when running `tests/dbcsr_perf` on multiple GPUs.
I just wanted to note that some of the "CUDA quirks" are about the thread-local active device. A number of CUDA functions implicitly refer to an "active device", similar to a global variable (but at least thread-local). If CP2K activates a device as per the latest policy for handling (multiple) GPUs, it has to do so for every thread.
I also cannot tell what happens if the active device changes on a per-thread basis due to a different policy/scheme.
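To illustrate the per-thread behaviour, here is a standalone CUDA/OpenMP sketch (not the actual CP2K/DBCSR device-activation code): the device selected on the master thread is not inherited by the OpenMP workers, so every thread has to call `cudaSetDevice` itself.

```c
#include <cuda_runtime.h>
#include <omp.h>
#include <stdio.h>

int main(void) {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);

  int wanted = (ndev > 1) ? 1 : 0; /* pretend this rank was assigned GPU 1 */
  cudaSetDevice(wanted);           /* affects only the calling (master) thread */

  #pragma omp parallel
  {
    int dev_before = -1, dev_after = -1;

    /* A worker thread that has not called cudaSetDevice is still on device 0. */
    cudaGetDevice(&dev_before);

    /* The per-thread activation the note above refers to. */
    cudaSetDevice(wanted);
    cudaGetDevice(&dev_after);

    printf("thread %d: active device before %d, after %d\n",
           omp_get_thread_num(), dev_before, dev_after);
  }
  return 0;
}
```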