Explicitly reset the active GPU device
Currently, DBCSR assumes that the active GPU device never changes after dbcsr_init_lib() has been called. With the arrival of more GPU-accelerated libraries in CP2K, this assumption will likely break. For example, running with two K80 devices led to several failures, which are probably related to interference between SIRIUS and DBCSR.
That's a good idea... I will include it in v2.0
Back to this issue: currently we reset the GPU before each multiplication, see
https://github.com/cp2k/dbcsr/blob/f35f901e4460980aa06757294463a1e6308f8dc9/src/mm/dbcsr_mm.F#L429
Should we clear the errors before every GPU call?
No, it should be enough to call dbcsr_acc_clear_errors() only once when DBCSR gains control. At the same time, you should also call dbcsr_acc_set_active_device(). Basically, don't assume that the GPU's state is preserved across different DBCSR calls.
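To illustrate the pattern, here is a minimal sketch in Python. It is not DBCSR code: FakeGpuRuntime and GpuLibrary are hypothetical stand-ins, and _gain_control() only plays the role that dbcsr_acc_set_active_device() and dbcsr_acc_clear_errors() would play at a public entry point. The state is re-established once per entry, not before every GPU call.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeGpuRuntime:
    """Hypothetical stand-in for the process-wide GPU state shared by all libraries."""
    active_device: int = 0
    pending_error: Optional[str] = None


class GpuLibrary:
    """Hypothetical library that owns one device and never trusts inherited state."""

    def __init__(self, runtime: FakeGpuRuntime, device_id: int):
        self.runtime = runtime
        self.device_id = device_id
        self.resets = 0  # counts how often the state was re-established

    def _gain_control(self) -> None:
        # Plays the role of dbcsr_acc_set_active_device() ...
        self.runtime.active_device = self.device_id
        # ... and of dbcsr_acc_clear_errors(): drop errors left by other libraries.
        self.runtime.pending_error = None
        self.resets += 1

    def _launch(self) -> int:
        # Individual GPU calls do not reset anything; they rely on _gain_control().
        if self.runtime.pending_error is not None:
            raise RuntimeError(self.runtime.pending_error)
        return self.runtime.active_device

    def multiply(self, n_launches: int = 3) -> list:
        self._gain_control()  # once per public entry point
        return [self._launch() for _ in range(n_launches)]


runtime = FakeGpuRuntime()
lib = GpuLibrary(runtime, device_id=1)
runtime.active_device = 0        # another library switched devices
runtime.pending_error = "stale"  # and left an error behind
devices = lib.multiply()
```

Even though another library changed the device and left an error behind, all three launches run on device 1, and the state was reset only once for the whole call.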
Currently, dbcsr_acc_set_active_device() is a public method, which is called by CP2K. However, the trend, e.g. with SIRIUS, seems to be that libraries decide on their own how to assign GPUs to MPI ranks. I think this is indeed the right way to go.
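As a rough sketch of what "deciding on its own" could look like: one common convention is to map the node-local MPI rank onto the visible devices round-robin. This is an assumption for illustration only, not a statement of what SIRIUS or DBCSR actually do, and pick_device is a hypothetical helper.

```python
def pick_device(node_local_rank: int, num_devices: int) -> int:
    """Round-robin mapping of a node-local MPI rank to a GPU index (hypothetical helper)."""
    if num_devices < 1:
        raise ValueError("at least one GPU device is required")
    if node_local_rank < 0:
        raise ValueError("node-local rank must be non-negative")
    # Ranks beyond the device count wrap around and share a device.
    return node_local_rank % num_devices


# Four ranks on a node with two devices (e.g. the two K80 devices mentioned above).
assignment = [pick_device(rank, 2) for rank in range(4)]
```

With two devices and four ranks per node, this gives the assignment [0, 1, 0, 1]. If every library applies the same rule independently, they agree on the device without the application having to call a public setter.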