dbcsr Explicitly reset the active GPU device

Open oschuett opened this issue 5 years ago • 3 comments

Currently, DBCSR assumes that the active GPU device never changes after dbcsr_init_lib() has been called. With the arrival of more GPU-accelerated libraries in CP2K, this assumption will likely break. For example, running with two K80 devices led to several failures, which are probably related to interference between SIRIUS and DBCSR.

Jul 08 '19 17:07 oschuett

That's a good idea... I will include it in v2.0

Jul 08 '19 18:07 alazzaro

Back to this issue: currently we reset the GPU before each multiplication, see

https://github.com/cp2k/dbcsr/blob/f35f901e4460980aa06757294463a1e6308f8dc9/src/mm/dbcsr_mm.F#L429

Should we clear errors before every GPU call?

Nov 20 '19 17:11 alazzaro

No, it should be enough to call dbcsr_acc_clear_errors() only once when DBCSR gains control. At the same time you should also call dbcsr_acc_set_active_device(). Basically, don't assume that the GPU's state is preserved across different DBCSR calls.
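To illustrate the pattern, here is a minimal sketch in Python using a mock of the process-wide GPU state. Everything in it (`MockGpuRuntime`, `MockLibrary`, `gains_control`) is hypothetical and only stands in for the real runtime and for calls such as dbcsr_acc_clear_errors() and dbcsr_acc_set_active_device(); it is not DBCSR code.

```python
import functools


class MockGpuRuntime:
    """Stand-in for process-wide GPU state (hypothetical, not a real API)."""

    def __init__(self):
        self.active_device = 0
        self.pending_error = None

    def set_device(self, device_id):
        self.active_device = device_id

    def clear_errors(self):
        self.pending_error = None


def gains_control(method):
    """Re-establish the library's GPU state once per public entry point."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        # Do not trust whatever state the previous library left behind.
        self.runtime.clear_errors()
        self.runtime.set_device(self.device_id)
        return method(self, *args, **kwargs)

    return wrapper


class MockLibrary:
    """Hypothetical library that owns one device of the shared runtime."""

    def __init__(self, runtime, device_id):
        self.runtime = runtime
        self.device_id = device_id

    @gains_control
    def multiply(self):
        # Report which device the work would run on.
        return self.runtime.active_device


runtime = MockGpuRuntime()
lib = MockLibrary(runtime, device_id=1)

# Another library changes the shared state between two calls.
runtime.set_device(0)
runtime.pending_error = "stale error from another library"

device_used = lib.multiply()
```

The reset happens once at the entry point rather than before every individual GPU call, which matches the "only once when DBCSR gains control" suggestion.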

Currently, dbcsr_acc_set_active_device() is a public method, which is called by CP2K. However, the trend, e.g. with SIRIUS, seems to be that the libraries decide on their own how to assign GPUs to MPI ranks. I think this is indeed the right way to go.
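As a rough sketch of what "the library decides on its own" could look like, the snippet below maps a node-local MPI rank to a device round-robin. This is only one common convention, assumed here for illustration; the function name `pick_device` and its parameters are hypothetical, and neither SIRIUS nor DBCSR is claimed to do exactly this.

```python
def pick_device(local_rank, num_devices):
    """Map a node-local MPI rank to a GPU index, round-robin (assumed policy)."""
    if num_devices <= 0:
        raise ValueError("no GPU devices available on this node")
    if local_rank < 0:
        raise ValueError("local_rank must be non-negative")
    return local_rank % num_devices


# Four ranks sharing a node with two devices.
assignment = [pick_device(rank, 2) for rank in range(4)]
```

With such a policy inside the library, the caller would no longer need to pass a device ID in, and each library could re-select its own device whenever it gains control.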

Nov 20 '19 17:11 oschuett
