dbcsr
dbcsr copied to clipboard
ACC init/finalize issues
There are two potential issues with DBCSR's init/fini flow wrt ACC interface and LIBSMM:
- DBCSR calls
acc_init()
, which eventually callslibsmm_acc_init()
internally (this is a potential issue since the backend (CUDA/HIP, or OpenCL, etc.) then depends on LIBSMM). The latter routine (libsmm_acc_init
) was only recently introduced (everything here applies similarly to the respective finalize routine also). However, the expected flow (IMHO) is that DBCSR callsacc_init()
andlibsmm_acc_init()
since both respective interfaces are used (and DBCSR does not judge potential internal depedency between ACC and LIBSMM implementation). Then,libsmm_acc_init()
may callacc_init()
internally if the ACC interface is used to implement LIBSMM (note, this dependency is somewhat expected). In turn, all init/finalize routines should be safe against multiple/over-initialization. - DBCSR calls
dbcsr_multiply_lib_init()
inside of a parallel region, and subsequent (sub-)initialization may be called only by the master thread. For instance,acc_init()
is called by the master thread only (OMP MASTER construct). However,acc_finalize()
is apparently called by all threads of a parallel region (workshare). The latter is inconsistent with respect to a "tandem flow" of init/finalize. The expect behavior (IMHO) is to protect acc_init/finalize and libsmm_acc_init/finalize against multiple threads, i.e., ACC and LIBSMM implementations are only expected to be thread-safe after initialization and before finalization.
ACK. Possibly related to #421?
ACK. Possibly related to #421?
Could be, at least related in the sense of being an OpenMP/multicore inconsistency.
For OpenCL backend, the first issue from above (1) was unveiled by linker errors and weird dependency/use of libraries/objects in CMake, for the second issue (2) the expectation is/was made explicit (https://github.com/hfp/dbcsr/blob/openclacc/src/acc/opencl/acc_opencl.c#L133).
@hfp I don't know if you've seen this, but I prefixed the full C api with c_dbcsr_* here: https://github.com/cp2k/dbcsr/pull/419 to avoid this issue:
The C api is now prefixed with c_dbsr_* because otherwise the Fortran code ended up calling acc_init in libgomp.so instead of the C-part of dbcsr.
@haampie Thank you! I was not aware about this. Also, good to know if end up merging our work together.
One unrelated question (in case you have an AMD graphics card at hand): perhaps you can try/play with the OpenCL backend on AMD? Should work out-of-the-box (USE_ACCEL=opencl, etc.). Though, kernels are not sophisticated (maybe it looks like this because of the ifdef switches in the kernel-code; however DBCSR's Cuda kernels are much more advanced at the moment with more parameters). I cam curious if the OpenCL based LIBSMM is still performing reasonable.
For the trial, you may want to have a look at the OpenCL backend and OpenCL-based LIBSMM documentation. In particular, the auto-tuning guide should help to increase performance further.
Hi @hfp, I'll give it a shot, but currently the cluster with AMD GPUs is unavailable, hopefully it's back next week.
@haampie Thank you!
Related to https://github.com/cp2k/dbcsr/issues/261
@hfp Can we close this issue (and the corresponding #261)?
Yes
@hfp I went again to read this issue... My understanding is that the first point is fixed (acc_init). Note that there is issue #13 somehow related to this topic... But I now read the second bullet (init and finalize with different workflow). I assume this is still there, correct? If so, we should reopen this issue...
There was also a fix necessary on my side, which was unveiled by @haampie 's changes. My CMake stuff applied -fopenmp
(or equivalent) only to the LIBSMM source code but not the backend sources. It seems to me the respective init/fini functions are both called by the master thread from inside of parallel regions (which is symmetric in contrast to above claim). However, I have not checked this in DBCSR's sources.
There is still the question why we call init/fini functions in parallel regions (if we only want the master thread). Could be just DBCSR's convention (and hence fine). Although, this means the OpenCL BE cannot setup stuff for all threads like this ACC_OPENCL_THREADLOCAL_CONTEXT feature which replicates the context into each thread (it has not proven necessary or useful at this point). It could be a need later if revisiting wrt to multiple devices.
Then let's reopen... I like the idea to have symmetry between init and finalize...