dbcsr icon indicating copy to clipboard operation
dbcsr copied to clipboard

ACC init/finalize issues

Open hfp opened this issue 4 years ago • 12 comments

There are two potential issues with DBCSR's init/fini flow wrt ACC interface and LIBSMM:

  1. DBCSR calls acc_init(), which eventually calls libsmm_acc_init() internally (this is a potential issue since the backend (CUDA/HIP, or OpenCL, etc.) then depends on LIBSMM). The latter routine (libsmm_acc_init) was only recently introduced (everything here applies similarly to the respective finalize routine also). However, the expected flow (IMHO) is that DBCSR calls acc_init() and libsmm_acc_init() since both respective interfaces are used (and DBCSR does not judge potential internal depedency between ACC and LIBSMM implementation). Then, libsmm_acc_init() may call acc_init() internally if the ACC interface is used to implement LIBSMM (note, this dependency is somewhat expected). In turn, all init/finalize routines should be safe against multiple/over-initialization.
  2. DBCSR calls dbcsr_multiply_lib_init() inside of a parallel region, and subsequent (sub-)initialization may be called only by the master thread. For instance, acc_init() is called by the master thread only (OMP MASTER construct). However, acc_finalize() is apparently called by all threads of a parallel region (workshare). The latter is inconsistent with respect to a "tandem flow" of init/finalize. The expect behavior (IMHO) is to protect acc_init/finalize and libsmm_acc_init/finalize against multiple threads, i.e., ACC and LIBSMM implementations are only expected to be thread-safe after initialization and before finalization.

hfp avatar Jan 21 '21 08:01 hfp

ACK. Possibly related to #421?

dev-zero avatar Jan 21 '21 08:01 dev-zero

ACK. Possibly related to #421?

Could be, at least related in the sense of being an OpenMP/multicore inconsistency.

For OpenCL backend, the first issue from above (1) was unveiled by linker errors and weird dependency/use of libraries/objects in CMake, for the second issue (2) the expectation is/was made explicit (https://github.com/hfp/dbcsr/blob/openclacc/src/acc/opencl/acc_opencl.c#L133).

hfp avatar Jan 21 '21 09:01 hfp

@hfp I don't know if you've seen this, but I prefixed the full C api with c_dbcsr_* here: https://github.com/cp2k/dbcsr/pull/419 to avoid this issue:

The C api is now prefixed with c_dbsr_* because otherwise the Fortran code ended up calling acc_init in libgomp.so instead of the C-part of dbcsr.

haampie avatar Jan 28 '21 10:01 haampie

@haampie Thank you! I was not aware about this. Also, good to know if end up merging our work together.

One unrelated question (in case you have an AMD graphics card at hand): perhaps you can try/play with the OpenCL backend on AMD? Should work out-of-the-box (USE_ACCEL=opencl, etc.). Though, kernels are not sophisticated (maybe it looks like this because of the ifdef switches in the kernel-code; however DBCSR's Cuda kernels are much more advanced at the moment with more parameters). I cam curious if the OpenCL based LIBSMM is still performing reasonable.

For the trial, you may want to have a look at the OpenCL backend and OpenCL-based LIBSMM documentation. In particular, the auto-tuning guide should help to increase performance further.

hfp avatar Jan 28 '21 11:01 hfp

Hi @hfp, I'll give it a shot, but currently the cluster with AMD GPUs is unavailable, hopefully it's back next week.

haampie avatar Jan 29 '21 08:01 haampie

@haampie Thank you!

hfp avatar Jan 29 '21 09:01 hfp

Related to https://github.com/cp2k/dbcsr/issues/261

alazzaro avatar Feb 02 '21 20:02 alazzaro

@hfp Can we close this issue (and the corresponding #261)?

alazzaro avatar Feb 04 '21 14:02 alazzaro

Yes

hfp avatar Feb 04 '21 14:02 hfp

@hfp I went again to read this issue... My understanding is that the first point is fixed (acc_init). Note that there is issue #13 somehow related to this topic... But I now read the second bullet (init and finalize with different workflow). I assume this is still there, correct? If so, we should reopen this issue...

alazzaro avatar Feb 08 '21 08:02 alazzaro

There was also a fix necessary on my side, which was unveiled by @haampie 's changes. My CMake stuff applied -fopenmp (or equivalent) only to the LIBSMM source code but not the backend sources. It seems to me the respective init/fini functions are both called by the master thread from inside of parallel regions (which is symmetric in contrast to above claim). However, I have not checked this in DBCSR's sources.

There is still the question why we call init/fini functions in parallel regions (if we only want the master thread). Could be just DBCSR's convention (and hence fine). Although, this means the OpenCL BE cannot setup stuff for all threads like this ACC_OPENCL_THREADLOCAL_CONTEXT feature which replicates the context into each thread (it has not proven necessary or useful at this point). It could be a need later if revisiting wrt to multiple devices.

hfp avatar Feb 08 '21 09:02 hfp

Then let's reopen... I like the idea to have symmetry between init and finalize...

alazzaro avatar Feb 08 '21 09:02 alazzaro