Tool-invoked fencing on ExecSpace Instance
Kokkos core GitHub Issue #6894 suggests that Kokkos Tools tool-invoked fencing could be a cause for the extraneous overhead of the Kokkos user function Kokkos::fence() as compared to the native device synchronization fence, e.g., CUDA device synchronize.
Whatever the degree of performance impact tool-invoked fencing has, the tool-invoked fence function needs to be fixed so that it acts at a finer granularity within the execution space. Right now, the tool-invoked fence function takes in a devID and then does nothing with it, simply calling Kokkos::fence() with a specific string for the name parameter.
Fixing this is important because of the inherent overhead incurred when fencing across a larger/coarser scope, e.g., fencing across all threads on a CPU and all GPU devices visible to Kokkos incurs more overhead than fencing within a particular GPU visible to Kokkos.
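To make the granularity argument concrete, here is a minimal, self-contained sketch. It does not use Kokkos at all: `MockInstance` and `MockRuntime` are hypothetical stand-ins for execution space instances and for whatever tracks them, and the return value simply counts how many instances had to be synchronized.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one execution space instance (e.g., one stream).
struct MockInstance {
  std::uint32_t id;
  int times_fenced = 0;
  explicit MockInstance(std::uint32_t i) : id(i) {}
  void fence() { ++times_fenced; }
};

// Hypothetical stand-in for the set of all live instances.
struct MockRuntime {
  std::vector<MockInstance> instances;

  // Behavior described in this issue: devID is accepted but ignored,
  // so every instance is fenced. Returns the number of instances fenced.
  int fence_all(std::uint32_t /*devID*/) {
    int fenced = 0;
    for (auto& inst : instances) {
      inst.fence();
      ++fenced;
    }
    return fenced;
  }

  // Proposed behavior: fence only the instance that devID refers to.
  int fence_one(std::uint32_t devID) {
    int fenced = 0;
    for (auto& inst : instances) {
      if (inst.id == devID) {
        inst.fence();
        ++fenced;
      }
    }
    return fenced;
  }
};
```

With four instances registered, `fence_all` touches all four while `fence_one` touches a single one; the real cost difference would depend on the backend and is not measured here.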
The fix likely needs a simple (4-5 line) change to lines 218-230 in:
https://github.com/kokkos/kokkos/blob/develop/core/src/impl/Kokkos_Profiling.cpp
Linking to https://github.com/kokkos/kokkos/issues/6894.
The fix likely needs a simple (4-5 line) change to lines 218-230 in:
https://github.com/kokkos/kokkos/blob/2035e313d7a54f9e1572eb5f315249ea841fb258/core/src/impl/Kokkos_Profiling.cpp#L218-L231
> Whatever the degree of performance impact tool-invoked fencing has, the tool-invoked fence function needs to be fixed so that it acts at a finer granularity within the execution space. Right now, the tool-invoked fence function takes in a devID and then does nothing with it, simply calling Kokkos::fence() with a specific string for the name parameter.
That can be customized, right? Kokkos just sets that to Kokkos::fence initially. Also, this callback isn't widely used (only in the sampler). Some tools do set requires_global_fencing, but that flag doesn't invoke the fence callback; it always calls Kokkos::fence.
I don't see how you would want to change the default fence callback. Can you elaborate on what you envision?
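For readers following along, the two mechanisms mentioned here (the settings flag and the fence callback handed to a tool) can be sketched as below. This is only an illustration: the struct and function names are local stand-ins loosely modeled on the Kokkos Tools C interface as I recall it, not the actual declarations, and should be checked against the real header.

```cpp
#include <cstdint>

// Local stand-ins (assumptions), not the real Kokkos Tools types.
struct ToolSettingsSketch {
  bool requires_global_fencing;
};
using ToolInvokedFence = void (*)(std::uint32_t devID);
struct ToolInterfaceSketch {
  ToolInvokedFence fence;
};

namespace sketch_tool {

// Fence callback the tool received from the runtime, if any.
ToolInvokedFence stored_fence = nullptr;

// A tool opting out of the automatic global fence around callbacks.
void request_tool_settings(ToolSettingsSketch* settings) {
  settings->requires_global_fencing = false;
}

// The runtime hands its fence function to the tool; the tool stores it.
void provide_tool_programming_interface(ToolInterfaceSketch funcs) {
  stored_fence = funcs.fence;
}

// Called when the tool itself decides a fence is needed (as a sampler
// might for a sampled kernel). Returns false if no callback was provided.
bool fence_if_available(std::uint32_t devID) {
  if (stored_fence == nullptr) return false;
  stored_fence(devID);
  return true;
}

}  // namespace sketch_tool
```

In this picture, "customizing" the fence amounts to the runtime passing a different function through the interface struct, while the settings flag is a separate switch that never goes through the stored callback.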
> Kokkos core GitHub Issue #6894 suggests that Kokkos Tools tool-invoked fencing could be a cause for the extraneous overhead of the Kokkos user function Kokkos::fence() as compared to the native device synchronization fence, e.g., CUDA device synchronize.
That issue asks whether the implementation of execution space instance fences and global fences has significant overhead over the native versions (that, e.g., don't pass a std::string along). That is not about kokkos/kokkos@2035e31/core/src/impl/Kokkos_Profiling.cpp#L218-L231 but rather about (for Cuda) https://github.com/kokkos/kokkos/blob/2035e313d7a54f9e1572eb5f315249ea841fb258/core/src/Cuda/Kokkos_Cuda_Instance.cpp#L138-L156.
@masterleinad
I am not sure what you mean by the term "customized".
You may be right that it's only needed for the sampler, but I think it is an enhancement that would further improve performance when using sampling (this hasn't been an inhibiting factor in the sampling experiments done so far).
The idea of my fix is to make use of the bits of the devID inside tool_invoked_fence(). I would definitely need to extract the last 17 bits of that parameter (it's a uint32_t) to get the instance bits, and I think I would need the first 7 bits if we are running the Kokkos program in parallel across multiple GPUs. Then, I would use those instance bits as the basis for the execution space fence, i.e., execution_space.fence() rather than Kokkos::fence().
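The bit extraction described above could look roughly like the following sketch. The 17 instance bits come from this comment; the rest of the layout (7 device bits directly above them, with the remaining 8 high bits holding the device type) is my assumption and should be verified against the Kokkos profiling interface header. All names here are hypothetical helpers, not Kokkos functions.

```cpp
#include <cstdint>

// Assumed layout of the 32-bit devID, from high to low bits:
// [ 8 type bits | 7 device bits | 17 instance bits ]
constexpr std::uint32_t kInstanceBits = 17;
constexpr std::uint32_t kDeviceBits   = 7;

struct DecodedDevID {
  std::uint32_t type;      // which backend (assumed top 8 bits)
  std::uint32_t device;    // which physical/visible device
  std::uint32_t instance;  // which execution space instance
};

// Mask with the lowest nbits bits set (nbits must be below 32).
constexpr std::uint32_t low_mask(std::uint32_t nbits) {
  return (std::uint32_t{1} << nbits) - 1u;
}

// Split a devID into its three fields under the assumed layout.
constexpr DecodedDevID decode_dev_id(std::uint32_t devID) {
  return DecodedDevID{devID >> (kInstanceBits + kDeviceBits),
                      (devID >> kInstanceBits) & low_mask(kDeviceBits),
                      devID & low_mask(kInstanceBits)};
}

// Inverse of decode_dev_id, useful for round-trip checks.
constexpr std::uint32_t encode_dev_id(std::uint32_t type, std::uint32_t device,
                                      std::uint32_t instance) {
  return (type << (kInstanceBits + kDeviceBits)) |
         ((device & low_mask(kDeviceBits)) << kInstanceBits) |
         (instance & low_mask(kInstanceBits));
}
```

The part this sketch leaves open is the harder one: mapping the decoded instance field back to an actual execution space object on which fence() can be called.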
Note that in https://github.com/kokkos/kokkos-tools/blob/develop/common/kokkos-sampler/kp_sampler_skip.cpp, there is a function getDeviceID() that grabs the physical/visible device ID in a list of devices on a node. That's the first 7 bits and not solely what we need.
I went through this PR (by you, four years ago): https://github.com/kokkos/kokkos/pull/2672#pullrequestreview-347733349
and this helped me understand that this is a todo for Kokkos Tools, since I see at the top that Kokkos_Profiling.cpp was not touched.
Stepping back though: the GitHub issue referenced in Kokkos core seems to actually be pointing at a different problem involving Kokkos Tools, as you have suggested and as I have thought about on my own. Basically, for the implementation of every fence in a backend, there is a profile_fence_event function wrapped around it, e.g., in the CUDA backend, the implementation of Kokkos::fence() involves the profile_fence_event function taking cudaDeviceSynchronize as a parameter. For that, I can't imagine there can be any overhead unless an actual tool connector is enabled, e.g., kp_kernel_logger. Note that the global fence (is the term "automatic" used interchangeably?) issued by the tool should be filtered out (see #156). But I don't think this is the problem here, since no tool connector library function kokkosp_begin/end_fence() is being called.
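The wrapping pattern I am describing can be sketched as follows. This is a simplified stand-in, not the actual Kokkos implementation: `FenceCallbacks` and `profile_fence_event_sketch` are hypothetical names, and the callback signatures are modeled on the kokkosp_begin_fence/kokkosp_end_fence hooks as I understand them.

```cpp
#include <cstdint>
#include <string>
#include <utility>

// Callback shapes modeled on the begin/end fence tool hooks (assumption).
using BeginFenceCallback = void (*)(const char* name, std::uint32_t devID,
                                    std::uint64_t* handle);
using EndFenceCallback = void (*)(std::uint64_t handle);

// Both pointers stay null when no tool connector is loaded.
struct FenceCallbacks {
  BeginFenceCallback begin_fence = nullptr;
  EndFenceCallback end_fence     = nullptr;
};

// Wraps the native fence (passed as a callable) with optional tool hooks.
template <class FenceFn>
void profile_fence_event_sketch(const FenceCallbacks& cb,
                                const std::string& name, std::uint32_t devID,
                                FenceFn&& do_fence) {
  std::uint64_t handle = 0;
  if (cb.begin_fence != nullptr) cb.begin_fence(name.c_str(), devID, &handle);
  std::forward<FenceFn>(do_fence)();  // e.g., the native device synchronize
  if (cb.end_fence != nullptr) cb.end_fence(handle);
}
```

Without a tool, the wrapper reduces to two null checks around the native call. Note, though, that the caller still has to build the std::string name in every case, which is the kind of cost the referenced issue asks about.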
Let me know if I have misunderstood something here. I agree that the issue you clarify in the second comment in this Issue seems to be orthogonal (though not unrelated broadly speaking) to the specific problem of this particular issue.