Replace SYCL 1.2.1 group barrier with SYCL 2020 alternative
https://github.com/oneapi-src/oneDPL/blob/470df99e5f27ab8da3ee55941ff9d6a9e0aa9730/include/oneapi/dpl/pstl/hetero/dpcpp/sycl_defs.h#L170
Please see details in https://github.com/intel/llvm/issues/12531 Please see details in https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_group_broadcast
There are two major reasons of using the outdated barrier API:
- Performance. You can find the details here: https://github.com/intel/llvm/issues/12531.
- Non-compatible semantics, as mentioned in the code comments, which may require some work: https://github.com/oneapi-src/oneDPL/blob/470df99e5f27ab8da3ee55941ff9d6a9e0aa9730/include/oneapi/dpl/pstl/hetero/dpcpp/sycl_defs.h#L171-L174
I suppose that the most important reason is 1. However, oneDPL is claimed to be SYCL 2020 conformant, so SYCL 2020 group_barrier should be used, at least for the compilers other than oneAPI DPC++ compiler.
Let's finally clarify the question of semantics.
The __nd_item.barrier(sycl::access::fence_space::local_space) that is currently in use has the following semantics in SYCL 1.2.1:
Executes a work-group barrier with memory ordering on the local address space ... . The current work-item will wait at the barrier until all work-items in the current work-group have reached the barrier. In addition the barrier performs a fence operation ensuring that all memory accesses in the specified address space issued before the barrier complete before those issued after the barrier.
In other words, it serves as both a barrier for work items and as a memory fence (with unclear ordering semantics, but at least acquire-release as it seems from the description) for operations within local (i.e. work group) memory scope.
The work-group barrier sycl::group_barrier(__nd_item.get_group(), sycl::memory_scope::work_group) has the following semantics in SYCL 2020:
Synchronizes all work-items in a group. The current work-item will wait at the barrier until all work-items in the group have reached the barrier. In addition, the barrier performs mem-fence operations ensuring that memory accesses issued before the barrier are not re-ordered with those issued after the barrier: all work-items in the group execute a release fence prior to synchronizing at the barrier, all work-items in the group execute an acquire fence afterwards, and there is an implicit synchronization of these fences as if provided by an explicit atomic operation on an atomic object. By default, the scope of these fences is set to the narrowest scope including all work-items in the group ... This scope may be optionally overridden with a wider scope, specified by the
fence_scopeargument.
In other words, it serves as both the barrier and the acquire-release memory fence in the specified memory scope, which is the group scope sycl::memory_scope::work_group (and which is also the default for work-groups, so it can as well be implicit).
The differences I observe are:
- SYCL 2020 barrier allows to explicitly specify a group, while SYCL 1.2.1 barrier is always for the current work-group of an
nd_item. Of course usingget_group()aligns the barrier scope with that in 1.2.1. - SYCL 2020 barrier uses acquire-release memory ordering semantics, while SYCL 1.2.1 barrier is not clear about it. But as I said, it seems that its semantics is at least acquire-release as well, given the wording of "all memory accesses issued before the barrier complete before those issued after the barrier".
All in all, it seems that the new group_barrier can be used in the same way / with the same effect as the old one.
I think the comment telling about them being "not quite equivalent" is there either because earlier versions of SYCL 2020 did not provide enough clarity or because the barrier memory ordering semantics of 1.2.1 were confused with those of atomics, for which 1.2.1 only supported relaxed memory ordering. But I do not believe that the relaxed ordering would satisfy the described "complete before" requirement.
All in all, it seems that the new group_barrier can be used in the same way / with the same effect as the old one.
As far as I understand, another difference (and the reason for performance drop mentioned in the linked issues) is that the old version "Executes a work-group barrier with memory ordering on the local address space", while the new version affects all memory operations, in both local and global address spaces (but both only within work-group/local scope):
sycl::memory_scope::work_group The ordering constraint applies only to work-items in the same work-group as the calling work-item;
Thanks @al42and - indeed, this is an important difference that I missed, and that impacts performance.
If that difference is important for oneDPL code, then we should make it visible - either in the oneDPL wrapper name or maybe with a template parameter - that this barrier orders operations only for data in local memory but does not order global data accesses.
The comment https://github.com/intel/llvm/issues/12531#issuecomment-2370736542 refers to a device compiler bug that has been fixed (https://github.com/intel/intel-graphics-compiler/commit/ed639f68d142bc963a7b626badc207a42fb281cb) and that should improve performance.
Despite the subtle semantical difference, we still need to drop the use of the outdated API.
I tried to understand the impact of the replacement of SYCL 1.2.1 barrier with SYCL 2020 one. Below is a table with speed-ups (times) after switching to SYCL 2020 barrier when using different drivers: LTS - long-term support, and Rolling - "experimental" with the latest features. Less than 1 is a slow down (times). The selected algorithms use group barriers.
| Algorithm | LTS (2350.61) - Jun 2024 | Rolling (2441.21) - Nov 2024 | LTS (2350.125) - Dec 2024 |
|---|---|---|---|
| reduce | 1.00 | 1.01 | 0.99 |
| inclusive_scan | 0.99 | 1.00 | 0.99 |
| sort (merge) | 0.97 | 0.99 | 0.97 |
| sort (radix) | 0.96 | 0.99 | 0.98 |
| inclusive_scan_by_segment | 0.91 | 0.99 | 0.91 |
| reduce_by_segment | 0.99 | 1.00 | 0.99 |
| histogram (1024 bins) | 0.97 | 1.00 | 0.98 |
- oneAPI 2025.0, Ubuntu 24.04, Intel® Data Center GPU Max 1550, uint32_t
The fix is not available with LTS drivers. I think that it would be better to continue using SYCL 1.2.1 to avoid regression, and switch to SYCL 2020 later, e.g. when all LTS drivers listed here get that fix. We know that with DPC++ compiler still supports SYCL 1.2.1 barrier, but it is not applicable to other compilers, so SYCL 2020 is better to be used by default. This is what #1988 implements.
Just for note.
We are using now sycl::nd_item::barrier described at https://registry.khronos.org/SYCL/specs/sycl-1.2.1.pdf, 4.8.1.6 nd_item class :
void barrier(
access::fence_space accessSpace =
access::fence_space::global_and_local)const
Description: "Executes a work-group barrier with memory ordering on the local address space, global address space or both based on the value of accessSpace. The current work-item will wait at the barrier until all work-items in the current work-group have reached the barrier. In addition the barrier performs a fence operation ensuring that all memory accesses in the specified address space issued before the barrier complete before those issued after the barrier."
How fence_space defined in our SYCL implementation now:
namespace sycl {
inline namespace _V1 {
namespace access {
enum class fence_space {
local_space = 0,
global_space = 1,
global_and_local = 2
};
}
}
}
How we are using sycl::nd_item::barrier call in oneDPL code:
template <typename _Item>
constexpr void
__group_barrier(_Item __item)
{
#if 0 // !defined(_ONEDPL_LIBSYCL_VERSION) || _ONEDPL_LIBSYCL_VERSION >= 50300
//TODO: usage of sycl::group_barrier: probably, we have to revise SYCL parallel patterns which use a group_barrier.
// 1) sycl::group_barrier() implementation is not ready
// 2) sycl::group_barrier and sycl::item::group_barrier are not quite equivalent
sycl::group_barrier(__item.get_group(), sycl::memory_scope::work_group);
#else
__item.barrier(sycl::access::fence_space::local_space);
#endif
}
https://registry.khronos.org/SYCL/specs/sycl-1.2.1.pdf, 4.11 Synchronization and atomics : "A work-group barrier or work-group mem-fence may provide ordering semantics over the local address space, global address space or both. All memory operations initiated before the work-group barrier or work-group mem-fence operation in the specified address space(s) will be completed before any memory operation after the operation. Address spaces are specified using the fence_space enum class:"
namespace cl {
namespace sycl {
namespace access {
enum class fence_space : char {
local_space,
global_space,
global_and_local
}; // enum class fence_space
} // namepaces access
} // namespace sycl
} // namespace cl