oneDPL Replace SYCL 1.2.1 group barrier with SYCL 2020 alternative

https://github.com/oneapi-src/oneDPL/blob/470df99e5f27ab8da3ee55941ff9d6a9e0aa9730/include/oneapi/dpl/pstl/hetero/dpcpp/sycl_defs.h#L170

Please see details in https://github.com/intel/llvm/issues/12531 Please see details in https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_group_broadcast

Jul 08 '24 12:07 SergeyKopienko

There are two major reasons of using the outdated barrier API:

Performance. You can find the details here: https://github.com/intel/llvm/issues/12531.
Non-compatible semantics, as mentioned in the code comments, which may require some work: https://github.com/oneapi-src/oneDPL/blob/470df99e5f27ab8da3ee55941ff9d6a9e0aa9730/include/oneapi/dpl/pstl/hetero/dpcpp/sycl_defs.h#L171-L174

I suppose that the most important reason is 1. However, oneDPL is claimed to be SYCL 2020 conformant, so SYCL 2020 group_barrier should be used, at least for the compilers other than oneAPI DPC++ compiler.

Jul 09 '24 15:07 dmitriy-sobolev

Let's finally clarify the question of semantics.

The __nd_item.barrier(sycl::access::fence_space::local_space) that is currently in use has the following semantics in SYCL 1.2.1:

Executes a work-group barrier with memory ordering on the local address space ... . The current work-item will wait at the barrier until all work-items in the current work-group have reached the barrier. In addition the barrier performs a fence operation ensuring that all memory accesses in the specified address space issued before the barrier complete before those issued after the barrier.

In other words, it serves as both a barrier for work items and as a memory fence (with unclear ordering semantics, but at least acquire-release as it seems from the description) for operations within local (i.e. work group) memory scope.

The work-group barrier sycl::group_barrier(__nd_item.get_group(), sycl::memory_scope::work_group) has the following semantics in SYCL 2020:

Synchronizes all work-items in a group. The current work-item will wait at the barrier until all work-items in the group have reached the barrier. In addition, the barrier performs mem-fence operations ensuring that memory accesses issued before the barrier are not re-ordered with those issued after the barrier: all work-items in the group execute a release fence prior to synchronizing at the barrier, all work-items in the group execute an acquire fence afterwards, and there is an implicit synchronization of these fences as if provided by an explicit atomic operation on an atomic object. By default, the scope of these fences is set to the narrowest scope including all work-items in the group ... This scope may be optionally overridden with a wider scope, specified by the fence_scope argument.

In other words, it serves as both the barrier and the acquire-release memory fence in the specified memory scope, which is the group scope sycl::memory_scope::work_group (and which is also the default for work-groups, so it can as well be implicit).

The differences I observe are:

SYCL 2020 barrier allows to explicitly specify a group, while SYCL 1.2.1 barrier is always for the current work-group of an nd_item. Of course using get_group() aligns the barrier scope with that in 1.2.1.
SYCL 2020 barrier uses acquire-release memory ordering semantics, while SYCL 1.2.1 barrier is not clear about it. But as I said, it seems that its semantics is at least acquire-release as well, given the wording of "all memory accesses issued before the barrier complete before those issued after the barrier".

All in all, it seems that the new group_barrier can be used in the same way / with the same effect as the old one.

I think the comment telling about them being "not quite equivalent" is there either because earlier versions of SYCL 2020 did not provide enough clarity or because the barrier memory ordering semantics of 1.2.1 were confused with those of atomics, for which 1.2.1 only supported relaxed memory ordering. But I do not believe that the relaxed ordering would satisfy the described "complete before" requirement.

Jul 09 '24 16:07 akukanov

All in all, it seems that the new group_barrier can be used in the same way / with the same effect as the old one.

As far as I understand, another difference (and the reason for performance drop mentioned in the linked issues) is that the old version "Executes a work-group barrier with memory ordering on the local address space", while the new version affects all memory operations, in both local and global address spaces (but both only within work-group/local scope):

sycl::memory_scope::work_group The ordering constraint applies only to work-items in the same work-group as the calling work-item;

Jul 10 '24 12:07 al42and

Thanks @al42and - indeed, this is an important difference that I missed, and that impacts performance.

If that difference is important for oneDPL code, then we should make it visible - either in the oneDPL wrapper name or maybe with a template parameter - that this barrier orders operations only for data in local memory but does not order global data accesses.

Jul 10 '24 12:07 akukanov

The comment https://github.com/intel/llvm/issues/12531#issuecomment-2370736542 refers to a device compiler bug that has been fixed (https://github.com/intel/intel-graphics-compiler/commit/ed639f68d142bc963a7b626badc207a42fb281cb) and that should improve performance.

Despite the subtle semantical difference, we still need to drop the use of the outdated API.

Sep 25 '24 15:09 akukanov

I tried to understand the impact of the replacement of SYCL 1.2.1 barrier with SYCL 2020 one. Below is a table with speed-ups (times) after switching to SYCL 2020 barrier when using different drivers: LTS - long-term support, and Rolling - "experimental" with the latest features. Less than 1 is a slow down (times). The selected algorithms use group barriers.

Algorithm	LTS (2350.61) - Jun 2024	Rolling (2441.21) - Nov 2024	LTS (2350.125) - Dec 2024
reduce	1.00	1.01	0.99
inclusive_scan	0.99	1.00	0.99
sort (merge)	0.97	0.99	0.97
sort (radix)	0.96	0.99	0.98
inclusive_scan_by_segment	0.91	0.99	0.91
reduce_by_segment	0.99	1.00	0.99
histogram (1024 bins)	0.97	1.00	0.98

oneAPI 2025.0, Ubuntu 24.04, Intel® Data Center GPU Max 1550, uint32_t

The fix is not available with LTS drivers. I think that it would be better to continue using SYCL 1.2.1 to avoid regression, and switch to SYCL 2020 later, e.g. when all LTS drivers listed here get that fix. We know that with DPC++ compiler still supports SYCL 1.2.1 barrier, but it is not applicable to other compilers, so SYCL 2020 is better to be used by default. This is what #1988 implements.

Jan 05 '25 17:01 dmitriy-sobolev

Just for note. We are using now sycl::nd_item::barrier described at https://registry.khronos.org/SYCL/specs/sycl-1.2.1.pdf, 4.8.1.6 nd_item class :

void barrier(
access::fence_space accessSpace =
access::fence_space::global_and_local)const

Description: "Executes a work-group barrier with memory ordering on the local address space, global address space or both based on the value of accessSpace. The current work-item will wait at the barrier until all work-items in the current work-group have reached the barrier. In addition the barrier performs a fence operation ensuring that all memory accesses in the specified address space issued before the barrier complete before those issued after the barrier."

How fence_space defined in our SYCL implementation now:

namespace sycl {
inline namespace _V1 {
namespace access {

enum class fence_space {
  local_space = 0,
  global_space = 1,
  global_and_local = 2
};

}
}
}

How we are using sycl::nd_item::barrier call in oneDPL code:

template <typename _Item>
constexpr void
__group_barrier(_Item __item)
{
#if 0 // !defined(_ONEDPL_LIBSYCL_VERSION) || _ONEDPL_LIBSYCL_VERSION >= 50300
    //TODO: usage of sycl::group_barrier: probably, we have to revise SYCL parallel patterns which use a group_barrier.
    // 1) sycl::group_barrier() implementation is not ready
    // 2) sycl::group_barrier and sycl::item::group_barrier are not quite equivalent
    sycl::group_barrier(__item.get_group(), sycl::memory_scope::work_group);
#else
    __item.barrier(sycl::access::fence_space::local_space);
#endif
}

https://registry.khronos.org/SYCL/specs/sycl-1.2.1.pdf, 4.11 Synchronization and atomics : "A work-group barrier or work-group mem-fence may provide ordering semantics over the local address space, global address space or both. All memory operations initiated before the work-group barrier or work-group mem-fence operation in the specified address space(s) will be completed before any memory operation after the operation. Address spaces are specified using the fence_space enum class:"

namespace cl {
namespace sycl {
namespace access {
enum class fence_space : char {
local_space,
global_space,
global_and_local
}; // enum class fence_space
} // namepaces access
} // namespace sycl
} // namespace cl

Feb 07 '25 11:02 SergeyKopienko