oneDPL icon indicating copy to clipboard operation
oneDPL copied to clipboard

Align `__get_sycl_range` with SYCL runtime in it's treatment of `write` access mode and `no_init{}`

Open danhoeflinger opened this issue 2 years ago • 2 comments

Currently __get_sycl_range with a write access mode and a host_iterator input does not include a "copy-in" for data. The copy-in is skipped, as the sycl::buffer is not initialized with the host_iterator's data, only its number of elements.

This currently is not aligned with the SYCL runtime, where write implies a "copy-in" unless a no_init{} property is added to the accessor creation. We should add a no_init{} property or parameter to __get_sycl_range so that we can be in line with the SYCL runtime.
We should examine all current write and read_write access mode calls to __get_sycl_range, and determine where a copy in is required and where it is not, and use the new feature accordingly.

From #276:

// TODO: add optional template parameter: no_init to match sycl property::no_init.  This can be used to allow accessors
// in `write` or `read_write` mode to not copy in data prior to a kernel where it is not required.  This will also allow
// us to align with the SYCL spec on how we treat host_iterators.  We currently only copy in when `read` or `read_write`
// access mode is specified, not for `write`.  If we add a `no_init` property, we can explicitly use it where
// it applies, and operate under the same rules as SYCL with copying host_iterators.

danhoeflinger avatar Nov 13 '23 17:11 danhoeflinger

Transform_if uses read_write here to force a copy-in currently, even though read access isn't really required. What it wants is write without no_init. This becomes more of a problem in #1976.

danhoeflinger avatar Jan 24 '25 02:01 danhoeflinger

Providing some detailed context on the issue @danhoeflinger is referring to above:

Problem

https://github.com/uxlfoundation/oneDPL/pull/1976 introduces vectorized implementation paths for parallel-for based algorithms. These implementations explicitly load and store data to and from local arrays in order to generate vectorized instructions for performance purposes.

One of our patterns that utilizes this vectorized implementation is __pattern_walk2. Consider its signature:

template <typename _WaitMode = __par_backend_hetero::__deferrable_mode,
          __par_backend_hetero::access_mode __acc_mode1 = __par_backend_hetero::access_mode::read,
          __par_backend_hetero::access_mode __acc_mode2 = __par_backend_hetero::access_mode::write,
          typename _BackendTag, typename _ExecutionPolicy, typename _ForwardIterator1, typename _ForwardIterator2,
          typename _Function>
_ForwardIterator2
__pattern_walk2(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _ForwardIterator1 __first1,
                _ForwardIterator1 __last1, _ForwardIterator2 __first2, _Function __f)

Note that two template types __acc_mode1 and __acc_mode2 are defaulted to read and write. This is due to the semantics of the C++ algorithm library where the first sequence is most commonly read from and the last sequence is written to (e.g. std::copy).

__pattern_walk3, a similar pattern, shows an identical trend but applied to three iterator sequences where the defaulted access modes are read, read, write (e.g. as seen in binary transform overload):

template <typename _BackendTag, __par_backend_hetero::access_mode __acc_mode1 = __par_backend_hetero::access_mode::read,
          __par_backend_hetero::access_mode __acc_mode2 = __par_backend_hetero::access_mode::read,
          __par_backend_hetero::access_mode __acc_mode3 = __par_backend_hetero::access_mode::write,
          typename _ExecutionPolicy, typename _ForwardIterator1, typename _ForwardIterator2, typename _ForwardIterator3,
          typename _Function>
_ForwardIterator3
__pattern_walk3(__hetero_tag<_BackendTag>, _ExecutionPolicy&& __exec, _ForwardIterator1 __first1,
                _ForwardIterator1 __last1, _ForwardIterator2 __first2, _ForwardIterator3 __first3, _Function __f)

The vectorized implementations in the mentioned PR take advantage of this common behavior by only vectorizing loads of all but the last sequence and only outputting the last sequence. This can be seen from the vectorized implementation of walk3_vectors_or_scalars which vectorizes loads of the first two iterator sequences (internally ranges) and only outputs to the global last range:

// 1. Load inputs into vectors
oneapi::dpl::__par_backend_hetero::__vector_load<__base_t::__preferred_vector_size>{__n}(
    __is_full, __idx, oneapi::dpl::__par_backend_hetero::__scalar_load_op{}, __rng1, __rng1_vector);
oneapi::dpl::__par_backend_hetero::__vector_load<__base_t::__preferred_vector_size>{__n}(
    __is_full, __idx, oneapi::dpl::__par_backend_hetero::__scalar_load_op{}, __rng2, __rng2_vector);
// 2. Apply binary functor to vector and store into global memory
oneapi::dpl::__par_backend_hetero::__vector_store<__base_t::__preferred_vector_size>{__n}(
    __is_full, __idx, oneapi::dpl::__par_backend_hetero::__scalar_store_transform_op<_F>{__f}, __rng1_vector,
    __rng2_vector, __rng3);

A notable exception to this is std::swap_ranges which has been special handled. Logically, this produces correct results throughout current implementations of all oneDPL algorithms, but the templated access types still remain due to necessity and creates a dangerous scenario for future uses.

Currently, there are two cases that use different access mode types than the default templates. Firstly, in __pattern_unique.

    // The temporary buffer is constructed from a range, therefore it's destructor will not block, therefore
    // we must call __pattern_walk2 in a way which provides blocking synchronization for this pattern.
    return __pattern_walk2</*_WaitMode*/ __par_backend_hetero::__deferrable_mode,
                           __par_backend_hetero::access_mode::read_write,
                           __par_backend_hetero::access_mode::read_write>(
        __tag, __par_backend_hetero::make_wrapped_policy<copy_back_wrapper>(::std::forward<_ExecutionPolicy>(__exec)),
        __copy_first, __copy_last, __first, __brick_copy<__hetero_tag<_BackendTag>, _ExecutionPolicy>{});

The first provided access mode is __par_backend_hetero::access_mode::read_write. We never write to the first sequence but my understanding is this is required due to data dependencies with previous kernel submissions. This needs to be further investigated.

Secondly, both __pattern_walk2_transform_if and __pattern_walk3_transform_if pass the last output sequence as read_write in order to maintain the values of non-transformed elements in the case input is passed via a host vector:

// Require `read_write` access mode for output sequence to force a copy in for host iterators to capture incoming
// values of the output sequence for elements where the predicate is false as they must be unmodified in the final output.
return __pattern_walk3<_BackendTag, __par_backend_hetero::access_mode::read,
                       __par_backend_hetero::access_mode::read, __par_backend_hetero::access_mode::read_write>(
    __tag,
    __par_backend_hetero::make_wrapped_policy<__walk3_transform_if_wrapper>(
        ::std::forward<_ExecutionPolicy>(__exec)),
    __first1, __last1, __first2, __first3, __func);

These two instances currently necessitate having templated access mode types even if the actual use in the kernel does not align with the mode. This creates a confusing and dangerous case for future development work. If users need to write inputs to a sequence that is not the last, __pattern_walk2 and __pattern_walk3 will produces incorrect results through vector paths even with the correct access mode as the values will never be written to global memory.

Similarly, in the case in which a read access mode is provided to the last sequence and we do indeed read the input, we may run into performance issues as its loads are not vectorized.

None of these cases currently exist but the implementation as it stands creates a "trap" due to these misleading semantics.

Potential solutions to the __pattern_walk2 / __pattern_walk3 problem

Firstly, as mentioned in this issue, supporting write access mode without no_init would resolve part of this issue and would allow transform_if to replace its read_write mode with write without the no_init property.

One such complete option would be to pass access modes to bricks and leverage that data to determine which sequences to load to vectors and store to global memory. Alternatively, a minimal approach would be to figure how we can limit __pattern_walk2 and __pattern_walk3 access modes to the defaults if possible to align with the current vector path's semantics. Understanding the need for the unique cases' first sequence read_write mode should first be understood.

mmichel11 avatar Jan 24 '25 21:01 mmichel11