oneDPL icon indicating copy to clipboard operation
oneDPL copied to clipboard

Adding transform_if algorithm and tests

Open smithc1 opened this issue 3 years ago • 1 comments

Implementation and tests for transform_if are added in order to implement scatter_if and gather_if algorithms. Tests for scatter, gather, scatter_if, and gather_if will be removed from this PR once they are merged into the CT repo.

transform_if interface:

template <typename Policy, typename InputIter1, typename InputIter2, typename OutputIter, typename UnaryOperation, typename Predicate> OutputIter transform_if(Policy&& policy, InputIter1 first, InputIter1 last, InputIter2 mask, OutputIter result, UnaryOperation op, Predicate pred);

transform_if semantics:

This transform_if interface conditionally applies a unary operation on each element of a source range and stores the result in the corresponding index of a result range if the corresponding index in the mask range satisfies a predicate. If the index in the mask range does not satisfy the predicate, then no change to the corresponding index in the result range occurs.

For each iterator i in the range [first, last) such that pred(*(mask + (i - first))) is true, the value op(*i) is assigned to *(result + (i - first)).

smithc1 avatar Jun 15 '21 16:06 smithc1

I'd suggest to rename the Summary of the PR

andreyfe1 avatar Jun 17 '21 08:06 andreyfe1

This patch is interesting but needs some more work. In particular, I think the semantics of transform_if should follow that of the regular transform algorithm, just extending it with the predicate. I.e. there should be two overloads:

template <typename Policy, typename InputIter, typename OutputIter, typename UnaryOperation, typename UnaryPredicate>
OutputIter transform_if(Policy&& policy, InputIter first, InputIter last, OutputIter result, UnaryOperation op, UnaryPredicate pred);

template <typename Policy, typename InputIter1, typename InputIter2, typename OutputIter, typename BinaryOperation, typename BinaryPredicate>
OutputIter transform_if(Policy&& policy, InputIter1 first, InputIter1 last, InputIter2 first2, OutputIter result, BinaryOperation op, BinaryPredicate pred);

The second overload would be used in the implementation of gather_if / scatter_if, with the predicate that only checks the mask (the second input) and the transformation that, on the opposite, ignores the mask and only takes the first input. Alternatively, the unary form can be used in the combination with zip_iterator.

akukanov avatar Nov 24 '22 16:11 akukanov

@akukanov

This patch is interesting but needs some more work. In particular, I think the semantics of transform_if should follow that of the regular transform algorithm, just extending it with the predicate.

As mentioned in the update to the description, I've refactored this to match the API of transform with an added predicate, and adjusted the implementation to reuse existing code patterns/bricks.

danhoeflinger avatar May 03 '23 16:05 danhoeflinger

@danhoeflinger is it possible to fix clang format here as part of our common process? Or current format is ok?

SergeyKopienko avatar May 25 '23 07:05 SergeyKopienko

I believe that we dont need up to 90% of these changes. Because a transform_if semantic is trivially implementing by a call transform algorithm with composition a passed transform functor and condition predicate int one functor which is passing to the current oneDPL transform algorithm. Also the current test "transform" can be extended by an additional call "transform_if", and we don't need to add new test for that.

@MikeDvorskiy I don't believe that this can be trivially implemented by the current transform algorithm. The fundamental difference is that transform always writes the output of a unary or binary functor to the output range. transform_if only writes to the output range for elements which have satisfied the predicate. If the predicate is false, the original value of the output element is preserved. This is the reason for the change to allow __pattern_walk_3 to accept read_write access_mode where it previously hard coded a write only access mode for the output range.

Internally transform and transform_if use a lot of similar infrastructure (__pattern_walk*), but have that fundamental difference. I believe the easiest way to implement transform_if with existing APIs is via std::for_each, and zip_iterators and custom functor composition.

danhoeflinger avatar May 25 '23 12:05 danhoeflinger

I believe that we dont need up to 90% of these changes. Because a transform_if semantic is trivially implementing by a call transform algorithm with composition a passed transform functor and condition predicate int one functor which is passing to the current oneDPL transform algorithm. Also the current test "transform" can be extended by an additional call "transform_if", and we don't need to add new test for that.

@MikeDvorskiy I don't believe that this can be trivially implemented by the current transform algorithm. The fundamental difference is that transform always writes the output of a unary or binary functor to the output range. transform_if only writes to the output range for elements which have satisfied the predicate. If the predicate is false, the original value of the output element is preserved. This is the reason for the change to allow __pattern_walk_3 to accept read_write access_mode where it previously hard coded a write only access mode for the output range.

Internally transform and transform_if use a lot of similar infrastructure (__pattern_walk*), but have that fundamental difference. I believe the easiest way to implement transform_if with existing APIs is via std::for_each, and zip_iterators and custom functor composition.

I see... But in any case you can re-use oneapi::dpl::__internal::__pattern_walk2 and oneapi::dpl::__internal::__pattern_walk2 by passing special functor with conditional assignment. So, it is enough to define a couple of functions like that: https://github.com/oneapi-src/oneDPL/blob/main/include/oneapi/dpl/pstl/glue_algorithm_impl.h#L332 https://github.com/oneapi-src/oneDPL/blob/main/include/oneapi/dpl/pstl/glue_algorithm_impl.h#L347

Or other words, it seems we dont need new patterns __pattern_walk2_transform_if and __pattern_walk3_transform_if.

MikeDvorskiy avatar May 25 '23 13:05 MikeDvorskiy

I see... But in any case you can re-use oneapi::dpl::__internal::__pattern_walk2 and oneapi::dpl::__internal::__pattern_walk2 by passing special functor with conditional assignment. So, it is enough to define a couple of functions like that: https://github.com/oneapi-src/oneDPL/blob/main/include/oneapi/dpl/pstl/glue_algorithm_impl.h#L332 https://github.com/oneapi-src/oneDPL/blob/main/include/oneapi/dpl/pstl/glue_algorithm_impl.h#L347

Or other words, it seems we dont need new patterns __pattern_walk2_transform_if and __pattern_walk3_transform_if.

@MikeDvorskiy As described offline, before this PR, our range implementation only copies host iterator data to the device prior to a kernel when read or read_write access mode is requested. For transform_if, we don't need read access for our output range, but we do need to copy the output range to the device prior to the kernel to prevent untouched elements from being overwritten by uninitialized data on the copy back when the predicate is false.

We do this by adding a template parameter, _ForceCopyDirectForHostIter, to __get_sycl_range which forces this copy (for host iterators) without requiring read_write access mode in the kernel itself.

This is sycl implementation specific, so to communicate that extra requirement to __pattern_walk[2,3], we need another layer specific to transform_if in include/oneapi/dpl/pstl/hetero/algorithm_impl_hetero.h. This is why we need __pattern_walk[2,3]_transform_if.

danhoeflinger avatar May 25 '23 20:05 danhoeflinger

For transform_if, we don't need read access for our output range

Just "write" means read as well. I clarified it some time ago... Dont remember which way... probably, SYCL spec + discussion with guys who are responsible for SYCL language.

To be honest, I don't understand the problem with host iterator and "copy back" feature, which is a kind of implementation details.

So, summary: We have a couple of common middle layer algo patterns - pattern_walk2 and pattern_walk3.

  • pattern_walk2 by default accepts the first sequence for reading and the second one for writing (reading and writing).
  • pattern_walk3 by default accepts the first sequence and the second one for reading and the third one for writing (reading and writing). For USM it mode accesses don't matter, right.? For SYCL buffers => we have correct mode access by default, right? For the host iterators: The input sequences are being copied just one direction - to device. The output sequence is being copied just one direction - to the host.

Which my sentence is not correct? and where we have an issue here?

BTW, other idea = to try just transform with conditinal_transform functor and output_transform iterator with std::ignore

MikeDvorskiy avatar May 26 '23 12:05 MikeDvorskiy

Just "write" means read as well. I clarified it some time ago... Dont remember which way... probably, SYCL spec + discussion with guys who are responsible for SYCL language.

Yes you are correct, I just confirmed this with James Brodman, write implies a copy in of the data for SYCL buffers unless property::no_init is also specified.

However, before this PR, our host_iterator specialization of __get_sycl_range does not initialize the buffer with incoming data when write access mode is requested (see copy_direct_tag below):

https://github.com/oneapi-src/oneDPL/blob/07de70c504282266b7d5796c808e787b70e471fb/include/oneapi/dpl/pstl/hetero/dpcpp/utils_ranges_sycl.h#L502-L526

Therefore, for incoming host iterators:

  • For the host iterators: The input sequences are being copied just one direction - to device. The output sequence is being copied just one direction - to the host.

This is the problem ^. We need to copy the output sequence to the device as well. SYCL does this with buffers in write access mode, but we don't for incoming host iterators.

For instance, if we have a predicate that always returns false, all initial data in the output sequence is the appropriate final output. We are copying back to the host, so we need to copy initial output sequence to the device, otherwise we will overwrite with it uninitialized data.

If we make the change to always copy host iterators in with write access mode, it will be a slowdown in all patterns which use that, so I opted to only do it for transform_if where it is explicitly required.

General note: We may want to consider using property::no_init in more cases to avoid this copy in.

danhoeflinger avatar May 26 '23 12:05 danhoeflinger

other idea = to try just transform with conditinal_transform functor and output_transform iterator with std::ignore

We don't have an output_transform iterator in oneDPL, currently only in SYCLomatic headers. Also, this would not change the need for the layer in the hetero backend to force the copy in of host_iterator output sequences into the buffer in the hetero backend. I don't think we can get around that (unless we start copying in all host_iterators in write mode).

danhoeflinger avatar May 26 '23 13:05 danhoeflinger

The important takeaways are this:

  • write and read_write are very similar or the same for SYCL kernel optimization.
  • We currently don't match SYCL spec with regard to host_iterators and their copy_in behavior dictated by access mode.

With all this in mind, I think the best way forward in this PR is to go back to requiring read_write access mode for transform_if with a TODO to re-evaluate the __get_sycl_range function to also have an optional no_init property to align with the sycl spec. Revert my changes to explicitly force copy in for host iterator because it seems that this will not hurt from an optimization perspective, and is more aligned with the structure of the SYCL spec.

In a future PR, we can add the no_init property to __get_sycl_range, and convert the usages of the __get_sycl_range with write access mode to explicitly have write with no_init where applicable. We can then apply the same rules as the SYCL spec to our host_iterator specialization, copying in sequences with write access and without no_init property. That will allow us to use the write access mode for transform_if, but we will also want to change transform to write with no_init.

In the end, we should have some performance boost from not requiring copy in for kernels where it is unnecessary.

danhoeflinger avatar May 26 '23 13:05 danhoeflinger

After some offline discussion with @MikeDvorskiy and @rarutyun, it has been decided that this PR will be delayed until after a discussion of iterators from std::vector (host iterators).
The use of host iterators and a hetero policy with transform_if will have disappointing performance because of the need to copy the output sequence to the device prior to the kernel (and then copy back after). It seems that these copies are unavoidable, but the question is whether the use of this API with host iterators should be allowed at all.

danhoeflinger avatar May 26 '23 19:05 danhoeflinger

The use of host iterators and a hetero policy with transform_if will have disappointing performance because of the need to copy the output sequence to the device prior to the kernel (and then copy back after). It seems that these copies are unavoidable, but the question is whether the use of this API with host iterators should be allowed at all.

Yes, copying seems unavoidable. I thought of some "smart" wrapper similar to std::optional which could avoid rewriting elements that were not updated on the device; but that would not work either, as such a type would not be device-copyable.

Yet - yes, if we introduce such an algorithm, we do need it to work with all the same input containers as the others. Why do we expect it to have "disappointing " performance? Would not be a SYCL buffer built on top of a std::vector have the same performance?

If we want to avoid extra copying in the described scenario, we could try using a separate buffer of std::optionals, ensure that the implementation initializes all of them on the device (of course to an empty state for elements that do not match the predicate), and then on the host have a separate loop that copies data back to the original sequence omitting the empty elements.

akukanov avatar May 31 '23 13:05 akukanov

@akukanov

Why do we expect it to have "disappointing " performance?

Basically, with an output host iterator we need to copy to the device prior to the kernel and back after the kernel, where many APIs would just require the copy back. However, I don't think this is different from something like for_each used in a similar way.

Would not be a SYCL buffer built on top of a std::vector have the same performance?

It would be the same assuming that the buffer data started on the host before the kernel and was requested again on the host after the kernel. However, a user created SYCL buffer the option to possibly keep the data on the device after the kernel. Using output host iterators explicitly copies in and out no matter the surrounding usage.


If the transform is computationally expensive enough to warrant the transfers, the performance becomes more reasonable. The challenge may be more about documentation and making users aware of what they are doing when directly using host iterators than about implementation of the algorithm.

I'm interested in the temporary std::optional approach you mention. It is something worth considering. I think this would be a specialization specifically for host iterators, since we don't manage the location of data in SYCL buffers or USM memory. We would basically be adding a host-side par_unseq transform_if where the functor is a copy from the temp to the output and the predicate is checking the std::optional for a value, however it removes one direction of a memory transfer to the device.

For transform_if calls with simple functors and predicates using host iterators, the user would still likely have better performance just using a par_unseq policy, but this would mitigate the harm in part.

danhoeflinger avatar May 31 '23 14:05 danhoeflinger

Probably, it makes sense to use dpl::for_each + zip(data1, data2, res) + functor with predicate? In that case we have just one access mode for all sequencies - read_write.

MikeDvorskiy avatar Nov 13 '23 18:11 MikeDvorskiy

Probably, it makes sense to use dpl::for_each + zip(data1, data2, res) + functor with predicate? In that case we have just one access mode for all sequencies - read_write.

@MikeDvorskiy For host_iterator inputs, wouldn't that add an unnecessary copy-back for data1 and data2 which were previously read (copy-in) only?

danhoeflinger avatar Nov 13 '23 18:11 danhoeflinger

For host_iterator inputs, wouldn't that add an unnecessary copy-back for data1 and data2 which were previously read (copy-in) only?

right.... In that case - "transform" + read_write mode for "result".

MikeDvorskiy avatar Nov 14 '23 12:11 MikeDvorskiy

BTW, my general thoughts. Looking at a scenario like this wider - actually we need a way(approach) to "transfer data by mask" between device and host. If we consider an approach when we do copy data to a device and back explicitly, we might implement such copying by mask in a kernel, for example.

MikeDvorskiy avatar Nov 14 '23 13:11 MikeDvorskiy

BTW, my general thoughts. Looking at a scenario like this wider - actually we need a way(approach) to "transfer data by mask" between device and host. If we consider an approach when we do copy data to a device and back explicitly, we might implement such copying by mask in a kernel, for example.

I'm not sure we gain much benefit from this in the "random" case, only in the cases where we have large contiguous blocks of untouched memory (and only with host_iterator output). Also, I'm not sure how one writes to host memory in a kernel unless you are talking about the kernel being a conversion of a device buffer to a contiguous sparse representation to then transfer to the host to then be unpacked.

danhoeflinger avatar Nov 14 '23 13:11 danhoeflinger

For host_iterator inputs, wouldn't that add an unnecessary copy-back for data1 and data2 which were previously read (copy-in) only?

right.... In that case - "transform" + read_write mode for "result".

That is effectively what this PR does. transform calls __pattern_walk2 or __pattern_walk3 based on the overload. transform_if calls the same underlying patterns, but delays the call of that pattern until the hetero backend because that is the only place where adjustment of the access mode is available / necessary.

danhoeflinger avatar Nov 14 '23 13:11 danhoeflinger