oneDPL
oneDPL copied to clipboard
Adding transform_if algorithm and tests
Implementation and tests for transform_if are added in order to implement scatter_if and gather_if algorithms. Tests for scatter, gather, scatter_if, and gather_if will be removed from this PR once they are merged into the CT repo.
transform_if interface:
template <typename Policy, typename InputIter1, typename InputIter2, typename OutputIter, typename UnaryOperation, typename Predicate> OutputIter transform_if(Policy&& policy, InputIter1 first, InputIter1 last, InputIter2 mask, OutputIter result, UnaryOperation op, Predicate pred);
transform_if semantics:
This transform_if interface conditionally applies a unary operation on each element of a source range and stores the result in the corresponding index of a result range if the corresponding index in the mask range satisfies a predicate. If the index in the mask range does not satisfy the predicate, then no change to the corresponding index in the result range occurs.
For each iterator i in the range [first, last) such that pred(*(mask + (i - first))) is true, the value op(*i) is assigned to *(result + (i - first)).
I'd suggest to rename the Summary of the PR
This patch is interesting but needs some more work. In particular, I think the semantics of transform_if should follow that of the regular transform algorithm, just extending it with the predicate. I.e. there should be two overloads:
template <typename Policy, typename InputIter, typename OutputIter, typename UnaryOperation, typename UnaryPredicate>
OutputIter transform_if(Policy&& policy, InputIter first, InputIter last, OutputIter result, UnaryOperation op, UnaryPredicate pred);
template <typename Policy, typename InputIter1, typename InputIter2, typename OutputIter, typename BinaryOperation, typename BinaryPredicate>
OutputIter transform_if(Policy&& policy, InputIter1 first, InputIter1 last, InputIter2 first2, OutputIter result, BinaryOperation op, BinaryPredicate pred);
The second overload would be used in the implementation of gather_if / scatter_if, with the predicate that only checks the mask (the second input) and the transformation that, on the opposite, ignores the mask and only takes the first input. Alternatively, the unary form can be used in the combination with zip_iterator.
@akukanov
This patch is interesting but needs some more work. In particular, I think the semantics of transform_if should follow that of the regular transform algorithm, just extending it with the predicate.
As mentioned in the update to the description, I've refactored this to match the API of transform
with an added predicate, and adjusted the implementation to reuse existing code patterns/bricks.
@danhoeflinger is it possible to fix clang format here as part of our common process? Or current format is ok?
I believe that we dont need up to 90% of these changes. Because a
transform_if
semantic is trivially implementing by a calltransform
algorithm with composition a passedtransform functor
andcondition predicate
int one functor which is passing to the current oneDPLtransform
algorithm. Also the current test "transform" can be extended by an additional call "transform_if", and we don't need to add new test for that.
@MikeDvorskiy
I don't believe that this can be trivially implemented by the current transform
algorithm. The fundamental difference is that transform
always writes the output of a unary or binary functor to the output range. transform_if
only writes to the output range for elements which have satisfied the predicate. If the predicate is false, the original value of the output element is preserved. This is the reason for the change to allow __pattern_walk_3
to accept read_write
access_mode where it previously hard coded a write only access mode for the output range.
Internally transform
and transform_if
use a lot of similar infrastructure (__pattern_walk*
), but have that fundamental difference. I believe the easiest way to implement transform_if
with existing APIs is via std::for_each
, and zip_iterators
and custom functor composition.
I believe that we dont need up to 90% of these changes. Because a
transform_if
semantic is trivially implementing by a calltransform
algorithm with composition a passedtransform functor
andcondition predicate
int one functor which is passing to the current oneDPLtransform
algorithm. Also the current test "transform" can be extended by an additional call "transform_if", and we don't need to add new test for that.@MikeDvorskiy I don't believe that this can be trivially implemented by the current
transform
algorithm. The fundamental difference is thattransform
always writes the output of a unary or binary functor to the output range.transform_if
only writes to the output range for elements which have satisfied the predicate. If the predicate is false, the original value of the output element is preserved. This is the reason for the change to allow__pattern_walk_3
to acceptread_write
access_mode where it previously hard coded a write only access mode for the output range.Internally
transform
andtransform_if
use a lot of similar infrastructure (__pattern_walk*
), but have that fundamental difference. I believe the easiest way to implementtransform_if
with existing APIs is viastd::for_each
, andzip_iterators
and custom functor composition.
I see... But in any case you can re-use oneapi::dpl::__internal::__pattern_walk2 and oneapi::dpl::__internal::__pattern_walk2 by passing special functor with conditional assignment. So, it is enough to define a couple of functions like that: https://github.com/oneapi-src/oneDPL/blob/main/include/oneapi/dpl/pstl/glue_algorithm_impl.h#L332 https://github.com/oneapi-src/oneDPL/blob/main/include/oneapi/dpl/pstl/glue_algorithm_impl.h#L347
Or other words, it seems we dont need new patterns __pattern_walk2_transform_if and __pattern_walk3_transform_if.
I see... But in any case you can re-use oneapi::dpl::__internal::__pattern_walk2 and oneapi::dpl::__internal::__pattern_walk2 by passing special functor with conditional assignment. So, it is enough to define a couple of functions like that: https://github.com/oneapi-src/oneDPL/blob/main/include/oneapi/dpl/pstl/glue_algorithm_impl.h#L332 https://github.com/oneapi-src/oneDPL/blob/main/include/oneapi/dpl/pstl/glue_algorithm_impl.h#L347
Or other words, it seems we dont need new patterns __pattern_walk2_transform_if and __pattern_walk3_transform_if.
@MikeDvorskiy
As described offline, before this PR, our range implementation only copies host iterator data to the device prior to a kernel when read
or read_write
access mode is requested. For transform_if
, we don't need read
access for our output range, but we do need to copy the output range to the device prior to the kernel to prevent untouched elements from being overwritten by uninitialized data on the copy back when the predicate is false
.
We do this by adding a template parameter, _ForceCopyDirectForHostIter
, to __get_sycl_range
which forces this copy (for host iterators) without requiring read_write
access mode in the kernel itself.
This is sycl implementation specific, so to communicate that extra requirement to __pattern_walk[2,3]
, we need another layer specific to transform_if in include/oneapi/dpl/pstl/hetero/algorithm_impl_hetero.h
. This is why we need __pattern_walk[2,3]_transform_if
.
For
transform_if
, we don't needread
access for our output range
Just "write" means read as well. I clarified it some time ago... Dont remember which way... probably, SYCL spec + discussion with guys who are responsible for SYCL language.
To be honest, I don't understand the problem with host iterator and "copy back" feature, which is a kind of implementation details.
So, summary: We have a couple of common middle layer algo patterns - pattern_walk2 and pattern_walk3.
- pattern_walk2 by default accepts the first sequence for reading and the second one for writing (reading and writing).
- pattern_walk3 by default accepts the first sequence and the second one for reading and the third one for writing (reading and writing). For USM it mode accesses don't matter, right.? For SYCL buffers => we have correct mode access by default, right? For the host iterators: The input sequences are being copied just one direction - to device. The output sequence is being copied just one direction - to the host.
Which my sentence is not correct? and where we have an issue here?
BTW, other idea = to try just transform with conditinal_transform functor and output_transform iterator with std::ignore
Just "write" means read as well. I clarified it some time ago... Dont remember which way... probably, SYCL spec + discussion with guys who are responsible for SYCL language.
Yes you are correct, I just confirmed this with James Brodman, write
implies a copy in of the data for SYCL buffers unless property::no_init
is also specified.
However, before this PR, our host_iterator specialization of __get_sycl_range
does not initialize the buffer with incoming data when write
access mode is requested (see copy_direct_tag
below):
https://github.com/oneapi-src/oneDPL/blob/07de70c504282266b7d5796c808e787b70e471fb/include/oneapi/dpl/pstl/hetero/dpcpp/utils_ranges_sycl.h#L502-L526
Therefore, for incoming host iterators:
- For the host iterators: The input sequences are being copied just one direction - to device. The output sequence is being copied just one direction - to the host.
This is the problem ^. We need to copy the output sequence to the device as well. SYCL does this with buffers in write
access mode, but we don't for incoming host iterators.
For instance, if we have a predicate that always returns false
, all initial data in the output sequence is the appropriate final output. We are copying back to the host, so we need to copy initial output sequence to the device, otherwise we will overwrite with it uninitialized data.
If we make the change to always copy host iterators in with write
access mode, it will be a slowdown in all patterns which use that, so I opted to only do it for transform_if
where it is explicitly required.
General note: We may want to consider using property::no_init
in more cases to avoid this copy in.
other idea = to try just transform with conditinal_transform functor and output_transform iterator with std::ignore
We don't have an output_transform iterator in oneDPL, currently only in SYCLomatic headers. Also, this would not change the need for the layer in the hetero backend to force the copy in of host_iterator output sequences into the buffer in the hetero backend. I don't think we can get around that (unless we start copying in all host_iterators in write mode).
The important takeaways are this:
-
write
andread_write
are very similar or the same for SYCL kernel optimization. - We currently don't match SYCL spec with regard to host_iterators and their copy_in behavior dictated by access mode.
With all this in mind, I think the best way forward in this PR is to go back to requiring read_write
access mode for transform_if
with a TODO to re-evaluate the __get_sycl_range
function to also have an optional no_init
property to align with the sycl spec. Revert my changes to explicitly force copy in for host iterator because it seems that this will not hurt from an optimization perspective, and is more aligned with the structure of the SYCL spec.
In a future PR, we can add the no_init
property to __get_sycl_range
, and convert the usages of the __get_sycl_range
with write
access mode to explicitly have write
with no_init
where applicable. We can then apply the same rules as the SYCL spec to our host_iterator specialization, copying in sequences with write
access and without no_init
property. That will allow us to use the write
access mode for transform_if
, but we will also want to change transform
to write
with no_init
.
In the end, we should have some performance boost from not requiring copy in for kernels where it is unnecessary.
After some offline discussion with @MikeDvorskiy and @rarutyun, it has been decided that this PR will be delayed until after a discussion of iterators from std::vector (host iterators).
The use of host iterators and a hetero policy with transform_if
will have disappointing performance because of the need to copy the output sequence to the device prior to the kernel (and then copy back after). It seems that these copies are unavoidable, but the question is whether the use of this API with host iterators should be allowed at all.
The use of host iterators and a hetero policy with transform_if will have disappointing performance because of the need to copy the output sequence to the device prior to the kernel (and then copy back after). It seems that these copies are unavoidable, but the question is whether the use of this API with host iterators should be allowed at all.
Yes, copying seems unavoidable. I thought of some "smart" wrapper similar to std::optional
which could avoid rewriting elements that were not updated on the device; but that would not work either, as such a type would not be device-copyable.
Yet - yes, if we introduce such an algorithm, we do need it to work with all the same input containers as the others. Why do we expect it to have "disappointing " performance? Would not be a SYCL buffer built on top of a std::vector
have the same performance?
If we want to avoid extra copying in the described scenario, we could try using a separate buffer of std::optional
s, ensure that the implementation initializes all of them on the device (of course to an empty state for elements that do not match the predicate), and then on the host have a separate loop that copies data back to the original sequence omitting the empty elements.
@akukanov
Why do we expect it to have "disappointing " performance?
Basically, with an output host iterator we need to copy to the device prior to the kernel and back after the kernel, where many APIs would just require the copy back. However, I don't think this is different from something like for_each
used in a similar way.
Would not be a SYCL buffer built on top of a
std::vector
have the same performance?
It would be the same assuming that the buffer data started on the host before the kernel and was requested again on the host after the kernel. However, a user created SYCL buffer the option to possibly keep the data on the device after the kernel. Using output host iterators explicitly copies in and out no matter the surrounding usage.
If the transform is computationally expensive enough to warrant the transfers, the performance becomes more reasonable. The challenge may be more about documentation and making users aware of what they are doing when directly using host iterators than about implementation of the algorithm.
I'm interested in the temporary std::optional approach you mention. It is something worth considering. I think this would be a specialization specifically for host iterators, since we don't manage the location of data in SYCL buffers or USM memory. We would basically be adding a host-side par_unseq
transform_if
where the functor is a copy from the temp to the output and the predicate is checking the std::optional
for a value, however it removes one direction of a memory transfer to the device.
For transform_if
calls with simple functors and predicates using host iterators, the user would still likely have better performance just using a par_unseq
policy, but this would mitigate the harm in part.
Probably, it makes sense to use dpl::for_each + zip(data1, data2, res) + functor with predicate? In that case we have just one access mode for all sequencies - read_write.
Probably, it makes sense to use dpl::for_each + zip(data1, data2, res) + functor with predicate? In that case we have just one access mode for all sequencies - read_write.
@MikeDvorskiy For host_iterator inputs, wouldn't that add an unnecessary copy-back for data1 and data2 which were previously read (copy-in) only?
For host_iterator inputs, wouldn't that add an unnecessary copy-back for data1 and data2 which were previously read (copy-in) only?
right.... In that case - "transform" + read_write mode for "result".
BTW, my general thoughts. Looking at a scenario like this wider - actually we need a way(approach) to "transfer data by mask" between device and host. If we consider an approach when we do copy data to a device and back explicitly, we might implement such copying by mask in a kernel, for example.
BTW, my general thoughts. Looking at a scenario like this wider - actually we need a way(approach) to "transfer data by mask" between device and host. If we consider an approach when we do copy data to a device and back explicitly, we might implement such copying by mask in a kernel, for example.
I'm not sure we gain much benefit from this in the "random" case, only in the cases where we have large contiguous blocks of untouched memory (and only with host_iterator output). Also, I'm not sure how one writes to host memory in a kernel unless you are talking about the kernel being a conversion of a device buffer to a contiguous sparse representation to then transfer to the host to then be unpacked.
For host_iterator inputs, wouldn't that add an unnecessary copy-back for data1 and data2 which were previously read (copy-in) only?
right.... In that case - "transform" + read_write mode for "result".
That is effectively what this PR does. transform
calls __pattern_walk2
or __pattern_walk3
based on the overload. transform_if
calls the same underlying patterns, but delays the call of that pattern until the hetero backend because that is the only place where adjustment of the access mode is available / necessary.