[KT] Single Pass Copy_if Kernel Template
This PR adds a pair of APIs (iterator and range variants) for:
oneapi::dpl::experimental::kt::gpu::copy_if
Each variant takes an input sequence, an output sequence, a single-element sequence that receives the number of elements copied (this count is left on the device), and a predicate. It copies every input element that satisfies the predicate to the output, preserving the relative order of the elements, and records how many elements were copied.
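For reviewers who want the intended semantics spelled out, here is a minimal host-side reference, assuming nothing about the kernel template's actual signature. The name `reference_copy_if` and its parameters are hypothetical and exist only for illustration; the function returns the count on the host, whereas the KT leaves it in a device-side single-element sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sequential reference for the semantics described above.
// This is NOT the kernel template and does not mirror its signature.
template <typename T, typename Pred>
std::size_t reference_copy_if(const std::vector<T>& in, std::vector<T>& out, Pred pred)
{
    // Worst case: every element satisfies the predicate.
    assert(out.size() >= in.size());

    std::size_t num_copied = 0;
    for (const T& value : in)
    {
        if (pred(value))
        {
            // Appending in input order preserves the relative order.
            out[num_copied++] = value;
        }
    }
    return num_copied;
}
```

A test for the new APIs could compare the device output and the device-side count against a reference of this shape.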
Other additions within this PR:
- Tests for these new APIs
Additional notable details:
- Refactor of oneDPL mainline copy_if single workgroup implementation to lift out the "copy to host" of the num copied return value one level and enable use by the new kernel template
- Refactor of scan kernel template to share lookback phase, allocation manager with copy_if
- Adjust lookback phase to rely upon the last subgroup / last work-item rather than the first subgroup / first work-item to do operations which we want only a single subgroup or work-item to do. This enables propagation of "running" scan values without extra intra-workgroup communication for copy_if. I don't believe this change negatively impacts scan KT.
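To illustrate why the last work-item is the natural place for the "solo" work in copy_if, here is a rough sequential model, not the PR's kernel. It assumes each tile stands in for a work-group and each position in a tile for a work-item; the sequential `carry` variable stands in for the value a real lookback would read from predecessor tiles. All names (`tiled_copy_if_model`, `tile_size`, `carry`) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of a tiled, single-pass copy_if.
// The loop over tiles replaces the decoupled lookback of a real kernel.
template <typename T, typename Pred>
std::size_t tiled_copy_if_model(const std::vector<T>& in, std::vector<T>& out, Pred pred,
                                std::size_t tile_size)
{
    assert(tile_size > 0 && out.size() >= in.size());

    std::size_t carry = 0;                     // elements copied by all previous tiles
    std::vector<std::size_t> incl(tile_size);  // per-position inclusive scan of flags

    for (std::size_t base = 0; base < in.size(); base += tile_size)
    {
        const std::size_t n = std::min(tile_size, in.size() - base);

        // Inclusive scan of the predicate flags within the tile.
        std::size_t running = 0;
        for (std::size_t i = 0; i < n; ++i)
        {
            running += pred(in[base + i]) ? 1 : 0;
            incl[i] = running;
        }

        // Scatter: the exclusive scan value is the offset within the tile.
        for (std::size_t i = 0; i < n; ++i)
        {
            const std::size_t excl = (i == 0) ? 0 : incl[i - 1];
            if (incl[i] != excl)  // the flag at position i was set
                out[carry + excl] = in[base + i];
        }

        // The last position already holds the tile total, so it can publish
        // the running count without gathering data from other positions.
        carry += incl[n - 1];
    }
    return carry;
}
```

In this model the value that must be passed on to the next tile sits in the last position after the inclusive scan, which matches the motivation given above for moving the single-work-item operations to the last subgroup / last work-item.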
Adapted from previous work by AidanBeltonS, Alcpz, joeatodd, adamfidel
Interestingly, there was originally a regression (~10%) in scan performance from using the last subgroup and the last work-item of that subgroup, and from originating the broadcast at the last work-item of the workgroup, rather than using the zeroth of each to perform the "solo" actions in the lookback.
I do not understand why this would be. copy_if needs to use the last subgroup and work-item here to take advantage of where the data to be communicated already resides. I've adjusted the shared lookback helper function to let each algorithm dictate the active subgroup, the active work-item, and the source of the broadcast, and this repaired the performance regression for scan.
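As a loose sketch of that idea, and not the helper actually used in this PR, the choice of acting work-item can be expressed as a small policy type supplied by each algorithm. The names `first_lane`, `last_lane` and `broadcast_from_selected` are hypothetical, and the vector stands in for the values held by the work-items of one group.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical policies: which position in the group performs the "solo" action.
struct first_lane
{
    static std::size_t pick(std::size_t /*group_size*/) { return 0; }
};

struct last_lane
{
    static std::size_t pick(std::size_t group_size) { return group_size - 1; }
};

// Host-side stand-in for a group broadcast: returns the value held by the
// position that the Selector policy chooses.
template <typename Selector>
std::size_t broadcast_from_selected(const std::vector<std::size_t>& lane_values)
{
    assert(!lane_values.empty());
    return lane_values[Selector::pick(lane_values.size())];
}
```

With inclusive-scan values such as {0, 1, 2} in a group, `last_lane` selects 2, the group total that copy_if needs to propagate, while an algorithm like scan could keep selecting the first position through `first_lane`.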
I suggest prioritizing #1762 and #1763 over this PR for now. If those go through, the performance of the oneDPL main copy_if API will supersede this KT. If we see significant risk that the above-mentioned PRs will miss the release, we should pivot to merging this PR. Once those merge, I will remove this PR from the release milestone until further improvements can be incorporated into this KT that give it value on its own.
At this point, I think this PR is more difficult to land in 2022.7.0 than the first two reduce_then_scan PRs, and it provides worse performance, so I'm pulling it from the milestone and converting it to a draft.
This may resurface with concepts from reduce_then_scan combined with the lookback to provide enhanced performance over mainline oneDPL, but for now we shouldn't be prioritizing this.