ScatterView Improvements
Just some random thoughts before we take this out of experimental:
- make all the `int` enums tag classes
  - would enable `make_scatter_view(view, ScatterSum, ScatterAtomic, ScatterNonDuplicated)`
- should this thing have `extent(..)`?
  - getting the internal view works via `subview`??
- `ScatterValue`:
  - should they be copyable at all?
  - what is the internal handle type?
- remove `join`
- consistently use the update function in operators
Also see #1390 and #1392.
With emphasis on this comment by @crtrott https://github.com/kokkos/kokkos/issues/1390#issuecomment-363463835
And #2276, which is related to @crtrott's third bullet point.
Via Matt: ScatterView should have a way to tell if it is allocated. Using `subview().extent(0)` can segfault, since the default constructor does not construct it safely.
- get rid of `reset_except`; replace it with `reset_duplicates`
- add `is_allocated` to all the data classes
One improvement (and I'm not sure how much it would help; thinking GPU here): if you have a block dimension of (1, 32, 1), you would replicate the data 32x across the blocks and use atomics on these replicated data sets. This way you don't have different grids accessing the atomics...
More thoughts:
- make ScatterView::subview() consistent with View::subview() //semantic mismatch
Hi @janciesko -- is this issue still on your radar?
Here is a bit more on ScatterView; I'm going to start tackling this for real:
Let's start with two use-case scenarios: wrapping and non-wrapping. And let us assume we had decent CTAD and that we had an extents() function from mdspan so we can easily compare dimensions.
Wrapping ScatterView
- Need to scatter into an existing `View` that is used elsewhere in the code
- Do not want the overhead (time and memory capacity) of an extra allocation on GPUs
```cpp
void foo(view_type v) {
  ScatterView sv(v);
  // Kernel to do something with it
  parallel_for(....);
  sv.contribute_into(v);
}
```
Issues:
- On CPUs with duplication I have allocation and deallocation of the duplicates
- awkward that I have to pass `v` again to `contribute_into`; and if I pass something else to `contribute_into`, was `v` then modified or not?
Version where we keep the ScatterView alive
```cpp
struct Foo {
  ScatterView<....> sv;
  void compute(view_type v) {
    // is_allocated() is needed for the case where all extents are static
    if(!sv.is_allocated() || sv.extents() != v.extents()) {
      sv = ScatterView{v};
    } else {
      // set duplicates to zero again if necessary,
      // but don't reset `v` since we may want to add to existing values
      sv.reset_except(v);
    }
    // Use ScatterView
    parallel_for(...);
    sv.contribute_into(v);
  }
};
```
- again, `contribute_into` is awkward, but now so is `reset_except`
Non-Wrapping ScatterView
- I just need to accumulate contributions most efficiently
- Use the result in one place but don't keep it around for other code areas
```cpp
void foo() {
  ScatterView<int**> sv("SV", N, M);
  // Kernel to do some accumulation
  parallel_for(...);
  // get the "master" View
  View<int**> v = sv.subview();
  sv.contribute_into(v);
  // Kernel to do something with v
}
```
Issues:
- Uhm, `subview` to get the "master" `View`?
- ill-defined what the values of `v` are before `contribute_into`
- awkward that I have to pass `v` again to `contribute_into`
Version keeping it around
```cpp
struct Foo {
  ScatterView<....> sv;
  Foo(...) {
    sv = ScatterView{"Name", N, M};
  }
  void compute() {
    sv.reset();
    // do something with sv
    parallel_for(...);
    auto v = sv.subview();
    sv.contribute_into(v);
    // do something with v
    parallel_for(...);
  }
};
```
Fundamental usage proposal:
- combine `reset_except` and `contribute_into` into a `consolidate` function
  - accumulates values from duplicates into the "master" view and resets them to neutral
- add a `rebind(view)` function to streamline reuse in the wrapping use-case
  - would enable reusing the same ScatterView for multiple views of compatible layout (and size?)
- disallow `rebind` when in "Non-Wrapping" mode
- disallow `resize` and `realloc` when in "Wrapping" mode
- rename `subview` to `result_view` or something similar
Let's discuss interfaces first:
```cpp
template<class DataType, class Layout, class DeviceType, class Op, class Duplication, class Contribution>
class ScatterView {
public:
  // Ctors:
  ScatterView() = default;
  template<class ... ViewArgs>
  ScatterView(View<ViewArgs...>);
  template<class ... ViewArgs>
  ScatterView(execution_space, View<ViewArgs...>);
  template<class ... Extents>
  ScatterView(std::string label, Extents ... exts);
  template<class... P, class ... Extents>
  ScatterView(ViewCtorProp<P...>, Extents ... exts);
};
```
@crtrott, @janciesko -- updated ScatterView documentation is ready for review 😄
Here is the usage for this thing for the above mentioned cases:
Wrapping ScatterView
- Need to scatter into an existing `View` that is used elsewhere in the code
- Do not want the overhead (time and memory capacity) of an extra allocation on GPUs
```cpp
void foo(view_type v) {
  ScatterView sv(v);
  // if you need to zero v out, do it here: deep_copy(v, 0)
  // Kernel to do something with it
  parallel_for(....);
  sv.aggregate();
}
```
Version where we keep the ScatterView alive
```cpp
struct Foo {
  ScatterView<....> sv;
  void compute(view_type v) {
    // is_allocated() is needed for the case where all extents are static
    if(!sv.is_allocated() || sv.extents() != v.extents()) {
      sv = ScatterView{v};
    } else {
      sv.rebind(v);
    }
    // Use ScatterView
    parallel_for(...);
    sv.aggregate();
  }
};
```
Non-Wrapping ScatterView
- I just need to accumulate contributions most efficiently
- Use the result in one place but don't keep it around for other code areas
```cpp
void foo() {
  ScatterView<int**> sv("SV", N, M);
  // Kernel to do some accumulation
  parallel_for(...);
  sv.aggregate();
  // get the "master" View
  View<int**> v = sv.target_view();
  // Kernel to do something with v
}
```
Version keeping it around
```cpp
struct Foo {
  ScatterView<....> sv;
  Foo(...) {
    sv = ScatterView{"Name", N, M};
  }
  void compute() {
    // resetting values
    deep_copy(sv.target_view(), 0);
    // do something with sv
    parallel_for(...);
    sv.aggregate();
    // do something with the target view
    parallel_for(...);
  }
};
```
And a compact test case:
```cpp
template<class SV>
void test_scatter_view(SV as) {
  auto a = as.target_view();
  EXPECT_EQ(a(0), 1);
  // ScatterView does not overwrite values in "target_view"
  Kokkos::parallel_for(20, KOKKOS_LAMBDA(int i) {
    auto s = as.access();
    s(i % 10) += 10;
  });
  as.aggregate();
  EXPECT_EQ(a(0), 21);
  // the previous aggregate resets potential duplicates but not the whole thing,
  // so you can run multiple kernels before aggregating
  Kokkos::parallel_for(20, KOKKOS_LAMBDA(int i) {
    auto s = as.access();
    s(i % 10) += 11;
  });
  Kokkos::parallel_for(20, KOKKOS_LAMBDA(int i) {
    auto s = as.access();
    s(i % 10) += 12;
  });
  as.aggregate();
  EXPECT_EQ(a(0), 67);
  // setting target_view back to 0 means we start from a blank slate
  Kokkos::deep_copy(a, 0);
  Kokkos::parallel_for(20, KOKKOS_LAMBDA(int i) {
    auto s = as.access();
    s(i % 10) += 13;
  });
  as.aggregate();
  EXPECT_EQ(a(0), 26);
}
```
```cpp
template<class Layout, class Strategy>
void test_scatter_view() {
  using sv_t = Kokkos::Experimental::
      ScatterView<int*, Layout, Kokkos::DefaultExecutionSpace,
                  Kokkos::Experimental::ScatterSum, Strategy>;
  // wrapping
  {
    Kokkos::View<int*> a("A", 10);
    Kokkos::deep_copy(a, 1);
    sv_t as(a);
    test_scatter_view(as);
    Kokkos::printf("%i\n", a(0));
  }
  // non-wrapping
  {
    sv_t as("ScatterView", 10);
    auto at = as.target_view();
    Kokkos::deep_copy(at, 1);
    test_scatter_view(as);
  }
}
```