# [WIP] Introducing execution and launch policies
Addresses issues #272 #104 #216
Related to launch policies: Task concept specification (see #193)
I think I got the required concepts figured out. This summary is mostly relevant to @devreal @fmoessbauer @fuerlinger
## Execution Policies
In C++17, execution policies are introduced in algorithm function interfaces and passed as the first argument:

```cpp
template< class ExecutionPolicy, class InputIt, class UnaryFunction2 >
void for_each( ExecutionPolicy&& policy, InputIt first, InputIt last, UnaryFunction2 f );
```
It is therefore trivial to specialize / overload algorithm interfaces for execution policies. From the code-mechanics perspective, we can comfortably extend the STL execution policy concept: introducing additional traits and some CRTP adapters should do.
The actual effort is in the rather formal aspects. Which is nice, because science.
In principle, this boils down to additional PGAS-specific dimensions for execution policies. Some examples:

- collaboration, such as `collective`, `asymmetric`, `leader`
- synchronization, such as `nowait` (as known from OpenMP, would skip implicit barriers)
- unit-level parallelization, i.e. the vanilla STL execution policies; e.g. `parallel` would allow units to spawn OpenMP threads

... you get the idea. Policies can be arbitrarily combined with `foo_policy | bar_policy` as usual.
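To make the code-mechanics idea concrete, here is a minimal sketch of combinable policy tags; the names `combined_policy` and `has_policy` are hypothetical helpers, not existing DASH interfaces:

```cpp
#include <type_traits>

// Hypothetical tag types for the policy dimensions named above:
struct collective_policy {};
struct nowait_policy     {};
struct parallel_policy   {};

// A combination of policies is a type deriving from all of them:
template <typename... Policies>
struct combined_policy : Policies... {};

// foo_policy | bar_policy folds two policies into one combined type:
template <typename PolicyA, typename PolicyB>
constexpr combined_policy<PolicyA, PolicyB> operator|(PolicyA, PolicyB) {
  return {};
}

// Algorithms can then test for a single policy dimension via a trait:
template <typename Policy, typename Dimension>
using has_policy = std::is_base_of<Dimension, Policy>;

// Usage:
//   auto policy = collective_policy{} | nowait_policy{};
//   static_assert(has_policy<decltype(policy), nowait_policy>::value, "");
```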
## Launch Policies
STL concepts do not cover our requirements regarding launch policies.
In vanilla C++, asynchronous and deferred execution is provided by `std::launch`, `std::async`, and related function interfaces; algorithms are agnostic of launch policies:

```cpp
template< class Function, class... Args >
std::future<typename std::result_of<Function(Args...)>::type>
    async( std::launch policy, Function&& fun, Args&&... args );
```

Parameter `fun` is the actual callable object.
Our algorithm variants depend on launch policies, however. For example, asynchronous copying (`dash::copy_async`) and blocking copying (`dash::copy`) are independent interfaces, and I don't think there is a practicable alternative that is correct and efficient in the general case.
Implementing blocking and deferred variants as wrappers around a single asynchronous implementation seems sensible, but in fact it hurts performance pretty badly.
I sketched out some alternatives, like:

- Grouping algorithms by launch policy in namespaces, like `dash::async::copy`. Does not allow combination of policies, so it's not an option.
- Meta-magic: `dash::async` could just replace callables depending on the launch policy. For example, passing `dash::copy` with `dash::launch::async` would invoke `dash::copy_async` instead. Nearly impossible to maintain and extremely irritating, as the executed task is different from the specified callable.
So far, my initial approach based on phantom and container proxy types looks like the best option to me. Instead of namespaces or function template specializations, algorithm variants are realized as overloads on iterator traits: an iterator for async operations would have `dash::iterator_traits<It>::async_policy_tag` specified, roughly as in the following sketch.
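For illustration, a minimal tag-dispatch sketch; the detection helper `launch_policy_of` and the mock tags are hypothetical stand-ins for the `dash::iterator_traits` machinery:

```cpp
#include <type_traits>

struct blocking_tag {};
struct async_tag    {};

// Detect whether an iterator type advertises an async policy tag
// (standing in for dash::iterator_traits<It>::async_policy_tag):
template <typename It, typename = void>
struct launch_policy_of { using type = blocking_tag; };

template <typename It>
struct launch_policy_of<It, std::void_t<typename It::async_policy_tag>> {
  using type = async_tag;
};

// Independent implementations, selected by tag dispatch:
template <typename It>
void copy_impl(It, It, blocking_tag) { /* blocking one-sided copy */ }

template <typename It>
void copy_impl(It, It, async_tag)    { /* issue non-blocking copy  */ }

template <typename It>
void copy(It first, It last) {
  copy_impl(first, last, typename launch_policy_of<It>::type{});
}
```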
Containers like `dash::Array` use adapters or proxies to convert iterators between policies. This is already implemented in `dash::Array<T>::async` and allows usage like this:
```cpp
dash::Array<X> array(size);

// Default:
auto copy_end = dash::copy(array.begin(), array.end(), dest);
// -> dash::Array<X>::iterator

// Async:
auto copy_end = dash::copy(array.async.begin(), array.async.end(), dest);
// current implementation:
// -> dash::GlobAsyncIter<T>
// will be:
// -> dash::Future<dash::Array<X>::iterator>

// Lazy:
auto copy_end = dash::copy(array.deferred.begin(), array.deferred.end(), dest);
// -> dash::Array<X>::iterator
```
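Assuming the future variant ends up with a `std::future`-like interface (a sketch, not the final design), the async overload then allows overlapping communication with computation:

```cpp
// Sketch: consuming the future returned by the async overload; assumes
// fut.get() blocks until the transfer completed (e.g. via a flush):
auto fut = dash::copy(array.async.begin(), array.async.end(), dest);
// ... overlap the transfer with local computation here ...
auto copy_end = fut.get();  // wait for completion, obtain the end iterator
```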
This is very close to the STL specs, and there is no way for developers to get it wrong. No maintenance effort in algorithms at all, no coupling of concerns, much rejoice.
But also, it is important to note that we now can actually define `dash::async` and `dash::launch` like in the STL, and replace iterator parameters automatically (but, different from replacing the callable parameter as described above, with robust semantics):
```cpp
// Developer writes:
auto fut_copy_end = dash::async(
                      dash::launch::async,
                      dash::copy<Foo>,
                      array.begin(),
                      array.end());

// --> resolves to:
auto fut_copy_end = dash::async(
                      dash::launch::async,
                      dash::copy<Foo>,
                      array.begin().async,  // same as array.async.begin()
                      array.end().async);   // same as array.async.end()
```
This is semantically and conceptually robust:

- `GlobIter<T, ..., dash::launch::async>` satisfies `GlobIter<T, ..., dash::launch::any>`, and
- the developer specified the launch policy explicitly, so the principle of least astonishment is respected; the behavior and signature correspond to `std::async`.
So long story short: there is a chance we can have our cake and eat it, too.
In the current implementation in branch `development`, `array.async` is a fully specialized proxy accessor with `dash::ArrayAsyncRef<T>::iterator = GlobAsyncIter<T>`.
I added those as a proof of concept; actually, these specializations are not necessary for every policy, we can use traits as well.
This is preferable, as otherwise we would have to define a fully specialized type for every combination of launch policies, and adding a policy would be the opposite of fun.
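For reference, a condensed sketch of the proxy-accessor pattern described here; member definitions are omitted and the wiring is an assumption, not a copy of the actual `development` code:

```cpp
template <typename T> class Array;          // forward declaration
template <typename T> class GlobAsyncIter;  // async-tagged iterator

// Proxy accessor that converts the container's iterators to the async policy:
template <typename T>
class ArrayAsyncRef {
public:
  using iterator = GlobAsyncIter<T>;
  explicit ArrayAsyncRef(Array<T> & array) : _array(array) { }
  iterator begin();  // wraps _array.begin() in an async-tagged iterator
  iterator end();    // wraps _array.end() accordingly
private:
  Array<T> & _array;
};

template <typename T>
class Array {
public:
  // ...
  ArrayAsyncRef<T> async { *this };  // enables array.async.begin() etc.
};
```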
Looking forward to your thoughts!
Thanks for this brain dump! I think I get the idea.
## On Execution Policies
I think this is clear, and it helps to get rid of some of the confusion we currently have, e.g., which unit receives the results of a `dash::accumulate` call. One thing to note, though:

> unit-level parallelization, the vanilla STL execution policies, e.g. `parallel` would allow units to spawn OpenMP threads
Keep in mind that DART will eventually have its own thread pool, independent of OpenMP. OpenMP is just too restrictive to use as a general-purpose task scheduler (unfortunately). Executing tasks in DART and OpenMP in parallel likely leads to contention and performance degradation.
## On Launch Policies
Just to put this into the perspective of general tasking in DASH: calling `dash::async` on `dash::copy` with `dash::launch::async` will convert `array.begin()` to `array.begin().async`, which issues an asynchronous copy, and the returned `dash::Future` can be used to wait for completion (through a `flush`), right?
Now, how is this implemented for other DASH algorithms? Do we provide overloads for them as well? For example, combining `dash::launch::async` with the `collective` execution policy sounds like a bad idea...
My idea of tasking in DASH currently is that calling `dash::async` with `dash::launch::async` specified on any other function will result in an asynchronously executed action, i.e., a task that is scheduled through DART to be executed at some point in time. Again, the `dash::Future` is used to wait for completion. The DART side of this is pretty much there (although maybe not extensively tested).
There are two concepts that I would like to raise awareness of (a usage sketch of the second follows below):

- **Task dependencies**: I've been working on a way to define an ordering of tasks through data dependencies in DART, by specifying arbitrary memory locations in global memory (similar to OpenMP task data dependencies). I haven't thought much about the DASH abstraction, although I think `dash::async` would suit it well. This would require `dash::async` to optionally accept dependency definitions.
- **Fire-and-forget or bulk tasks**: Similar to OpenMP tasks, users may want to define tasks without explicitly waiting for them individually. Instead, tasks are defined in bulk and waited for by a single statement. This is almost necessary if you want to taskify one or multiple iterations of your algorithm. We could either introduce a launch policy for `dash::async` for this type of task (returning an empty `dash::Future`) or have a separate abstraction for it (such as `dash::Task::create` and `dash::Task::complete`). I'm OK with both, I just want to make sure we are covering this use case.
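A hypothetical usage sketch of the bulk-task variant; `dash::Task::create` / `dash::Task::complete` are the proposed (not yet existing) names, and `process_block` / `num_local_blocks` are placeholders for user code:

```cpp
// Define tasks in bulk, fire-and-forget style:
for (std::size_t b = 0; b < num_local_blocks; ++b) {
  dash::Task::create([=]() { process_block(b); });
}
// Single synchronization point for all tasks created above:
dash::Task::complete();
```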
We should definitely discuss this in Garching. I think we are arriving at a state where the tasking in DART is ready to be made available (though maybe not in its full feature-set). I've been holding it back until after the (ominous) next release so far.
@devreal

> Keep in mind that DART will eventually have its own thread pool independent of OpenMP

Of course! I'm just using OpenMP as a reference; we do want light-weight tasks.
> My idea of tasking in DASH currently is that calling dash::async with dash::launch::async specified on any other function will result in an asynchronously executed action

That's not a contradiction: the conversion to `array.begin().async` is internal; the visible part of the interface is just the return type `dash::Future<T>`, which is also what tasks would return.
It's definitely important to support "real" task models like yours; I kept that in mind in my proposal.
But on the other hand, how would you replace the current `dash::copy_async` by tasks?
Asynchronous one-sided MPI will be impossible to beat in efficiency; wrapping a blocking `dash::copy` in a task would not help.
I assume that eventually the current async interface of DART will be wrapped in a proper task interface, so the C++ layer could prefer tasks to abstract async operations. Still, we must be able to support these "async-specialized implementation" cases like `dash::copy_async`, too.
We need both, and I think / hope the proposed interface could do.
> But on the other hand, how would you replace the current `dash::copy_async` by tasks?

Not at all, it does not make sense to implement `dash::copy_async` in terms of tasks, as the time is mostly spent in data transfer. I was just wondering about the other DASH algorithms :)
> We need both, and I think / hope the proposed interface could do.

Absolutely, I just wanted to make sure we are on the same page here. I will present my current status in Garching, and based on that we can discuss the details of the tasking abstraction in its full beauty :)
Ah, I got a good feeling about this.
Ehhmm, when exactly is the SPPEXA name-dancing in Garching again?
Ah, here we go:
http://www.sppexa.de/sppexa-activities/annual-plenary-meeting/2017.html
That's a good time frame. Did you happen to review HPX and Kokkos (also: Alpaka) yet? I'm about to make DASH views and Kokkos views fancy each other.
I had a look at Kokkos (see my comments in #193) and will attend a workshop on HPX here at HLRS next week. I haven't looked at Alpaka yet. Could that be relevant as well?
Yes, Alpaka is relevant as well. I'm just here at the Alpaka Kickoff meeting, will report back tomorrow.
@fuerlinger Gotcha. If you can provide us with some material I'll be happy to look at it before the Garching meeting.
Can we get this into 0.3.0? If not, I would start working on a quick fix for `dash::accumulate` and document the deficiencies so we can close #216.
@devreal I can get this into 0.3.0 this week.
@fuchsto May I hijack this PR to add the few lines needed for the Alpaka `for_each` and maybe `transform` examples? It currently does not rely on these changes, but it wouldn't be much work to base the MEPHISTO stuff (in the MEPHISTO repo) on `dash/include/dash/Execution.h`.