
[WIP] Introducing execution and launch policies

Open fuchsto opened this issue 8 years ago • 13 comments

Addresses issues #272, #104, #216

fuchsto avatar Feb 24 '17 00:02 fuchsto

Related to launch policies: Task concept specification (see #193)

fuchsto avatar Feb 24 '17 00:02 fuchsto

I think I got the required concepts figured out. This summary is mostly relevant to @devreal @fmoessbauer @fuerlinger

Execution Policies

In C++17, execution policies are introduced in algorithm function interfaces and passed as first argument:

template< class ExecutionPolicy, class InputIt, class UnaryFunction2 >
void for_each( ExecutionPolicy&& policy, InputIt first, InputIt last, UnaryFunction2 f );

It is therefore trivial to specialize / overload algorithm interfaces for execution policies. From the code-mechanics perspective, we can comfortably extend the STL execution policy concept. Introducing additional traits and some CRTP adapters should do.

The actual effort lies in the more formal aspects. Which is nice, because science.

In principle, this boils down to additional PGAS-specific dimensions for execution policies. Some examples:

  • collaboration, such as collective, asymmetric, leader
  • synchronization such as nowait (as known from OpenMP, would skip implicit barriers)
  • unit-level parallelization, the vanilla STL execution policies, e.g. parallel would allow units to spawn OpenMP threads

... you get the idea. Policies can be arbitrarily combined with foo_policy | bar_policy as usual.

Launch Policies

STL concepts do not cover our requirements regarding launch policies. In vanilla C++, asynchronous and deferred execution is provided by std::launch, std::async, and related function interfaces, and algorithms are agnostic of launch policies:

template< class Function, class... Args >
std::future<typename std::result_of<Function(Args...)>::type>
    async( std::launch policy, Function&& fun, Args&&... args );

Parameter fun is the actual callable object.

Our algorithm variants depend on launch policies, however. For example, asynchronous copying (dash::copy_async) and blocking copying (dash::copy) are independent interfaces, and I don't think there is a practicable alternative that is correct and efficient in the general case. Implementing blocking and deferred variants as wrappers of a single asynchronous implementation seems sensible, but in fact it hurts performance pretty badly.

I sketched out some alternatives, like:

  • Grouping algorithms by launch policy in namespaces, like dash::async::copy. Does not allow combination of policies, so it's not an option.
  • Meta-magic: dash::async could just replace callables depending on the launch policy. For example, passing dash::copy with dash::launch::async would invoke dash::copy_async instead. Nearly impossible to maintain and extremely irritating as the executed task is different from the specified callable.

So far, my initial approach based on phantom types and container proxy types looks like the best option to me. Instead of namespaces or function template specializations, algorithm variants are realized as overloads selected by iterator traits.

An iterator for async operations would have dash::iterator_traits<It>::async_policy_tag specified. Containers like dash::Array use adapters or proxies to convert iterators between policies. This is already implemented in dash::Array<T>::async and allows usage like this:

dash::Array<X> array(size);

// Default:
auto copy_end = dash::copy(array.begin(), array.end(), dest);
// -> dash::Array<X>::iterator

// Async:
auto copy_end = dash::copy(array.async.begin(), array.async.end(), dest);
// current implementation:
// -> dash::GlobAsyncIter<T>
// will be:
// -> dash::Future<dash::Array<X>::iterator>

// Lazy:
auto copy_end = dash::copy(array.deferred.begin(), array.deferred.end(), dest);
// -> dash::Array<X>::iterator

This is very close to the STL specs and there is no way to get it wrong for developers. No maintenance effort in algorithms at all, no coupling of concerns, much rejoice.

But also, it is important to note that we now can actually define dash::async and dash::launch like in STL, and replace iterator parameters automatically (but, different from replacing the callable parameter as described above, with robust semantics):

// Developer writes:
auto fut_copy_end = dash::async(
             dash::launch::async,
             dash::copy<Foo>,
             array.begin(),
             array.end());
// --> resolves to:
auto fut_copy_end = dash::async(
             dash::launch::async,
             dash::copy<Foo>,
             array.begin().async,  // same as array.async.begin()
             array.end().async);   // same as array.async.end()

This is semantically and conceptually robust:

  • GlobIter<T, ..., dash::launch::async> satisfies GlobIter<T, ..., dash::launch::any> and
  • The developer specified the launch policy explicitly, so the principle of least astonishment is respected: the behavior and signature correspond to those of std::async.

So long story short: there is a chance we can have our cake and eat it, too.

In the current implementation in branch development, array.async is a fully specialized proxy accessor with dash::ArrayAsyncRef<T>::iterator = GlobAsyncIter<T>. I added those as a proof of concept; actually, these specializations are not necessary for every policy, we can use traits as well. This is preferable, as otherwise we would have to define a fully specialized type for every combination of launch policies, and adding a policy would be the opposite of fun.

Looking forward to your thoughts!

fuchsto avatar Feb 25 '17 09:02 fuchsto

Thanks for this brain dump! I think I get the idea.

On Execution Policies I think this is clear and it helps to get rid of some of the confusion we currently have, e.g., which unit receives the results of a dash::accumulate call. One thing to note, though:

unit-level parallelization, the vanilla STL execution policies, e.g. parallel would allow units to spawn OpenMP threads

Keep in mind that DART will eventually have its own thread pool independent of OpenMP. OpenMP is just too restrictive to use as a general-purpose task scheduler (unfortunately). Executing tasks in DART and OpenMP in parallel will likely lead to contention and performance degradation.

On Launch Policies Just to put this into perspective of general tasking in DASH: calling dash::async on dash::copy with dash::launch::async will convert array.begin() to array.begin().async, which issues an asynchronous copy and the returned dash::Future can be used to wait for completion (through a flush), right?

Now, how is this implemented for other DASH algorithms? Do we provide overrides for them as well? For example, combining dash::launch::async with the collective execution policy sounds like a bad idea...

My idea of tasking in DASH currently is that calling dash::async with dash::launch::async specified on any other function will result in an asynchronously executed action, i.e., a task that is scheduled through DART to be executed at some point in time. Again, the dash::future is used to wait for completion. The DART side of this is pretty much there (although maybe not extensively tested).

There are two concepts that I would like to raise awareness of:

  1. Task dependencies I've been working on a way to define an ordering of tasks through data dependencies in DART by specifying arbitrary memory locations in global memory (similar to OpenMP task data dependencies). I haven't thought much about the DASH abstraction, although I think dash::async would suit it well. This would require dash::async to optionally accept dependency definitions.

  2. Fire-and-forget or Bulk Tasks Similar to OpenMP tasks, users may want to define tasks without explicitly waiting for them individually. Instead, tasks are defined in bulk and waited for by a single statement. This is almost necessary if you want to taskify one or multiple iterations of your algorithm. We could either introduce a launch policy for dash::async for this type of tasks (returning an empty dash::Future) or have a separate abstraction for it (such as dash::Task::create and dash::Task::complete). I'm OK with both, I just want to make sure we are covering this use-case.

We should definitely discuss this in Garching. I think we are arriving at a state where the tasking in DART is ready to be made available (though maybe not in its full feature-set). I've been holding it back until after the (ominous) next release so far.

devreal avatar Feb 28 '17 07:02 devreal

@devreal

Keep in mind that DART will eventually have its own thread pool independent of OpenMP

Of course! Just using OpenMP as a reference, but we want light-weight tasks, of course.

My idea of tasking in DASH currently is that calling dash::async with dash::launch::async specified on any other function will result in an asynchronously executed action

That's not a contradiction: the conversion to array.begin().async is internal, the visible part of the interface is just the return type dash::Future<T> which is also what tasks would return. It's definitely important to support "real" task models like yours, I kept that in mind in my proposal.

But on the other hand, how would you replace the current dash::copy_async by tasks? Asynchronous one-sided MPI will be impossible to beat in efficiency, wrapping a blocking dash::copy in a task would not help.

I assume that eventually, the current async interface of DART will be wrapped in a proper task interface so the C++ layer could prefer tasks to abstract async operations - Still, we must be able to support these "async-specialized implementation" cases like dash::copy_async, too.

We need both, and I think / hope the proposed interface could do.

fuchsto avatar Feb 28 '17 08:02 fuchsto

But on the other hand, how would you replace the current dash::copy_async by tasks?

Not at all, it does not make sense to implement dash::copy_async in terms of tasks as the time is mostly spent in data transfer. I was just wondering about the other DASH algorithms :)

We need both, and I think / hope the proposed interface could do.

Absolutely, I just wanted to make sure we are on the same page here. I will present my current status in Garching and based on that we can discuss the details of the tasking abstraction in its full beauty :)

devreal avatar Feb 28 '17 08:02 devreal

Ah, I got a good feeling about this.

Ehhmm, when exactly is the SPPEXA name-dancing in Garching again?

fuchsto avatar Feb 28 '17 09:02 fuchsto

Ah, here we go:

http://www.sppexa.de/sppexa-activities/annual-plenary-meeting/2017.html

That's a good time frame. Did you happen to review HPX and Kokkos (also: Alpaka), yet? I'm about to make DASH views and Kokkos views fancy each other.

fuchsto avatar Feb 28 '17 09:02 fuchsto

I had a look at Kokkos (see my comments in #193) and will attend a workshop on HPX here at HLRS next week. I haven't looked at Alpaka yet. Could that be relevant as well?

devreal avatar Feb 28 '17 09:02 devreal

Yes, Alpaka is relevant as well. I'm just here at the Alpaka Kickoff meeting, will report back tomorrow.

fuerlinger avatar Feb 28 '17 09:02 fuerlinger

@fuerlinger Gotcha. If you can provide us with some material I'll be happy to look at it before the Garching meeting.

devreal avatar Feb 28 '17 09:02 devreal

Can we get this into 0.3.0? If not, I would start working on a quick fix for dash::accumulate and document the deficiencies so we can close #216.

devreal avatar Mar 29 '17 11:03 devreal

@devreal I can get this into 0.3.0 this week.

fuchsto avatar Apr 24 '17 12:04 fuchsto

@fuchsto May I hijack this PR to add the few lines needed for Alpaka's for_each and maybe a transform example? It currently does not rely on these changes, but it wouldn't be much work to base the MEPHISTO stuff (in the MEPHISTO repo) on dash/include/dash/Execution.h

pascalj avatar May 03 '18 07:05 pascalj