                        Capture buffer and host-object data on synchronization points
Currently, there is no native way to get results out of a Celerity buffer and back into the main thread at the end of a Celerity program. The current workaround is celerity::allow_by_ref and reference captures in host tasks, which is inelegant and error-prone.
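To illustrate, here is a minimal sketch of that workaround as it typically looks today (assuming the existing allow_by_ref overload of submit; the flag name and result value are illustrative):

#include <celerity.h>

int main() {
    celerity::distr_queue q;
    bool verification_passed = false; // main-thread local, mutated from a host task

    // allow_by_ref opts into reference captures, which is easy to get wrong:
    // nothing ties the lifetime of verification_passed to the asynchronous host task.
    q.submit(celerity::allow_by_ref, [&](celerity::handler &cgh) {
        cgh.host_task(celerity::on_master_node, [&] {
            verification_passed = true; // illustrative result
        });
    });

    q.slow_full_sync(); // without this barrier, reading the flag below is a data race
    return verification_passed ? 0 : 1;
}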
SYCL offers host_accessor and explicit copy operations together with awaitable kernel events for that purpose. This is not a good fit for Celerity, since stalling the main thread is orders of magnitude more expensive in the distributed case.
This PR introduces Captures, a declarative API that allows requesting host objects and buffer subranges from distr_queue::slow_full_sync() and from runtime shutdown via the new distr_queue::drain() function. Specifying a capture adds the necessary dependencies and data transfers to the associated Epoch command and copies or moves the data out to the calling main thread once the epoch is reached, confining stalls to APIs which the user already expects to be slow.
Example
This example drains the queue on program exit to receive a verification result in the main thread.
#include <celerity.h>

using namespace celerity;

int main() {
    distr_queue q;
    host_object<bool> verification_passed;
    q.submit([=](handler &cgh) {
        side_effect verify{verification_passed, cgh};
        cgh.host_task(on_master_node, [=] {
            *verify = ...; // compute the actual verification result here
        });
    });
    // drain() shuts down the queue and moves the captured value to the main thread
    return q.drain(capture{verification_passed}) ? 0 : 1;
}
clang-tidy review says "All clean, LGTM! :+1:"
The CI failure on dpcpp:HEAD is due to a bug in Clang 15 and not related to this PR.
As discussed in person, the std::in_place constructor of host_object currently forwards to the initializer-list constructor of T, if one exists, which is unexpected.
I have replaced the "universal" initializer syntax with the ()-syntax for non-aggregate types in places related to captures / host objects.
I don't fully understand why the capture object exists, or rather, why can't we pass host objects / buffers into drain directly? Is it only for specifying the buffer range?
It is about the buffer subranges, and also to clarify the meaning of arguments to slow_full_sync in user code.
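A minimal sketch of what that enables, assuming the capture wrapper accepts a subrange as described in the PR text (the exact constructor signature is an assumption):

celerity::distr_queue q;
celerity::buffer<float, 1> results(1024);
// ... submit kernels that write to results ...

// Request only the first 16 elements at the barrier; a bare buffer argument
// could not express this restriction.
auto head = q.slow_full_sync(celerity::capture{results, celerity::subrange<1>{0, 16}});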
I'm not sure about the naming of drain. To me this doesn't really imply that this is the last operation I'm allowed to do on a queue, essentially destroying it. It sounds more like a different kind of sync, imo.
IMO this is a pretty common term for "waiting for things to finish" (e.g. SLURM uses it to mark nodes that are not accepting more work in order to shut down).
It is also not obvious in my opinion that slow_full_sync produces a copy, while drain moves the object. Didn't an earlier version of this patch require host_object to be moved into drain?
I agree that the distinction is subtle and can be confusing. It also does not serve the use case of capturing a non-copyable object at a slow_full_sync barrier. It could be worth investigating having the copy/move distinction on the capture level instead, e.g. through an additional capture_by_move wrapper (this consideration only applies to host objects anyway, since buffer data is always trivially copyable).
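A hypothetical sketch of that alternative (capture_by_move does not exist in this PR; name and semantics are the ones proposed above):

#include <memory>

celerity::host_object<std::unique_ptr<int>> owner;
// ... host tasks populate the owned value ...

// auto copied = q.slow_full_sync(celerity::capture{owner});       // would be ill-formed: unique_ptr is not copyable
auto moved = q.slow_full_sync(celerity::capture_by_move{owner});   // moves the value out at the barrier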
One thing that's "pretty" about the current solution is that a capture-on-drain does not introduce an additional shutdown-epoch after the capture-epoch. This is going to be pretty irrelevant from a performance perspective though.
I'm also still concerned that the semantic difference between host objects and buffers will be confusing to people (one is local to each worker, the other distributed). This difference existed before, but is now exacerbated by the fact that capturing a buffer returns the same data on all workers, whereas capturing a host object does (potentially) not.
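A sketch of the asymmetry being described (assuming slow_full_sync returns the captured data):

celerity::distr_queue q;
celerity::buffer<int, 1> buf(100);
celerity::host_object<int> obj;
// ... kernels write buf; host tasks mutate obj on whichever nodes run them ...

auto buf_copy = q.slow_full_sync(celerity::capture{buf}); // identical on every worker: buffer contents are globally coherent
auto obj_copy = q.slow_full_sync(celerity::capture{obj}); // potentially different per worker: each node owns its own instance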
I still agree on this, although I would prefer to revisit host objects after merging this (we're talking about experimental features anyway).
Offline discussion results:
- have a move_capture and capture (copy) [ bikeshed name @psalz ]
- get rid of drain entirely
- in the future, implement fence which does not require a full barrier / epoch (see the sketch below)
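For illustration, a hypothetical shape such a fence could take (none of these names exist yet; a future-based return type is one plausible design):

celerity::distr_queue q;
celerity::buffer<int, 1> buf(100);
// ... submit kernels ...

auto pending = q.fence(celerity::capture{buf}); // assumed to return a std::future-like handle
const auto data = pending.get();                // only the consumer blocks; no global epoch / barrier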
Superseded by #151.