                        Capture buffer and host-object data on synchronization points
Currently, there is no native way to get results out of a Celerity buffer and back into the main thread at the end of a Celerity program. The current workaround is celerity::allow_by_ref and reference captures in host tasks, which is inelegant and error-prone.
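To illustrate, here is a minimal sketch of that workaround as it typically looks today (assuming the existing allow_by_ref overload of submit; the flag name and result value are illustrative):

#include <celerity.h>

int main() {
    celerity::distr_queue q;
    bool verification_passed = false; // main-thread local, mutated from a host task

    // allow_by_ref opts into reference captures, which is easy to get wrong:
    // nothing ties the lifetime of verification_passed to the asynchronous host task.
    q.submit(celerity::allow_by_ref, [&](celerity::handler &cgh) {
        cgh.host_task(celerity::on_master_node, [&] {
            verification_passed = true; // illustrative result
        });
    });

    q.slow_full_sync(); // without this barrier, reading the flag below is a data race
    return verification_passed ? 0 : 1;
}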
SYCL offers host_accessor and explicit copy operations together with awaitable kernel events for that purpose. This is not a good fit for Celerity, since stalling the main thread is orders of magnitude more expensive in the distributed case.
This PR introduces Captures, a declarative API that allows requesting host objects and buffer subranges from distr_queue::slow_full_sync() and from runtime shutdown via the new distr_queue::drain() function. Specifying a capture adds the necessary dependencies and data transfers to the associated Epoch command and copies or moves the data out to the calling main thread once the epoch is reached, confining stalls to APIs which the user already expects to be slow.
Example
This example drains the queue on program exit to receive a verification result in the main thread.
#include <celerity.h>

using namespace celerity;

int main() {
    distr_queue q;
    host_object<bool> verification_passed;
    q.submit([=](handler &cgh) {
        side_effect verify{verification_passed, cgh};
        cgh.host_task(on_master_node, [=] {
            *verify = ...; // compute the actual verification result here
        });
    });
    // drain() shuts down the queue and moves the captured value to the main thread
    return q.drain(capture{verification_passed}) ? 0 : 1;
}
clang-tidy review says "All clean, LGTM! :+1:"
The CI failure on dpcpp:HEAD is due to a bug in Clang 15 and not related to this PR.
As discussed in person, the std::in_place constructor of host_object currently forwards to the initializer-list constructor of T, if one exists, which is unexpected.
I have replaced the "universal" initializer syntax with the ()-syntax for non-aggregate types in places related to captures / host objects.
I don't fully understand why the capture object exists, or rather, why can't we pass host objects / buffers into drain directly? Is it only for specifying the buffer range?
It is about the buffer subranges, and also to clarify the meaning of arguments to slow_full_sync in user code.
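A minimal sketch of what that enables, assuming the capture wrapper accepts a subrange as described in the PR text (the exact constructor signature is an assumption):

celerity::distr_queue q;
celerity::buffer<float, 1> results(1024);
// ... submit kernels that write to results ...

// Request only the first 16 elements at the barrier; a bare buffer argument
// could not express this restriction.
auto head = q.slow_full_sync(celerity::capture{results, celerity::subrange<1>{0, 16}});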
I'm not sure about the naming of drain. To me this doesn't really imply that this is the last operation I'm allowed to do on a queue, essentially destroying it. It sounds more like a different kind of sync, imo.
IMO this is a pretty common term for "waiting for things to finish" (e.g. SLURM uses it to mark nodes that are not accepting more work in order to shut down).
It is also not obvious in my opinion that slow_full_sync produces a copy, while drain moves the object. Didn't an earlier version of this patch require host_object to be moved into drain?
I agree that the distinction is subtle and can be confusing. It also does not serve the use case of capturing a non-copyable object at a slow_full_sync barrier. It could be worth investigating having the copy/move distinction on the capture level instead, e.g. through an additional capture_by_move wrapper (this consideration only applies to host objects anyway, since buffer data is always trivially copyable).
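A hypothetical sketch of that alternative (capture_by_move does not exist in this PR; name and semantics are the ones proposed above):

#include <memory>

celerity::host_object<std::unique_ptr<int>> owner;
// ... host tasks populate the owned value ...

// auto copied = q.slow_full_sync(celerity::capture{owner});       // would be ill-formed: unique_ptr is not copyable
auto moved = q.slow_full_sync(celerity::capture_by_move{owner});   // moves the value out at the barrier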
One thing that's "pretty" about the current solution is that a capture-on-drain does not introduce an additional shutdown-epoch after the capture-epoch. This is going to be pretty irrelevant from a performance perspective though.
I'm also still concerned that the semantic difference between host objects and buffers will be confusing to people (one is local to each worker, the other distributed). This difference existed before, but is now exacerbated by the fact that capturing a buffer returns the same data on all workers, whereas capturing a host object does (potentially) not.
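A sketch of the asymmetry being described (assuming slow_full_sync returns the captured data):

celerity::distr_queue q;
celerity::buffer<int, 1> buf(100);
celerity::host_object<int> obj;
// ... kernels write buf; host tasks mutate obj on whichever nodes run them ...

auto buf_copy = q.slow_full_sync(celerity::capture{buf}); // identical on every worker: buffer contents are globally coherent
auto obj_copy = q.slow_full_sync(celerity::capture{obj}); // potentially different per worker: each node owns its own instance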
I still agree on this, although I would prefer to revisit host objects after merging this (we're talking about experimental features anyway).
Offline discussion results:
- have a move_capture and capture (copy) [ bikeshed name @psalz ]
- get rid of drain entirely
- in the future, implement fence which does not require a full barrier / epoch (see the sketch below)
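For illustration, a hypothetical shape such a fence could take (none of these names exist yet; a future-based return type is one plausible design):

celerity::distr_queue q;
celerity::buffer<int, 1> buf(100);
// ... submit kernels ...

auto pending = q.fence(celerity::capture{buf}); // assumed to return a std::future-like handle
const auto data = pending.get();                // only the consumer blocks; no global epoch / barrier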
Superseded by #151.