celerity-runtime Add celerity blockchain for task divergence checking

This pull request adds a divergence checking mechanism for tasks.

It does so by periodically gathering the hashes of all tasks from task_recording and comparing them across nodes. When a divergence is detected, an error listing the diverged task hashes and their full task records is printed, for example:

[2023-10-02 17:31:07.784] [error] Divergence detected in task graph at index 1:

0x471b0f1db5e4b8e6 on nodes 1 
0xe9fbff654e3748e1 on nodes 0 

[2023-10-02 17:31:07.784] [error] Task record for hash 0x471b0f1db5e4b8e6:

id: 1, debug_name: task_b_4, type: device-compute, cgid: 0
geometry:
         dimensions: 2, global_size: [1,1,1], global_offset: [0,0,0], granularity: [1,1,1]
accesses: 
         bid: 0, buffer_name: , mode: R, req: {[64,0,0] - [128,1,1]}
dependencies: 
         node: 0, kind: true-dep, origin: last-epoch
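
To illustrate the comparison step, here is a minimal sketch and not the code from this PR. It assumes each node's task hashes have already been gathered into one list per node; the names `task_hash`, `divergence` and `find_divergence` are hypothetical. The sketch reports the first index at which the nodes disagree and groups the node ids by the hash they reported, mirroring the error output above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; not part of the Celerity API.
using task_hash = std::uint64_t;

struct divergence {
    std::size_t index = 0;                                       // first task index where nodes disagree
    std::map<task_hash, std::vector<std::size_t>> nodes_by_hash; // hash -> ids of the nodes that reported it
};

// per_node_hashes[n][i] is the hash node n computed for its i-th task.
// Assumption: the lists were already exchanged between nodes (e.g. by a collective operation).
// Returns true and fills `out` if any two nodes disagree within the common prefix of all lists.
bool find_divergence(const std::vector<std::vector<task_hash>>& per_node_hashes, divergence& out) {
    if(per_node_hashes.empty()) return false;

    // Only compare indices that every node has already reached.
    std::size_t common = per_node_hashes.front().size();
    for(const auto& hashes : per_node_hashes) {
        common = std::min(common, hashes.size());
    }

    for(std::size_t i = 0; i < common; ++i) {
        std::map<task_hash, std::vector<std::size_t>> groups;
        for(std::size_t n = 0; n < per_node_hashes.size(); ++n) {
            groups[per_node_hashes[n][i]].push_back(n);
        }
        if(groups.size() > 1) { // more than one distinct hash at this index: divergence
            out.index = i;
            out.nodes_by_hash = std::move(groups);
            return true;
        }
    }
    return false;
}
```

Because only the common prefix is compared, this sketch would not by itself flag a node that has submitted more tasks than the others; such a difference only shows up once the remaining nodes produce a task at that index.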

Additionally, it includes rudimentary deadlock detection: if nodes are stuck, a warning is printed after a given amount of time (e.g. 10 seconds):

[warning] After 10 seconds of waiting nodes 1, did not move to the next task. The runtime might be stuck.
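
The timeout check behind such a warning could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's implementation: `progress_watchdog` and its member functions are hypothetical names, and time points are passed in explicitly so the logic can be exercised without waiting.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

using clock_type = std::chrono::steady_clock;

// Hypothetical helper for illustration only; not part of the Celerity API.
class progress_watchdog {
  public:
    progress_watchdog(std::size_t num_nodes, clock_type::duration timeout, clock_type::time_point start)
        : m_timeout(timeout), m_last_progress(num_nodes, start) {}

    // Record that `node` moved on to its next task at time `now`.
    void notify_progress(std::size_t node, clock_type::time_point now) { m_last_progress.at(node) = now; }

    // Returns the ids of all nodes whose last recorded progress lies at least `timeout` before `now`.
    std::vector<std::size_t> stuck_nodes(clock_type::time_point now) const {
        std::vector<std::size_t> stuck;
        for(std::size_t n = 0; n < m_last_progress.size(); ++n) {
            if(now - m_last_progress[n] >= m_timeout) stuck.push_back(n);
        }
        return stuck;
    }

  private:
    clock_type::duration m_timeout;
    std::vector<clock_type::time_point> m_last_progress;
};
```

A caller would construct the watchdog with a 10-second timeout, call `notify_progress` whenever a node advances, and print the warning whenever `stuck_nodes` returns a non-empty list. A purely time-based check like this cannot tell a deadlocked node from one that is busy with a long-running task, so it can produce false positives.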

All of this is automatically turned on by running the program with task recording enabled.

Oct 02 '23 15:10 GagaLP

Check-perf-impact results: (5a19ced85f862a00d0114dd241122462)

:question: No new benchmark data submitted. :question:
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

Oct 02 '23 15:10 github-actions[bot]

Check-perf-impact results: (3b34e58e3c100f4c3541a1ed59580f72)

:question: No new benchmark data submitted. :question:
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

Nov 27 '23 16:11 github-actions[bot]

Check-perf-impact results: (4c65f1399a47e0eb1340f63004745b17)

:question: No new benchmark data submitted. :question:
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

Dec 06 '23 14:12 github-actions[bot]

Okay so as discussed offline, we won't include this in 0.5.0 as it needs another revision. The main points:

Deadlock detection as-is would produce too many false-positive warnings; not sure yet how to proceed on this.
Testing infrastructure invokes UB; it needs multi-threading to properly mock blocking collective operations.
We should have a test case (distr_test / integration test?) that exercises the case where one node submits a task while the other does not (the divergence then occurs between that task and the shutdown epoch).

Dec 20 '23 13:12 psalz
