celerity-runtime
Add celerity blockchain for task divergence checking
This pull request adds a divergence checking mechanism for tasks.
It does so by periodically gathering the hashes of all tasks from task_recording and comparing them across nodes. When a divergence is detected, an error listing the diverged tasks and their full task records is printed, for example:
[2023-10-02 17:31:07.784] [error] Divergence detected in task graph at index 1:
0x471b0f1db5e4b8e6 on nodes 1
0xe9fbff654e3748e1 on nodes 0
[2023-10-02 17:31:07.784] [error] Task record for hash 0x471b0f1db5e4b8e6:
id: 1, debug_name: task_b_4, type: device-compute, cgid: 0
geometry:
dimensions: 2, global_size: [1,1,1], global_offset: [0,0,0], granularity: [1,1,1]
accesses:
bid: 0, buffer_name: , mode: R, req: {[64,0,0] - [128,1,1]}
dependencies:
node: 0, kind: true-dep, origin: last-epoch
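The comparison step described above can be illustrated with a minimal sketch. It is not the code from this pull request: `chain_hash`, `hash_task_records`, `divergence` and `find_first_divergence` are hypothetical names, task records are stood in for by plain strings, and the use of FNV-1a chained over the previous hash is an assumption made only for illustration. The sketch also assumes the per-node hash lists have already been gathered onto one node.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using task_hash = std::uint64_t;

// Hypothetical helper: FNV-1a over a serialized task record, seeded with the
// previous hash so that every entry depends on the full history before it.
inline task_hash chain_hash(task_hash previous, const std::string& record) {
    task_hash h = 14695981039346656037ull ^ previous;
    for (unsigned char c : record) {
        h ^= c;
        h *= 1099511628211ull;
    }
    return h;
}

// Turns a node's sequence of serialized task records into a hash chain.
inline std::vector<task_hash> hash_task_records(const std::vector<std::string>& records) {
    std::vector<task_hash> chain;
    chain.reserve(records.size());
    task_hash previous = 0;
    for (const auto& record : records) {
        previous = chain_hash(previous, record);
        chain.push_back(previous);
    }
    return chain;
}

// Result of a comparison: the first diverging index and, for each distinct
// hash seen at that index, the nodes that reported it.
struct divergence {
    std::size_t index;
    std::map<task_hash, std::vector<std::size_t>> nodes_by_hash;
};

// Compares the chains of all nodes (chains[n] belongs to node n). Only the
// prefix that every node has already produced is compared, since nodes may
// have progressed by different amounts when the hashes are gathered.
inline std::optional<divergence> find_first_divergence(const std::vector<std::vector<task_hash>>& chains) {
    assert(!chains.empty());
    std::size_t common = chains.front().size();
    for (const auto& chain : chains) common = std::min(common, chain.size());
    for (std::size_t i = 0; i < common; ++i) {
        std::map<task_hash, std::vector<std::size_t>> groups;
        for (std::size_t node = 0; node < chains.size(); ++node) groups[chains[node][i]].push_back(node);
        if (groups.size() > 1) return divergence{i, std::move(groups)};
    }
    return std::nullopt;
}
```

Because each hash is seeded with its predecessor in this sketch, every entry after the first mismatch differs as well, so only the first diverging index is worth reporting. Grouping nodes by hash mirrors the shape of the error message above, with one line per distinct hash and the nodes that produced it.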
Additionally, it includes rudimentary deadlock detection: if nodes appear to be stuck, a warning is printed after a given amount of time (e.g. 10 seconds):
[warning] After 10 seconds of waiting nodes 1, did not move to the next task. The runtime might be stuck.
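A timeout-based check of this kind could look roughly like the following sketch. The class name `stall_monitor` and its interface are hypothetical and not taken from this pull request. Time points are passed in explicitly so the logic can be exercised without actually waiting.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper, not the runtime's actual implementation: tracks when
// progress was last observed and reports once per stall when the timeout
// has elapsed without any progress.
class stall_monitor {
  public:
    using clock = std::chrono::steady_clock;

    stall_monitor(std::chrono::seconds timeout, clock::time_point start)
        : m_timeout(timeout), m_last_progress(start) {}

    // Call whenever the observed nodes move on to the next task.
    void on_progress(clock::time_point now) {
        m_last_progress = now;
        m_warned = false;
    }

    // Returns true exactly once per stall, when the timeout is first exceeded.
    bool should_warn(clock::time_point now) {
        if (!m_warned && now - m_last_progress >= m_timeout) {
            m_warned = true;
            return true;
        }
        return false;
    }

  private:
    std::chrono::seconds m_timeout;
    clock::time_point m_last_progress;
    bool m_warned = false;
};
```

A check like this cannot distinguish a genuine deadlock from a task that legitimately takes longer than the timeout, so it can only ever produce a warning that the runtime might be stuck.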
All of this is automatically turned on by running the program with task recording enabled.
Check-perf-impact results: (5a19ced85f862a00d0114dd241122462)
:question: No new benchmark data submitted. :question:
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.
Okay, so as discussed offline, we won't include this in 0.5.0, as it needs another revision. The main points:
- Deadlock detection as-is would produce too many false-positive warnings; not sure yet how to proceed on this.
- The testing infrastructure invokes UB; it needs multi-threading to properly mock blocking collective operations.
- We should have a test case (distr_test / integration test?) that exercises the case where one node submits a task while the other does not (the divergence then occurs between that task and the shutdown epoch).