Add celerity blockchain for task divergence checking

Open GagaLP opened this issue 2 years ago • 4 comments

This pull request adds a divergence checking mechanism for tasks.

It does so by periodically gathering the hashes of all tasks from task_recording on every node and comparing them. When a divergence is detected, an error listing the diverged task hashes and the full task record for each is printed, for example:

[2023-10-02 17:31:07.784] [error] Divergence detected in task graph at index 1:

0x471b0f1db5e4b8e6 on nodes 1 
0xe9fbff654e3748e1 on nodes 0 

[2023-10-02 17:31:07.784] [error] Task record for hash 0x471b0f1db5e4b8e6:

id: 1, debug_name: task_b_4, type: device-compute, cgid: 0
geometry:
         dimensions: 2, global_size: [1,1,1], global_offset: [0,0,0], granularity: [1,1,1]
accesses: 
         bid: 0, buffer_name: , mode: R, req: {[64,0,0] - [128,1,1]}
dependencies: 
         node: 0, kind: true-dep, origin: last-epoch
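
As a rough illustration of the hash comparison described above, the following minimal C++ sketch groups node ids by the hash each node reported for one task index and produces a report when more than one distinct hash shows up. It is not the code from this pull request: `group_nodes_by_hash` and `check_divergence` are hypothetical names, and the sketch assumes the per-node hashes have already been gathered into a single vector indexed by node id.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper (not part of Celerity): maps each distinct hash to the
// list of node ids that reported it. hash_per_node[i] is the hash from node i.
std::map<std::uint64_t, std::vector<std::size_t>> group_nodes_by_hash(const std::vector<std::uint64_t>& hash_per_node) {
    std::map<std::uint64_t, std::vector<std::size_t>> groups;
    for (std::size_t node = 0; node < hash_per_node.size(); ++node) {
        groups[hash_per_node[node]].push_back(node);
    }
    return groups;
}

// Returns an empty string when all nodes agree on the hash for this task
// index, otherwise a multi-line report with one line per distinct hash.
std::string check_divergence(std::size_t task_index, const std::vector<std::uint64_t>& hash_per_node) {
    const auto groups = group_nodes_by_hash(hash_per_node);
    if (groups.size() <= 1) return {};

    std::string report = "Divergence detected in task graph at index " + std::to_string(task_index) + ":\n";
    for (const auto& [hash, nodes] : groups) {
        char buf[32];
        std::snprintf(buf, sizeof(buf), "0x%016llx", static_cast<unsigned long long>(hash));
        report += std::string(buf) + " on nodes";
        for (const std::size_t n : nodes) {
            report += " " + std::to_string(n);
        }
        report += "\n";
    }
    return report;
}
```

In a real distributed run the hashes would have to be exchanged between nodes first (for instance with a collective gather); that step is left out here on purpose.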

It also includes rudimentary deadlock detection: if some nodes have not moved on to the next task after a given amount of time (e.g. 10 seconds), a warning is printed:

[warning] After 10 seconds of waiting nodes 1, did not move to the next task. The runtime might be stuck.
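
A timeout-based warning of this kind could look roughly like the sketch below. This is an assumption-laden illustration, not the implementation in this pull request: `stall_watchdog` and its members are hypothetical names, time points are passed in explicitly so the logic can be exercised without actually waiting, and the message wording only approximates the log line above.

```cpp
#include <chrono>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

using sketch_clock = std::chrono::steady_clock;

// Hypothetical watchdog (not part of Celerity): warns once when the nodes we
// are waiting on have made no progress for at least `timeout`.
class stall_watchdog {
  public:
    stall_watchdog(std::chrono::seconds timeout, sketch_clock::time_point start)
        : m_timeout(timeout), m_wait_start(start) {}

    // Call whenever the awaited nodes advance to the next task.
    void notify_progress(sketch_clock::time_point now) {
        m_wait_start = now;
        m_warned = false;
    }

    // Returns a warning message at most once per stall, otherwise nullopt.
    std::optional<std::string> poll(sketch_clock::time_point now, const std::vector<std::size_t>& waiting_on) {
        if (m_warned || waiting_on.empty() || now - m_wait_start < m_timeout) return std::nullopt;
        m_warned = true;
        std::string msg = "After " + std::to_string(m_timeout.count()) + " seconds of waiting, nodes";
        for (const std::size_t n : waiting_on) {
            msg += " " + std::to_string(n);
        }
        msg += " did not move to the next task. The runtime might be stuck.";
        return msg;
    }

  private:
    std::chrono::seconds m_timeout;
    sketch_clock::time_point m_wait_start;
    bool m_warned = false;
};
```

Warning only once per stall keeps the log readable, but a pure timeout cannot tell a slow task from a real deadlock, so false positives remain possible with any fixed threshold.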

All of this is automatically turned on by running the program with task recording enabled.

GagaLP avatar Oct 02 '23 15:10 GagaLP

Check-perf-impact results: (5a19ced85f862a00d0114dd241122462)

:question: No new benchmark data submitted. :question:
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

github-actions[bot] avatar Oct 02 '23 15:10 github-actions[bot]

Check-perf-impact results: (3b34e58e3c100f4c3541a1ed59580f72)

:question: No new benchmark data submitted. :question:
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

github-actions[bot] avatar Nov 27 '23 16:11 github-actions[bot]

Check-perf-impact results: (4c65f1399a47e0eb1340f63004745b17)

:question: No new benchmark data submitted. :question:
Please re-run the microbenchmarks and include the results if your commit could potentially affect performance.

github-actions[bot] avatar Dec 06 '23 14:12 github-actions[bot]

Okay so as discussed offline, we won't include this in 0.5.0 as it needs another revision. The main points:

  • Deadlock detection as-is would produce too many false-positive warnings; not sure yet how to proceed on this.
  • Testing infrastructure invokes UB; needs multi-threading to properly mock blocking collective operations.
  • We should have a test case (distr_test / integration test?) that exercises the case where one node submits a task while the other does not (the divergence then occurs between that task and the shutdown epoch).

psalz avatar Dec 20 '23 13:12 psalz