
Test Reconnecting Nodes

Open ayushr2 opened this issue 5 years ago • 2 comments

We should test that when a node reconnects, its state is preserved and it can continue from where it left off.

ayushr2 avatar Jan 29 '20 04:01 ayushr2

There is one scenario in the websocket version of the protocol that's currently problematic:

When the API restarts, graders are interrupted with a disconnection exception. If a grader has an ongoing job during the restart, the API assumes the grader is still working on the original job (since in the HTTP protocol a grader would continue the job and submit its result in this case). As a result, the grader's "running" flag is not cleared in the API once the restart is done.

I think we should probably:

  1. add a reconnection mechanism to the websocket version of the grader instead of relying entirely on docker to restart it (and try to preserve the status of an ongoing job for as long as possible)
  2. add some kind of draining mechanism to the API that temporarily blocks incoming grading requests when we expect a restart (e.g. when deploying a new version)
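
For point 1, a rough sketch of what the grader-side retry loop might look like. This is not the existing grader code: `GraderSession`, `reconnect_with_backoff`, and the `connect` callback are hypothetical names, and `connect` stands in for the real websocket handshake. The idea is only that the job state lives outside the connection, so it survives a disconnect and can be reported to the API on reconnect.

```python
import time
from typing import Callable, Optional


class GraderSession:
    """Hypothetical grader-side state that should survive a reconnect."""

    def __init__(self, grader_id: str) -> None:
        self.grader_id = grader_id
        self.current_job_id: Optional[str] = None  # job in progress, if any


def reconnect_with_backoff(
    session: GraderSession,
    connect: Callable[[GraderSession], None],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry `connect` with exponential backoff; return True on success.

    `connect` is a placeholder for the real handshake. It receives the
    session so it can tell the API about `current_job_id`, which would let
    the API keep or clear the grader's "running" flag correctly.
    """
    delay = base_delay
    for attempt in range(max_attempts):
        try:
            connect(session)
            return True
        except ConnectionError:
            if attempt == max_attempts - 1:
                break  # out of attempts, do not sleep again
            sleep(delay)
            delay = min(delay * 2, max_delay)  # cap the backoff
    return False
```

The session object is never reset inside the loop, so an ongoing job is still known to the grader when the API comes back. How the API should reconcile that job on its side is left open here.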

zhengyao-lin avatar Feb 07 '20 00:02 zhengyao-lin

Nice idea. We would need to drain the queue before we can shut down the API.
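
A minimal sketch of such a drain step, assuming the API tracks in-flight grading jobs with a simple counter. `DrainGate` and its methods are made-up names for illustration, not part of the current API code.

```python
import threading
from typing import Optional


class DrainGate:
    """Hypothetical gate: reject new jobs while draining, wait for old ones."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._draining = False
        self._in_flight = 0

    def try_accept(self) -> bool:
        """Register a new grading request; refuse once draining has begun."""
        with self._cond:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def finish(self) -> None:
        """Mark one accepted request as completed."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def drain(self, timeout: Optional[float] = None) -> bool:
        """Stop accepting work and wait until in-flight jobs reach zero.

        Returns True if everything finished, False if the timeout expired.
        """
        with self._cond:
            self._draining = True
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout)
```

A deploy script could call `drain()` with a timeout and restart the API only if it returns `True`. Whether rejected requests should be queued or returned to the client with an error is a separate decision.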

ayushr2 avatar Feb 07 '20 21:02 ayushr2