Re-initialize resources when remote tracker returns UnauthorizedRequestError

Open jiyuanq opened this issue 3 years ago • 2 comments
🚀 Feature

When the remote tracker is restarted or redeployed, all of its state is lost, and the client gets UnauthorizedRequestError if a training run is in progress. The client should be able to negotiate with the new server and re-initialize its remote resources when that happens, so that logging for the current run continues without interruption.
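
To make the requested behavior concrete, here is a minimal Python sketch of the idea. It does not use aim's actual client API: `FakeTracker`, `ResilientClient`, `create_resource`, and the local `UnauthorizedRequestError` class are all hypothetical stand-ins, assuming only that the server rejects handles it no longer knows about.

```python
class UnauthorizedRequestError(Exception):
    """Stand-in for the error the client sees after the tracker restarts."""


class FakeTracker:
    """Hypothetical in-memory server; a restart forgets every registered handle."""

    def __init__(self):
        self.resources = {}
        self._next_handle = 0

    def create_resource(self, run_name):
        # Handles are never reused, so a stale handle stays invalid after a restart.
        self._next_handle += 1
        handle = f"h{self._next_handle}"
        self.resources[handle] = {"run": run_name, "records": []}
        return handle

    def track(self, handle, name, value):
        if handle not in self.resources:
            raise UnauthorizedRequestError(handle)
        self.resources[handle]["records"].append((name, value))

    def restart(self):
        # Simulates a redeploy: all server-side state is dropped.
        self.resources = {}


class ResilientClient:
    """Hypothetical client that re-registers its resource when the server rejects it."""

    def __init__(self, tracker, run_name, max_retries=1):
        self.tracker = tracker
        self.run_name = run_name
        self.max_retries = max_retries
        self.handle = tracker.create_resource(run_name)

    def track(self, name, value):
        for attempt in range(self.max_retries + 1):
            try:
                self.tracker.track(self.handle, name, value)
                return True
            except UnauthorizedRequestError:
                if attempt == self.max_retries:
                    break
                # The server lost our state: register again, then retry.
                self.handle = self.tracker.create_resource(self.run_name)
        # Give up on this record, but never raise into the training loop.
        return False
```

In this sketch, a `track` call made after `tracker.restart()` gets a fresh handle and succeeds on the retry, and if re-initialization is not allowed (`max_retries=0`) the call returns `False` instead of raising, so the training loop is not blocked.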

Motivation

When we use a remote tracker, it is not uncommon to have to restart or redeploy it, or for it to die without notice. When that happens, logging is interrupted, which can block the training run from finishing. Such cases need to be handled; at the very least, training runs should not be blocked.

jiyuanq avatar Sep 01 '22 03:09 jiyuanq

Hey @jiyuanq! Thanks for the request. It makes total sense, and it was already in our plans; we just haven't had the time to implement it yet. We'll prioritize this for the next minor release (aim v3.14.0).

mihran113 avatar Sep 01 '22 10:09 mihran113

Great to know! Looking forward to it!

jiyuanq avatar Sep 01 '22 12:09 jiyuanq

Hey @jiyuanq! The re-initialization of resources feature has been shipped with version 3.15.0. Please try it out and let us know if everything works as expected.

mihran113 avatar Dec 05 '22 18:12 mihran113

Closing this issue, as the enhancement has shipped.

gorarakelyan avatar Feb 10 '23 12:02 gorarakelyan