Re-initialize resources when remote tracker returns UnauthorizedRequestError
🚀 Feature
When the remote tracker is restarted or redeployed, all of its state is lost, and a client with an ongoing training run gets UnauthorizedRequestError. When that happens, the client should be able to negotiate with the new server and re-initialize its remote resources, so that logging for the current run continues without interruption.
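To make the requested behavior concrete, here is a minimal sketch of the retry-and-re-initialize pattern. It does not use Aim's actual client API: `FakeTracker`, `ResilientClient`, and the local `UnauthorizedRequestError` class are hypothetical stand-ins, and the assumption is that the server rejects any resource handle it did not issue since its last restart.

```python
class UnauthorizedRequestError(Exception):
    """Stand-in for the error a client sees after the tracker loses its state."""


class FakeTracker:
    """Hypothetical in-memory server that forgets all resource handles on restart."""

    def __init__(self):
        self.handles = set()
        self.next_handle = 0

    def restart(self):
        # Simulates a restart/redeploy: every previously issued handle is gone.
        self.handles.clear()

    def create_resource(self):
        # The counter is never reset, so a stale handle cannot collide with a new one.
        self.next_handle += 1
        self.handles.add(self.next_handle)
        return self.next_handle

    def track(self, handle, name, value):
        if handle not in self.handles:
            raise UnauthorizedRequestError(f"unknown handle {handle}")
        return (name, value)


class ResilientClient:
    """Hypothetical client that re-creates its remote resource and retries."""

    def __init__(self, server):
        self.server = server
        self.handle = server.create_resource()
        self.reinit_count = 0

    def track(self, name, value, max_retries=1):
        for attempt in range(max_retries + 1):
            try:
                return self.server.track(self.handle, name, value)
            except UnauthorizedRequestError:
                if attempt == max_retries:
                    raise  # give up after the allowed number of re-initializations
                # Negotiate a fresh resource with the (new) server, then retry.
                self.handle = self.server.create_resource()
                self.reinit_count += 1


server = FakeTracker()
client = ResilientClient(server)
client.track("loss", 0.5)
server.restart()           # the client's handle is now stale
client.track("loss", 0.4)  # succeeds after one transparent re-initialization
```

Bounding the retries keeps a permanently unreachable or misconfigured server from turning into an infinite loop; a real client would also need to decide what to do with values logged while the server was down.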
Motivation
When using the remote tracker, it is not uncommon to need to restart or redeploy it, or for it to die without notice. When that happens, logging is interrupted, which may block the training run from finishing. These cases need to be handled; at the very least, training runs should not be blocked.
Hey @jiyuanq! Thanks for the request.
It makes total sense, and it was already in our plans; we just haven't had the time to implement it yet.
We'll prioritize this for the next minor release (aim v3.14.0).
Great to know! Looking forward to it!
Hey @jiyuanq! The resource re-initialization feature has shipped in version 3.15.0.
Please try it out and let us know if everything works as expected.
Closing this issue, as the enhancement has shipped.