Train progress not updating after some iterations
🐛 Bug
After some number of training iterations, the training progress stops updating in the UI.
[Screenshot: Aim UI]
[Screenshot: console progress]
Environment
- Aim Version 3.17.3
- Python version 3.10
- pip version 22.3.1
- OS Ubuntu 18.04 LTS
- torch 2.0.0
- pytorch lightning 2.0.1
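For reference, the failing setup roughly looks like this. This is a minimal sketch, not the actual training code: the server address, experiment name, model, and data are placeholders.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from aim.pytorch_lightning import AimLogger

class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.layer(x), y)
        self.log("train_loss", loss)  # routed through AimLogger.log_metrics
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

# Synthetic data so the sketch is self-contained.
data = TensorDataset(torch.randn(256, 8), torch.randn(256, 1))
loader = DataLoader(data, batch_size=32)

# An 'aim://host:port' repo points the logger at a remote tracking server;
# the address below is a placeholder.
aim_logger = AimLogger(repo="aim://my-aim-server:53800", experiment="repro")
trainer = pl.Trainer(max_epochs=5, logger=aim_logger, log_every_n_steps=1)
trainer.fit(TinyModel(), loader)
```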
Hey @igor-byel! Do you see any errors/warnings in the terminal running `aim up`? Is this a random issue, or does it happen consistently?
- "Do you see any errors/warnings in terminal running aim up?"- No i do not see any error in console nor Aim up nor aim server nor on the client side
-
"Is this a random issue or it happens consistently?"-No sometimes it works as needed.
-
also my guess it is somehow correlates with updating frequence and number of updates client doing
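If it does correlate with update frequency, one quick way to test that hypothesis is to lower the logging rate. `log_every_n_steps` is a standard Lightning Trainer argument; the value 50 here is arbitrary.

```python
import pytorch_lightning as pl

# Push metrics to the logger every 50 steps instead of every step,
# cutting the number of write RPCs sent to the remote Aim server.
trainer = pl.Trainer(log_every_n_steps=50)
```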
@igor-byel got it! Thanks for the additional info.
@mihran113, it seems the issue is related to remote tracking. Can you please take a look? Do you recall similar issues happening?
Hey @igor-byel! The messages on the server side indicate that some runs were terminated forcefully, or that the network was down for a long period of time. Is the client process log level set to WARNING? Could it be that client-side warnings haven't been displayed?
The warning should have looked something like this:
'Network connection between client `{}` and server `{}` appears to be absent.'
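If the client process has its log level above WARNING, that message would be silently dropped. A minimal sketch to make sure it surfaces, assuming Aim's transport code uses the standard logging module with loggers under the "aim" namespace (which the file paths in the traces suggest):

```python
import logging

# Surface WARNING-level messages from all loggers, including Aim's
# transport layer, so connectivity warnings are not swallowed.
logging.basicConfig(level=logging.WARNING)

# Optionally turn on verbose output for Aim's own loggers only.
logging.getLogger("aim").setLevel(logging.DEBUG)
```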
Hi @mihran113, I did not see any warnings in the client progress log. I will check further, and if I see something I will update you.
Hi guys, hope this helps; I received this error from one training session:
Remote Server is unavailable, please check network connection: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "Broken pipe"
    debug_error_string = "UNKNOWN:Error received from peer {created_time:"2023-05-02T12:04:53.795531307+03:00", grpc_status:14, grpc_message:"Broken pipe"}"
and in another one I get:
Exception in thread Thread-3 (worker):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/igor/projects/ocr_ml_pipeline/venv/lib/python3.10/site-packages/aim/ext/transport/rpc_queue.py", line 55, in worker
    if self._try_exec_task(task_f, *args):
  File "/home/igor/projects/ocr_ml_pipeline/venv/lib/python3.10/site-packages/aim/ext/transport/rpc_queue.py", line 81, in _try_exec_task
    task_f(*args)
  File "/home/igor/projects/ocr_ml_pipeline/venv/lib/python3.10/site-packages/aim/ext/transport/client.py", line 299, in _run_write_instructions
    raise_exception(response.exception)
  File "/home/igor/projects/ocr_ml_pipeline/venv/lib/python3.10/site-packages/aim/ext/transport/message_utils.py", line 76, in raise_exception
    raise exception(*args) if args else exception()
aim.ext.transport.message_utils.UnauthorizedRequestError:
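Until the root cause is found, one defensive option is to wrap tracking calls in a retry so that a transient Broken pipe or UnauthorizedRequestError does not take training down. `track_with_retry` below is a hypothetical helper around the public `run.track` API (it won't cover the Lightning adapter, which tracks internally), and the retry count and delays are arbitrary:

```python
import logging
import time

logger = logging.getLogger(__name__)

def track_with_retry(run, value, name, step=None, retries=3, delay=1.0):
    """Retry a tracking call a few times before giving up, so a transient
    network failure does not stall or kill the training process."""
    for attempt in range(1, retries + 1):
        try:
            run.track(value, name=name, step=step)
            return
        except Exception as exc:  # e.g. transport/RPC errors from the client
            logger.warning("track failed (attempt %d/%d): %s", attempt, retries, exc)
            time.sleep(delay * attempt)
    logger.error("dropping metric %r after %d failed attempts", name, retries)
```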
I'm experiencing similar errors. Are there any fixes on branches off develop? This one is a deal-breaker for us...
We get the same UnauthorizedRequestError thrown, and our training thread blocks indefinitely trying to push onto the RPC queue:
Thread 0x7F1339FEB480 (idle): "MainThread"
wait (threading.py:320)
register_task (aim/ext/transport/rpc_queue.py:42)
flush_instructions_batch (aim/ext/transport/client.py:310)
atomic_track (aim/sdk/repo.py:939)
__exit__ (contextlib.py:142)
_track (aim/sdk/tracker.py:120)
__call__ (aim/sdk/tracker.py:104)
track (aim/sdk/run.py:414)
wrapper (aim/ext/exception_resistant.py:68)
log_metrics (aim/sdk/adapters/pytorch_lightning.py:144)
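The dump above looks like py-spy output (`py-spy dump --pid <PID>` prints traces in this format). As a pure-Python alternative for catching the hang in the act, faulthandler can be armed to dump every thread's stack if the process is still stuck after a timeout; the 600-second interval below is arbitrary:

```python
import faulthandler
import sys

# If the process is still running 600 seconds from now, dump all thread
# stacks to stderr and re-arm, so an indefinite block inside
# rpc_queue.register_task becomes visible in the logs.
faulthandler.dump_traceback_later(600, repeat=True, file=sys.stderr)
```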