Train progress not updating after some iterations
🐛 Bug
After some number of training iterations, the training progress stops updating in the UI.
[Screenshot: Aim UI]
[Screenshot: console progress]
Environment
- Aim Version 3.17.3
- Python version 3.10
- pip version 22.3.1
- OS Ubuntu 18.04 LTS
- torch 2.0.0
- pytorch lightning 2.0.1
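For reference, the failing setup roughly looks like this. This is a minimal sketch, not the actual training code: the server address, experiment name, model, and data are placeholders.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl
from aim.pytorch_lightning import AimLogger

class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(8, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.mse_loss(self.layer(x), y)
        self.log("train_loss", loss)  # routed through AimLogger.log_metrics
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

# Synthetic data so the sketch is self-contained.
data = TensorDataset(torch.randn(256, 8), torch.randn(256, 1))
loader = DataLoader(data, batch_size=32)

# An 'aim://host:port' repo points the logger at a remote tracking server;
# the address below is a placeholder.
aim_logger = AimLogger(repo="aim://my-aim-server:53800", experiment="repro")
trainer = pl.Trainer(max_epochs=5, logger=aim_logger, log_every_n_steps=1)
trainer.fit(TinyModel(), loader)
```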
Hey @igor-byel! Do you see any errors/warnings in the terminal running `aim up`? Is this a random issue, or does it happen consistently?
- "Do you see any errors/warnings in terminal running aim up?"- No i do not see any error in console nor Aim up nor aim server nor on the client side
-
"Is this a random issue or it happens consistently?"-No sometimes it works as needed.
-
also my guess it is somehow correlates with updating frequence and number of updates client doing
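If it does correlate with update frequency, one quick way to test that hypothesis is to lower the logging rate. `log_every_n_steps` is a standard Lightning Trainer argument; the value 50 here is arbitrary.

```python
import pytorch_lightning as pl

# Push metrics to the logger every 50 steps instead of every step,
# cutting the number of write RPCs sent to the remote Aim server.
trainer = pl.Trainer(log_every_n_steps=50)
```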
@igor-byel got it! Thanks for the additional info.
@mihran113, it seems the issue is related to remote tracking. Can you please take a look? Do you recall similar issues happening?
Hey @igor-byel! The messages on the server side indicate that some runs were terminated forcefully, or that the network was down for a long period of time. Is the client process log level set to WARNING? Could it be that client-side warnings haven't been displayed?
The warning should have looked something like this:
'Network connection between client `{}` and server `{}` appears to be absent.'
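If the client process has its log level above WARNING, that message would be silently dropped. A minimal sketch to make sure it surfaces, assuming Aim's transport code uses the standard logging module with loggers under the "aim" namespace (which the file paths in the traces suggest):

```python
import logging

# Surface WARNING-level messages from all loggers, including Aim's
# transport layer, so connectivity warnings are not swallowed.
logging.basicConfig(level=logging.WARNING)

# Optionally turn on verbose output for Aim's own loggers only.
logging.getLogger("aim").setLevel(logging.DEBUG)
```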
Hi @mihran113, I did not see any warnings in the client progress log. I will check further, and if I see something I will update you.
Hi guys, hope this helps; I received this error from one training session:
Remote Server is unavailable, please check network connection: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "Broken pipe"
    debug_error_string = "UNKNOWN:Error received from peer {created_time:"2023-05-02T12:04:53.795531307+03:00", grpc_status:14, grpc_message:"Broken pipe"}"
and in another one I get:
Exception in thread Thread-3 (worker):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/igor/projects/ocr_ml_pipeline/venv/lib/python3.10/site-packages/aim/ext/transport/rpc_queue.py", line 55, in worker
    if self._try_exec_task(task_f, *args):
  File "/home/igor/projects/ocr_ml_pipeline/venv/lib/python3.10/site-packages/aim/ext/transport/rpc_queue.py", line 81, in _try_exec_task
    task_f(*args)
  File "/home/igor/projects/ocr_ml_pipeline/venv/lib/python3.10/site-packages/aim/ext/transport/client.py", line 299, in _run_write_instructions
    raise_exception(response.exception)
  File "/home/igor/projects/ocr_ml_pipeline/venv/lib/python3.10/site-packages/aim/ext/transport/message_utils.py", line 76, in raise_exception
    raise exception(*args) if args else exception()
aim.ext.transport.message_utils.UnauthorizedRequestError:
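Until the root cause is found, one defensive option is to wrap tracking calls in a retry so that a transient Broken pipe or UnauthorizedRequestError does not take training down. `track_with_retry` below is a hypothetical helper around the public `run.track` API (it won't cover the Lightning adapter, which tracks internally), and the retry count and delays are arbitrary:

```python
import logging
import time

logger = logging.getLogger(__name__)

def track_with_retry(run, value, name, step=None, retries=3, delay=1.0):
    """Retry a tracking call a few times before giving up, so a transient
    network failure does not stall or kill the training process."""
    for attempt in range(1, retries + 1):
        try:
            run.track(value, name=name, step=step)
            return
        except Exception as exc:  # e.g. transport/RPC errors from the client
            logger.warning("track failed (attempt %d/%d): %s", attempt, retries, exc)
            time.sleep(delay * attempt)
    logger.error("dropping metric %r after %d failed attempts", name, retries)
```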
I'm experiencing similar errors. Are there any fixes on branches off develop? This one is a deal-breaker for us...
We get the same UnauthorizedRequestError thrown, and our training thread blocks indefinitely trying to push onto the RPC queue:
Thread 0x7F1339FEB480 (idle): "MainThread"
wait (threading.py:320)
register_task (aim/ext/transport/rpc_queue.py:42)
flush_instructions_batch (aim/ext/transport/client.py:310)
atomic_track (aim/sdk/repo.py:939)
__exit__ (contextlib.py:142)
_track (aim/sdk/tracker.py:120)
__call__ (aim/sdk/tracker.py:104)
track (aim/sdk/run.py:414)
wrapper (aim/ext/exception_resistant.py:68)
log_metrics (aim/sdk/adapters/pytorch_lightning.py:144)
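The dump above looks like py-spy output (`py-spy dump --pid <PID>` prints traces in this format). As a pure-Python alternative for catching the hang in the act, faulthandler can be armed to dump every thread's stack if the process is still stuck after a timeout; the 600-second interval below is arbitrary:

```python
import faulthandler
import sys

# If the process is still running 600 seconds from now, dump all thread
# stacks to stderr and re-arm, so an indefinite block inside
# rpc_queue.register_task becomes visible in the logs.
faulthandler.dump_traceback_later(600, repeat=True, file=sys.stderr)
```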