train progress not updating after some iteration

Open igor-byel opened this issue 1 year ago • 8 comments

🐛 Bug

After some number of training iterations, the training progress stops updating in the UI.

Aim UI: Screenshot from 2023-04-24 14-33-46

Console progress: Screenshot from 2023-04-24 14-34-14

Environment

  • Aim Version 3.17.3
  • Python version 3.10
  • pip version 22.3.1
  • OS Ubuntu 18.04 LTS
  • torch 2.0.0
  • pytorch lightning 2.0.1

igor-byel avatar Apr 24 '23 11:04 igor-byel

Hey @igor-byel! Do you see any errors/warnings in terminal running aim up? Is this a random issue or it happens consistently?

alberttorosyan avatar Apr 24 '23 11:04 alberttorosyan

Hi alberttorosyan

  • "Do you see any errors/warnings in terminal running aim up?" - No, I do not see any errors in the console: not from aim up, not from the aim server, and not on the client side.

Screenshot from 2023-04-24 15-00-04

  • "Is this a random issue or it happens consistently?" - It does not happen consistently; sometimes it works as expected.

  • Also, my guess is that it somehow correlates with the update frequency and the number of updates the client is sending.

igor-byel avatar Apr 24 '23 12:04 igor-byel

@igor-byel got it! Thanks for the additional info.

@mihran113, it seems the issue is related to remote tracking. Can you please take a look? Do you recall similar issues happening?

alberttorosyan avatar Apr 24 '23 12:04 alberttorosyan

Hey @igor-byel! The messages on server side indicate that some runs were terminated forcefully, or the network was gone for a long period of time. Is the client process log level on warning? Might there be a case that client side warnings haven't been displayed?

it should have been something like this:

'Network connection between client `{}` and server `{}` appears to be absent.'
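One way to rule out hidden client-side warnings is to check the logging configuration of the training process before the run starts. The snippet below is a minimal sketch using only the standard `logging` module. The helper name `ensure_warnings_visible` is hypothetical, and the logger name `"aim"` is an assumption about how the client names its loggers, so adjust it if the warnings are emitted under a different namespace.

```python
import logging


def ensure_warnings_visible(logger_name: str) -> logging.Logger:
    """Make sure WARNING records from `logger_name` are not silently dropped."""
    logger = logging.getLogger(logger_name)

    # If this logger (or the nearest configured ancestor) is set above WARNING,
    # e.g. ERROR, warning messages would be filtered out before any handler runs.
    if logger.getEffectiveLevel() > logging.WARNING:
        logger.setLevel(logging.WARNING)

    # Attach a stderr handler only if nothing in the hierarchy handles records.
    if not logger.hasHandlers():
        logger.addHandler(logging.StreamHandler())

    return logger


# Assumption: the client library logs under the "aim" namespace.
client_logger = ensure_warnings_visible("aim")
```

Calling this once at the top of the training script should be enough. If the warning still never shows up, the client most likely did not emit it.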

mihran113 avatar Apr 25 '23 13:04 mihran113

Hi @mihran113, I did not see any warnings in the client progress log. I will check further, and if I see something I will update you.

igor-byel avatar Apr 26 '23 16:04 igor-byel

Hi guys, I hope this helps. I received the following error from one training session:

Remote Server is unavailable, please check network connection: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "Broken pipe"
    debug_error_string = "UNKNOWN:Error received from peer {created_time:"2023-05-02T12:04:53.795531307+03:00", grpc_status:14, grpc_message:"Broken pipe"}"

And in another one I got:

Exception in thread Thread-3 (worker):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/igor/projects/ocr_ml_pipeline/venv/lib/python3.10/site-packages/aim/ext/transport/rpc_queue.py", line 55, in worker
    if self._try_exec_task(task_f, *args):
  File "/home/igor/projects/ocr_ml_pipeline/venv/lib/python3.10/site-packages/aim/ext/transport/rpc_queue.py", line 81, in _try_exec_task
    task_f(*args)
  File "/home/igor/projects/ocr_ml_pipeline/venv/lib/python3.10/site-packages/aim/ext/transport/client.py", line 299, in _run_write_instructions
    raise_exception(response.exception)
  File "/home/igor/projects/ocr_ml_pipeline/venv/lib/python3.10/site-packages/aim/ext/transport/message_utils.py", line 76, in raise_exception
    raise exception(*args) if args else exception()
aim.ext.transport.message_utils.UnauthorizedRequestError:

igor-byel avatar May 02 '23 09:05 igor-byel

I'm experiencing similar errors. Are there any solutions on branches off develop? This one is a deal breaker for us...

hstojic avatar Aug 17 '23 08:08 hstojic

We get the same UnauthorizedRequestError thrown, and our training thread then blocks indefinitely trying to push onto the RPC queue:

Thread 0x7F1339FEB480 (idle): "MainThread"
    wait (threading.py:320)
    register_task (aim/ext/transport/rpc_queue.py:42)
    flush_instructions_batch (aim/ext/transport/client.py:310)
    atomic_track (aim/sdk/repo.py:939)
    __exit__ (contextlib.py:142)
    _track (aim/sdk/tracker.py:120)
    __call__ (aim/sdk/tracker.py:104)
    track (aim/sdk/run.py:414)
    wrapper (aim/ext/exception_resistant.py:68)
    log_metrics (aim/sdk/adapters/pytorch_lightning.py:144)
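
The stack above is consistent with a producer waiting on a bounded queue whose worker thread has already died from an uncaught exception. The following toy sketch illustrates that failure mode and one way to avoid blocking forever. It is not Aim's implementation: the class `BoundedTaskQueue` and its `timeout` parameter are hypothetical, and only the standard `queue` and `threading` modules are used.

```python
import queue
import threading


class BoundedTaskQueue:
    """Toy stand-in for a client-side task queue (not Aim's actual code)."""

    def __init__(self, max_size: int = 2):
        self._tasks = queue.Queue(maxsize=max_size)
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            task = self._tasks.get()
            # An uncaught exception here ends the worker thread for good,
            # so nothing will ever drain the queue again.
            task()

    def register_task(self, task, timeout: float = 1.0) -> bool:
        """Enqueue `task`; return False instead of blocking indefinitely."""
        if not self._worker.is_alive():
            return False  # worker is gone, the task would never run
        try:
            self._tasks.put(task, timeout=timeout)
            return True
        except queue.Full:
            return False  # queue did not drain within `timeout` seconds
```

With a plain blocking `put` and no liveness check, the caller would wait forever once the worker is dead and the queue is full, which matches the `wait (threading.py:320)` frame at the top of the trace.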

howieyoo avatar Aug 23 '23 21:08 howieyoo