Connection stuck without any error message
I have used W&B local for years. Recently, the client sometimes gets stuck when a training run finishes.
There are no error or warning messages at all. The last terminal messages I get on the client side are:
Terminal Messages
wandb: Synced RUN_NAME: SERVER_ADDRESS
wandb: Synced 7 W&B file(s), 58800 media file(s), 0 artifact file(s) and 3 other file(s)
wandb: Find logs at: ../artifacts/wandb/run-20220730_190417-2ql6zqpk/logs
Environment & Version
My local instance runs on Ubuntu 16.04, and its version is 0.15.0.
My client runs on Ubuntu 16.04 or 18.04, and its version is 0.12.21.
I really enjoy using W&B local. Thank you for developing this awesome MLOps tool. I hope this issue can be reproduced and solved soon.
It looks like you logged 58800 media files, such as images or video. That will take a long time to upload and might fill up or overwhelm your disk. You should reduce the number of media files logged, or purchase a license for a commercial version that can connect to cloud storage.
The total size of these media files is small, about 75 MB.
The synchronization also gets stuck on runs without any media files.
You can find details about what our process is doing by looking at the wandb/debug-internal.log file, relative to your script. We would need to see that to understand what is making the process stall.
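For anyone collecting this log, here is a minimal Python sketch for grabbing the end of the newest internal log. It assumes each run directory keeps its log at run-*/logs/debug-internal.log under the wandb directory, as the "Find logs at" message above suggests. The function name tail_latest_internal_log is hypothetical and not part of wandb.

```python
from collections import deque
from pathlib import Path


def tail_latest_internal_log(wandb_dir, n=50):
    """Return (path, last n lines) of the most recently modified debug-internal.log.

    Assumes the layout <wandb_dir>/run-*/logs/debug-internal.log.
    Returns (None, []) when no such file is found.
    """
    logs = sorted(
        Path(wandb_dir).glob("run-*/logs/debug-internal.log"),
        key=lambda p: p.stat().st_mtime,
    )
    if not logs:
        return None, []
    latest = logs[-1]
    with latest.open(errors="replace") as fh:
        # deque with maxlen keeps only the final n lines while streaming the file
        lines = [line.rstrip("\n") for line in deque(fh, maxlen=n)]
    return latest, lines


if __name__ == "__main__":
    # "../artifacts/wandb" is the directory from the terminal output above;
    # adjust it to wherever your script writes its wandb folder.
    path, lines = tail_latest_internal_log("../artifacts/wandb", n=50)
    print(path)
    print("\n".join(lines))
```

Pasting the printed tail into the thread is usually enough to see where shutdown stopped.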
This is the log from wandb/debug-internal.log. I copied the INFO and DEBUG lines that follow the last logged "scan save".
2022-08-16 12:36:41,934 INFO SenderThread:4422 [sender.py:transition_state():459] send defer: 8
2022-08-16 12:36:41,935 DEBUG SenderThread:4422 [sender.py:send_request():316] send_request: poll_exit
2022-08-16 12:36:41,941 DEBUG HandlerThread:4422 [handler.py:handle_request():141] handle_request: defer
2022-08-16 12:36:41,941 INFO HandlerThread:4422 [handler.py:handle_request_defer():164] handle defer: 8
2022-08-16 12:36:41,942 DEBUG SenderThread:4422 [sender.py:send_request():316] send_request: defer
2022-08-16 12:36:41,942 INFO SenderThread:4422 [sender.py:send_request_defer():455] handle sender defer: 8
2022-08-16 12:36:41,942 INFO SenderThread:4422 [file_pusher.py:finish():171] shutting down file pusher
2022-08-16 12:36:42,044 DEBUG HandlerThread:4422 [handler.py:handle_request():141] handle_request: poll_exit
2022-08-16 12:36:42,044 DEBUG SenderThread:4422 [sender.py:send_request():316] send_request: poll_exit
2022-08-16 12:36:42,142 INFO Thread-11 :4422 [sender.py:transition_state():459] send defer: 9
2022-08-16 12:36:42,143 DEBUG HandlerThread:4422 [handler.py:handle_request():141] handle_request: defer
2022-08-16 12:36:42,143 INFO HandlerThread:4422 [handler.py:handle_request_defer():164] handle defer: 9
2022-08-16 12:36:42,143 DEBUG SenderThread:4422 [sender.py:send_request():316] send_request: defer
2022-08-16 12:36:42,143 INFO SenderThread:4422 [sender.py:send_request_defer():455] handle sender defer: 9
2022-08-16 12:36:42,154 DEBUG HandlerThread:4422 [handler.py:handle_request():141] handle_request: poll_exit
2022-08-16 12:36:42,213 INFO SenderThread:4422 [sender.py:transition_state():459] send defer: 10
2022-08-16 12:36:42,214 DEBUG SenderThread:4422 [sender.py:send_request():316] send_request: poll_exit
2022-08-16 12:36:42,229 DEBUG HandlerThread:4422 [handler.py:handle_request():141] handle_request: defer
2022-08-16 12:36:42,229 INFO HandlerThread:4422 [handler.py:handle_request_defer():164] handle defer: 10
2022-08-16 12:36:42,229 DEBUG SenderThread:4422 [sender.py:send_request():316] send_request: defer
2022-08-16 12:36:42,230 INFO SenderThread:4422 [sender.py:send_request_defer():455] handle sender defer: 10
2022-08-16 12:36:42,230 INFO SenderThread:4422 [sender.py:transition_state():459] send defer: 11
2022-08-16 12:36:42,231 DEBUG SenderThread:4422 [sender.py:send():302] send: final
2022-08-16 12:36:42,231 DEBUG HandlerThread:4422 [handler.py:handle_request():141] handle_request: defer
2022-08-16 12:36:42,231 DEBUG SenderThread:4422 [sender.py:send():302] send: footer
2022-08-16 12:36:42,231 INFO HandlerThread:4422 [handler.py:handle_request_defer():164] handle defer: 11
2022-08-16 12:36:42,232 DEBUG SenderThread:4422 [sender.py:send_request():316] send_request: defer
2022-08-16 12:36:42,232 INFO SenderThread:4422 [sender.py:send_request_defer():455] handle sender defer: 11
2022-08-16 12:36:42,332 DEBUG HandlerThread:4422 [handler.py:handle_request():141] handle_request: poll_exit
2022-08-16 12:36:42,335 DEBUG SenderThread:4422 [sender.py:send_request():316] send_request: poll_exit
2022-08-16 12:36:42,355 INFO SenderThread:4422 [file_pusher.py:join():176] waiting for file pusher
2022-08-16 12:36:42,474 DEBUG HandlerThread:4422 [handler.py:handle_request():141] handle_request: sampled_history
2022-08-16 12:36:42,489 DEBUG HandlerThread:4422 [handler.py:handle_request():141] handle_request: get_summary
2022-08-16 12:36:42,491 DEBUG HandlerThread:4422 [handler.py:handle_request():141] handle_request: shutdown
2022-08-16 12:36:42,491 INFO HandlerThread:4422 [handler.py:finish():810] shutting down handler
2022-08-16 12:36:43,231 INFO WriterThread:4422 [datastore.py:close():279] close: ../artifacts/wandb/run-20220815_150813-2jk5mgtn/run-2jk5mgtn.wandb
2022-08-16 12:36:43,372 INFO SenderThread:4422 [sender.py:finish():1312] shutting down sender
2022-08-16 12:36:43,372 INFO SenderThread:4422 [file_pusher.py:finish():171] shutting down file pusher
2022-08-16 12:36:43,372 INFO SenderThread:4422 [file_pusher.py:join():176] waiting for file pusher
2022-08-16 12:36:48,400 INFO MainThread:4422 [internal.py:handle_exit():80] Internal process exited
Thank you for your time. I hope this issue can be solved soon.
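One way to see where a log like this stalls is to look for the largest gaps between consecutive timestamps. The following is only a sketch: it assumes every log line starts with a "YYYY-MM-DD HH:MM:SS,mmm" timestamp as in the excerpt above, and the helper names parse_ts and largest_gaps are hypothetical.

```python
import re
from datetime import datetime

# Matches the leading "2022-08-16 12:36:41,934" style timestamp.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")


def parse_ts(line):
    """Return the line's timestamp as a datetime, or None if it has none."""
    m = TS.match(line)
    if not m:
        return None
    base = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return base.replace(microsecond=int(m.group(2)) * 1000)


def largest_gaps(lines, top=3):
    """Return up to `top` (gap_seconds, line_before, line_after) tuples, largest gap first."""
    stamped = [(parse_ts(line), line) for line in lines]
    stamped = [(t, line) for t, line in stamped if t is not None]
    gaps = [
        ((t1 - t0).total_seconds(), l0, l1)
        for (t0, l0), (t1, l1) in zip(stamped, stamped[1:])
    ]
    gaps.sort(key=lambda g: g[0], reverse=True)
    return gaps[:top]


if __name__ == "__main__":
    sample = [
        "2022-08-16 12:36:43,372 INFO SenderThread:4422 [file_pusher.py:join():176] waiting for file pusher",
        "2022-08-16 12:36:48,400 INFO MainThread:4422 [internal.py:handle_exit():80] Internal process exited",
    ]
    for seconds, before, after in largest_gaps(sample):
        print(f"{seconds:.3f}s between:\n  {before}\n  {after}")
```

In the excerpt above, the largest gap is about five seconds, between the last "waiting for file pusher" line and "Internal process exited".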
I got a different message today, shown below for your reference.
2022-08-23 10:29:30,880 INFO SenderThread:15179 [sender.py:transition_state():459] send defer: 8
2022-08-23 10:29:30,880 INFO SenderThread:15179 [sender.py:finish():1312] shutting down sender
2022-08-23 10:29:30,880 INFO SenderThread:15179 [file_pusher.py:finish():171] shutting down file pusher
2022-08-23 10:29:30,880 INFO SenderThread:15179 [file_pusher.py:join():176] waiting for file pusher
2022-08-23 10:29:31,006 INFO WriterThread:15179 [datastore.py:close():279] close: ../artifacts/wandb/run-20220822_185928-2l3ms0qk/run-2l3ms0qk.wandb
2022-08-23 10:29:31,452 ERROR StreamThr :15179 [internal.py:wandb_internal():165] Thread HandlerThread:
Traceback (most recent call last):
  File "/home/jiayau/anaconda3/envs/tf-2.7/lib/python3.8/site-packages/wandb/sdk/internal/internal_util.py", line 51, in run
    self._run()
  File "/home/jiayau/anaconda3/envs/tf-2.7/lib/python3.8/site-packages/wandb/sdk/internal/internal_util.py", line 98, in _run
    record = self._input_record_q.get(timeout=1)
  File "/home/jiayau/anaconda3/envs/tf-2.7/lib/python3.8/multiprocessing/queues.py", line 111, in get
    res = self._recv_bytes()
  File "/home/jiayau/anaconda3/envs/tf-2.7/lib/python3.8/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/home/jiayau/anaconda3/envs/tf-2.7/lib/python3.8/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/jiayau/anaconda3/envs/tf-2.7/lib/python3.8/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
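For context on the traceback above: the EOFError is raised in Python's multiprocessing connection layer, which does this when a read finds nothing left to receive and the other end of the pipe has been closed. The sketch below reproduces that standard-library behavior in isolation. It does not show why the peer went away in this run, and the function name recv_after_peer_closed is hypothetical.

```python
import multiprocessing as mp


def recv_after_peer_closed():
    """Read from a pipe whose writing end is already closed and report what happens."""
    recv_end, send_end = mp.Pipe(duplex=False)
    send_end.close()  # simulate the writing side going away
    try:
        recv_end.recv_bytes()
    except EOFError:
        # Nothing left to receive and the other end is closed.
        return "EOFError"
    finally:
        recv_end.close()
    return "no error"


if __name__ == "__main__":
    print(recv_after_peer_closed())
```

So the traceback is consistent with the process on the other end of the queue having exited or closed its side while HandlerThread was still reading, though the log alone does not confirm which.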