[CLI]: BrokenPipeError: [Errno 32] Broken pipe
Bug description
Training and logging run fine; however, at the end of the process, wandb outputs the error message below.
wandb: Waiting for W&B process to finish... (success).
wandb: \ 0.014 MB of 0.014 MB uploaded (0.000 MB deduped)
wandb: Run history:
wandb: epoch ▁▁▁▁▁▁▁▁▃▃▃▃▃▃▃▃▅▅▅▅▅▅▅▅▆▆▆▆▆▆▆▆████████
wandb: train_loss █▄▃▂▂▃▄▃▁▂▂▄▄▂▃▂▁▂▁▁▁▁▁▁▃▁▁▁▁▂▁▁▁▁▁▅▁▁▁▃
wandb: trainer/global_step ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb: val_Mac-F1 ▁▄▆▅█
wandb: val_Mic-F1 ▁▄▇▇█
wandb: val_Wei-F1 ▁▅▇▇█
wandb: val_loss █▂▁▄▆
wandb:
wandb: Run summary:
wandb: epoch 4
wandb: train_loss 0.47728
wandb: trainer/global_step 5534
wandb: val_Mac-F1 0.70413
wandb: val_Mic-F1 0.88889
wandb: val_Wei-F1 0.93459
wandb: val_loss 0.46428
wandb:
wandb: 🚀 View run BERT_WEBKB_0_exp at: https://wandb.ai/celsofranca/lightning_logs/runs/1qq5guxx
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: /tmp/wandb/run-20231012_134638-1qq5guxx/logs
Exception in thread NetStatThr:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 267, in check_network_status
    self._loop_check_status(
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 223, in _loop_check_status
    local_handle = request()
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 735, in deliver_network_status
    return self._deliver_network_status(status)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 475, in _deliver_network_status
    return self._deliver_record(record)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 428, in _deliver_record
    handle = mailbox._deliver_record(record, interface=self)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
    interface._publish(record)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe

Exception in thread IntMsgThr:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 299, in check_internal_messages
    self._loop_check_status(
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 223, in _loop_check_status
    local_handle = request()
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 743, in deliver_internal_messages
    return self._deliver_internal_messages(internal_message)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 481, in _deliver_internal_messages
    return self._deliver_record(record)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 428, in _deliver_record
    handle = mailbox._deliver_record(record, interface=self)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
    interface._publish(record)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
Additional Files
No response
Environment
- WandB version: 0.15.12
- OS: Ubuntu 20.04
- Python version: Python 3.8.10
- Versions of relevant libraries: pytorch-lightning==2.0.9
Additional Context
No response
Hi @celsofranssa,
I'll be happy to assist you with this inquiry. We've received it, and we will investigate and get back to you with updates.
Regards, Carlo Argel
I'm running into the same issue; waiting for a fix.
Hi @celsofranssa
Reaching back from the support team. The error that you are encountering right now is a bit troubling. Can you provide the following please?
- Code snippet of how you are setting the job type
- Link to your run workspace if available
- The debug.log and debug-internal.log files of the failing run. These are located in the run folder's logs subdirectory inside your wandb working directory (wandb/)
Regards, Carlo Argel
Hi @celsofranssa
Reaching back from the support team, I just want to follow up on the items in the thread above.
Thank you, Carlo Argel
Hi @celsofranssa , since we have not heard back from you we are going to close this request. If you would like to re-open the conversation, please let us know!
Hello, I've run into the "Broken Pipe" issue this week as well, training with Ludwig. I'll rerun my training with wandb enabled today, and paste the backtrace once I have it.
Sorry for the delay, @Carlo-Argel. I've run the code again and have the full backtrace.
My use case is finetuning a Mistral 7B model with the ludwig package. I'm using the built-in callback for Ludwig like this:
model = LudwigModel(
    config=fine_tuning_config,
    logging_level=logging.INFO,
    callbacks=[WandbCallback()],  # FIXME: This fails with "socket closed"
)
The training is initialized correctly, as is W&B. But after a couple of steps, the process crashes with wandb.sdk.lib.mailbox.MailboxError: transport failed, raised from this line: https://github.com/wandb/wandb/blob/57d16d88197378c4803e63a7bcd5debe74bc8f33/wandb/sdk/lib/mailbox.py#L281 The initial call is from the Ludwig codebase here. The full backtrace is below.
Full backtrace
wandb.on_train_init() called...
Finishing last run (ID:e41q9vzv) before initializing another...
Problem at: /usr/local/lib/python3.10/dist-packages/ludwig/contribs/wandb.py 41 on_train_init
Training: 4%|▍ | 62/1625 [04:33<1:54:53, 4.41s/it, loss=nan]
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 2211, in _atexit_cleanup
self._on_finish()
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 2451, in _on_finish
_ = exit_handle.wait(timeout=-1, on_progress=self._on_progress_exit)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/mailbox.py", line 298, in wait
on_probe(probe_handle)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 2415, in _on_probe_exit
result = handle.wait(timeout=0, release=False)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/mailbox.py", line 281, in wait
raise MailboxError("transport failed")
wandb.sdk.lib.mailbox.MailboxError: transport failed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 1166, in init
run = wi.init()
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 599, in init
latest_run.finish()
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 420, in wrapper
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 361, in wrapper
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 1953, in finish
return self._finish(exit_code, quiet)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 1968, in _finish
self._atexit_cleanup(exit_code=exit_code)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 2222, in _atexit_cleanup
self._backend.cleanup()
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/backend/backend.py", line 232, in cleanup
self.interface.join()
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py", line 531, in join
super().join()
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface.py", line 703, in join
_ = self._communicate_shutdown()
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py", line 428, in _communicate_shutdown
_ = self._communicate(record)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py", line 294, in _communicate
return self._communicate_async(rec, local=local).get(timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_sock.py", line 60, in _communicate_async
future = self._router.send_and_receive(rec, local=local)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/router.py", line 94, in send_and_receive
self._send_message(rec)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/router_sock.py", line 36, in _send_message
self._sock_client.send_record_communicate(record)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 216, in send_record_communicate
self.send_server_request(server_req)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
self._send_message(msg)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
self._sendall_with_error_handle(header + data)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
---------------------------------------------------------------------------
MailboxError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py in _atexit_cleanup(self, exit_code)
2210 try:
-> 2211 self._on_finish()
2212 except KeyboardInterrupt as ki:
25 frames
/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py in _on_finish(self)
2450
-> 2451 _ = exit_handle.wait(timeout=-1, on_progress=self._on_progress_exit)
2452
/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/mailbox.py in wait(self, timeout, on_probe, on_progress, release, cancel)
297 if on_probe and probe_handle:
--> 298 on_probe(probe_handle)
299 if on_progress and progress_handle:
/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py in _on_probe_exit(self, probe_handle)
2414 if handle:
-> 2415 result = handle.wait(timeout=0, release=False)
2416 if not result:
/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/mailbox.py in wait(self, timeout, on_probe, on_progress, release, cancel)
280 if self._interface._transport_keepalive_failed():
--> 281 raise MailboxError("transport failed")
282
MailboxError: transport failed
During handling of the above exception, another exception occurred:
BrokenPipeError Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py in init(job_type, dir, config, project, entity, reinit, tags, group, name, notes, magic, config_exclude_keys, config_include_keys, anonymous, mode, allow_val_change, resume, force, tensorboard, sync_tensorboard, monitor_gym, save_code, id, settings)
1165 try:
-> 1166 run = wi.init()
1167 except_exit = wi.settings._except_exit
/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py in init(self)
598
--> 599 latest_run.finish()
600
/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py in wrapper(self, *args, **kwargs)
419
--> 420 return func(self, *args, **kwargs)
421
/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py in wrapper(self, *args, **kwargs)
360 cls._is_attaching = ""
--> 361 return func(self, *args, **kwargs)
362
/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py in finish(self, exit_code, quiet)
1952 """
-> 1953 return self._finish(exit_code, quiet)
1954
/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py in _finish(self, exit_code, quiet)
1967
-> 1968 self._atexit_cleanup(exit_code=exit_code)
1969 if self._wl and len(self._wl._global_run_stack) > 0:
/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py in _atexit_cleanup(self, exit_code)
2221 self._console_stop()
-> 2222 self._backend.cleanup()
2223 logger.error("Problem finishing run", exc_info=e)
/usr/local/lib/python3.10/dist-packages/wandb/sdk/backend/backend.py in cleanup(self)
231 if self.interface:
--> 232 self.interface.join()
233 if self.wandb_process:
/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py in join(self)
530 def join(self) -> None:
--> 531 super().join()
532
/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface.py in join(self)
702 return
--> 703 _ = self._communicate_shutdown()
704
/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py in _communicate_shutdown(self)
427 record = self._make_record(request=request)
--> 428 _ = self._communicate(record)
429
/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py in _communicate(self, rec, timeout, local)
293 ) -> Optional[pb.Result]:
--> 294 return self._communicate_async(rec, local=local).get(timeout=timeout)
295
/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_sock.py in _communicate_async(self, rec, local)
59 raise Exception("The wandb backend process has shutdown")
---> 60 future = self._router.send_and_receive(rec, local=local)
61 return future
/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/router.py in send_and_receive(self, rec, local)
93
---> 94 self._send_message(rec)
95
/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/router_sock.py in _send_message(self, record)
35 def _send_message(self, record: "pb.Record") -> None:
---> 36 self._sock_client.send_record_communicate(record)
/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py in send_record_communicate(self, record)
215 server_req.record_communicate.CopyFrom(record)
--> 216 self.send_server_request(server_req)
217
/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py in send_server_request(self, msg)
154 def send_server_request(self, msg: Any) -> None:
--> 155 self._send_message(msg)
156
/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py in _send_message(self, msg)
151 with self._lock:
--> 152 self._sendall_with_error_handle(header + data)
153
/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py in _sendall_with_error_handle(self, data)
129 try:
--> 130 sent = self._sock.send(data)
131 # sent equal to 0 indicates a closed socket
BrokenPipeError: [Errno 32] Broken pipe
The above exception was the direct cause of the following exception:
Error Traceback (most recent call last)
<timed exec> in <module>
/usr/local/lib/python3.10/dist-packages/ludwig/api.py in train(self, dataset, training_set, validation_set, test_set, training_set_metadata, data_format, experiment_name, model_name, model_resume_path, skip_save_training_description, skip_save_training_statistics, skip_save_model, skip_save_progress, skip_save_log, skip_save_processed_input, output_directory, random_seed, **kwargs)
590
591 for callback in self.callbacks:
--> 592 callback.on_train_init(
593 base_config=self._user_config,
594 experiment_directory=output_directory,
/usr/local/lib/python3.10/dist-packages/ludwig/contribs/wandb.py in on_train_init(self, base_config, experiment_directory, experiment_name, model_name, output_directory, resume_directory)
39 ):
40 logger.info("wandb.on_train_init() called...")
---> 41 wandb.init(
42 project=os.getenv("WANDB_PROJECT", experiment_name),
43 name=model_name,
/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py in init(job_type, dir, config, project, entity, reinit, tags, group, name, notes, magic, config_exclude_keys, config_include_keys, anonymous, mode, allow_val_change, resume, force, tensorboard, sync_tensorboard, monitor_gym, save_code, id, settings)
1202 wandb.termerror("Abnormal program exit")
1203 os._exit(1)
-> 1204 raise Error("An unexpected error occurred") from error_seen
1205 return run
Error: An unexpected error occurred
The wandb package version is the currently latest, 0.16.0.
Is there some other detail I can provide? I can enable access to the run at wandb.ai.
Because of this error, nearly 3 days of training progress was lost mid-run, and now I have to start again. Is there an alternative, or a handler for this, or should I just store the progress locally?
@Carlo-Argel
Hi, I have the same issue "BrokenPipeError: [Errno 32] Broken pipe"
Similar error. Current SDK version is 0.16.1
2023-12-30 21:07:18,143 INFO MainThread:3025339 [wandb_init.py:init():614] starting backend
2023-12-30 21:07:18,143 INFO MainThread:3025339 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2023-12-30 21:07:18,153 INFO MainThread:3025339 [backend.py:ensure_launched():206] starting backend process...
2023-12-30 21:07:18,156 INFO MainThread:3025339 [backend.py:ensure_launched():211] started backend process with pid: 3027702
2023-12-30 21:07:18,157 INFO MainThread:3025339 [wandb_init.py:init():624] backend started and connected
2023-12-30 21:07:18,163 INFO MainThread:3025339 [wandb_init.py:init():716] updated telemetry
2023-12-30 21:07:18,165 INFO MainThread:3025339 [wandb_init.py:init():749] communicating run to backend with 90.0 second timeout
2023-12-30 21:07:23,329 ERROR MainThread:3025339 [wandb_init.py:init():1188] transport failed
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1170, in init
    run = wi.init()
  File "/home/user/.local/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 752, in init
    result = run_init_handle.wait(
  File "/home/user/.local/lib/python3.10/site-packages/wandb/sdk/lib/mailbox.py", line 281, in wait
    raise MailboxError("transport failed")
wandb.sdk.lib.mailbox.MailboxError: transport failed
I randomly get this error every now and then during training too; I assume it is related to networking issues. It would be great if internal W&B issues didn't cause the run to crash.
{'loss': 55842.3375, 'learning_rate': 0.00019748020497041964, 'epoch': 1.32}
{'loss': 55757.2188, 'learning_rate': 0.0001974556426587668, 'epoch': 1.33}
9%|██████████▋ | 2587/29100 [1:22:29<526:27:27, 71.48s/it]
Exception in thread NetStatThr:
Traceback (most recent call last):
File "/home/user/mambaforge/envs/slt/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
self.run()
File "/home/user/mambaforge/envs/slt/lib/python3.11/threading.py", line 982, in run
self._target(*self._args, **self._kwargs)
File "/home/user/mambaforge/envs/slt/lib/python3.11/site-packages/wandb/sdk/wandb_run.py", line 268, in check_network_status
self._loop_check_status(
File "/home/user/mambaforge/envs/slt/lib/python3.11/site-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status
local_handle = request()
^^^^^^^^^
File "/home/user/mambaforge/envs/slt/lib/python3.11/site-packages/wandb/sdk/interface/interface.py", line 792, in deliver_network_status
return self._deliver_network_status(status)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/mambaforge/envs/slt/lib/python3.11/site-packages/wandb/sdk/interface/interface_shared.py", line 500, in _deliver_network_status
return self._deliver_record(record)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/mambaforge/envs/slt/lib/python3.11/site-packages/wandb/sdk/interface/interface_shared.py", line 449, in _deliver_record
handle = mailbox._deliver_record(record, interface=self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user/mambaforge/envs/slt/lib/python3.11/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
interface._publish(record)
File "/home/user/mambaforge/envs/slt/lib/python3.11/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
self._sock_client.send_record_publish(record)
File "/home/user/mambaforge/envs/slt/lib/python3.11/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
self.send_server_request(server_req)
File "/home/user/mambaforge/envs/slt/lib/python3.11/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
self._send_message(msg)
File "/home/user/mambaforge/envs/slt/lib/python3.11/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
self._sendall_with_error_handle(header + data)
File "/home/user/mambaforge/envs/slt/lib/python3.11/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
sent = self._sock.send(data)
^^^^^^^^^^^^^^^^^^^^^
BrokenPipeError: [Errno 32] Broken pipe
Killed
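For what it's worth, wandb's documented offline mode can keep transient network failures from killing a run: metrics are buffered locally and uploaded later with wandb sync. A minimal sketch (train.py is a placeholder for the actual training command):

```shell
# Buffer everything locally so a flaky connection can't break the socket mid-run.
export WANDB_MODE=offline
echo "WANDB_MODE=$WANDB_MODE"        # sanity check before launching training

# python train.py                    # placeholder: your training command
# wandb sync wandb/offline-run-*     # later, upload the buffered run(s)
```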
Getting the same issue exactly as directly above
I'm also getting the same issue as above; however, the run in wandb is finished (no errors) and all the data is there too. I'm not sure how to interpret this error. It somehow also hangs the program indefinitely (although I'm not sure yet if this is a wandb issue or my queuing script).
I guess something important to note is that the queuing script I'm using makes a copy of my workspace in a temporary folder to be able to do multiprocessing. I'm not sure if this has any interaction with wandb, especially given that 95% of the runs finish normally.
Edit: I've noticed that calling run.finish() helps, and so far I've had no more errors like that. The hanging also does not seem related to wandb. I'm not sure yet.
Similar issue at the end of the process, but it does not affect anything else.
Any updates, @Carlo-Argel? This issue is killing the joy of wandb, and it is just bizarre that it takes so long to fix.
same issue here
I think one can check whether there is enough disk space for the wandb logs, and also check the internet connection. In my case it was a disk-space issue.
There is enough space. It only happens with sweeps.
Same issue. I will no longer use wandb.
> There is enough space. It only happens with sweeps.
@bolak92 could you please provide a small reproduction so we could help fix it? Thanks and sorry that you are experiencing this issue.
I think I might have found the reason. This error occurs when the script is modified while the program is running.
> I think I might have found the reason. This error occurs when the script is modified while the program is running.
Oh interesting, yeah, that could cause the system to be in a bad state. In any case, if you want us to look into it further, providing a reproduction will be the best way to help here.
Hi guys,
I think I found a temporary workaround.
For me it wasn't a space issue. Rather, I believe the processes did not stop after the script finished running. wandb used to finish those processes automatically, but now that doesn't seem to work properly.
What helped me was killing the processes both on CPU, obtained from top (straightforward killing by PID), and the not-so-obvious GPU processes (something I only learned because of this issue :) ):
- List the processes on the GPU:
lsof /dev/nvidia*
- Make sure that all the processes are yours and not some other user's.
- If they are all yours and you don't need them (you want to kill them all):
lsof /dev/nvidia* | awk '{print $2}' | xargs -I {} kill -9 {}
Now rerunning the script doesn't produce the error for me. I hope that helps.
Thanks for sharing your experience. The service is supposed to finish all active runs when the main script completes (we use an atexit hook to trigger it); if that is not happening, it is a bug.
Ideally, to always make sure your run is marked as completed, adding run.finish() at the end of the run's usage should do it.
Do you think you could provide a reproduction of your script? I'm interested to learn why the nvidia processes are still running and how we can better handle these cases.
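The run.finish() pattern described above can be sketched like this (a minimal sketch, assuming the wandb package is installed; "demo-project" and the logged metric are placeholders, and offline mode is used only to keep the example self-contained):

```python
def train_and_log():
    """Sketch: always close the wandb run, even if training raises."""
    import wandb  # assumes the wandb package is installed

    # mode="offline" keeps this sketch self-contained (no network or API key needed).
    run = wandb.init(project="demo-project", mode="offline")
    try:
        for step in range(3):
            run.log({"loss": 1.0 / (step + 1)})  # placeholder metric
    finally:
        # Marks the run completed and shuts down the service connection cleanly,
        # instead of relying solely on the atexit hook.
        run.finish()
```

The try/finally ensures the run is finished even when the training loop crashes, which avoids leaving dangling service processes behind.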
Also running into this error mid-training -- any ideas on how to solve it?
wandb: Find logs at: ./wandb/run-20240414_095858-2t5b2bol/logs
Exception in thread NetStatThr:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 268, in check_network_status
    self._loop_check_status(
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status
    local_handle = request()
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface.py", line 848, in deliver_network_status
    return self._deliver_network_status(status)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py", line 510, in _deliver_network_status
    return self._deliver_record(record)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py", line 459, in _deliver_record
    handle = mailbox._deliver_record(record, interface=self)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
    interface._publish(record)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe

Exception in thread ChkStopThr:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 286, in check_stop_status
    self._loop_check_status(
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status
    local_handle = request()
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface.py", line 840, in deliver_stop_status
    return self._deliver_stop_status(status)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py", line 494, in _deliver_stop_status
    return self._deliver_record(record)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py", line 459, in _deliver_record
    handle = mailbox._deliver_record(record, interface=self)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
    interface._publish(record)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
@pminervini could you make sure that you don't have the WANDB_SERVICE env variable set, and kill all potential wandb-service processes:
ps -ef | grep wandb-service
to get a list of these processes.
Once you clean your env, everything should work.
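The cleanup described above might look like this on a Linux shell (a sketch; the [w] bracket trick keeps grep from matching its own command line, and xargs -r is GNU xargs):

```shell
# Clear stale wandb service state before re-running a script.
unset WANDB_SERVICE   # drop a stale service address, if one is set

# List leftover wandb-service processes; '|| true' keeps a clean machine
# (no matches) from returning a non-zero exit status.
ps -ef | grep '[w]andb-service' || true

# Kill them by PID; -r makes xargs a no-op when the list is empty.
ps -ef | grep '[w]andb-service' | awk '{print $2}' | xargs -r kill
```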
@kptkin how does just adding wandb.finish() at the end of the script look to you? That would take care of dangling processes, no?
I found that when I launch a Weights & Biases (wandb) service with simulated data alone, there are no issues with the service communication. However, when I simultaneously load a model on the GPU, the wandb service immediately stops (with the same error as mentioned above). If I restart the wandb service at this point, it automatically stops again after a fixed period (about one minute). Could this be related to the load balancer?
Training /chenhui/zhangwuhan/stage2/trained_model/qwen1.5_7b_5_5e-5_2_1k_plugin 0
0%| | 0/14272 [00:06<?, ?it/s, train_loss=4.85]2024-04-27 17:18:41,774 - DEBUG - Successfully logged to WandB
0%| | 1/14272 [00:11<25:10:01, 6.35s/it, train_loss=2.9]2024-04-27 17:18:47,054 - DEBUG - Successfully logged to WandB
0%| | 2/14272 [00:16<22:40:19, 5.72s/it, train_loss=2.25]2024-04-27 17:18:52,202 - DEBUG - Successfully logged to WandB
0%| | 3/14272 [00:22<21:38:07, 5.46s/it, train_loss=2.09]2024-04-27 17:18:57,456 - DEBUG - Successfully logged to WandB
0%| | 4/14272 [00:27<21:18:49, 5.38s/it, train_loss=2.02]2024-04-27 17:19:02,601 - DEBUG - Successfully logged to WandB
0%| | 5/14272 [00:32<20:58:47, 5.29s/it, train_loss=1.88]2024-04-27 17:19:07,853 - DEBUG - Successfully logged to WandB
0%| | 6/14272 [00:38<25:34:01, 6.45s/it, train_loss=1.87]
Traceback (most recent call last):
File "/chenhui/zhangwuhan/stage2/FastChat/fastchat/train/qwen1.5_7b_5_5e-5_2_1k_plugin.py", line 290, in <module>
main()
File "/chenhui/zhangwuhan/stage2/FastChat/fastchat/train/qwen1.5_7b_5_5e-5_2_1k_plugin.py", line 283, in main
accelerator.log({"train_loss": loss.item()}, step=batch_idx)
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 602, in _inner
return PartialState().on_main_process(function)(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2267, in log
tracker.log(values, step=step, **log_kwargs.get(tracker.name, {}))
File "/usr/local/lib/python3.10/dist-packages/accelerate/tracking.py", line 86, in execute_on_main_process
return function(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/tracking.py", line 333, in log
self.run.log(values, step=step, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 420, in wrapper
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 371, in wrapper_fn
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 361, in wrapper
return func(self, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 1838, in log
self._log(data=data, step=step, commit=commit)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 1602, in _log
self._partial_history_callback(data, step, commit)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 1474, in _partial_history_callback
self._backend.interface.publish_partial_history(
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface.py", line 602, in publish_partial_history
self._publish_partial_history(partial_history)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py", line 89, in _publish_partial_history
self._publish(rec)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
self._sock_client.send_record_publish(record)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
self.send_server_request(server_req)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
self._send_message(msg)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
self._sendall_with_error_handle(header + data)
File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
wandb: While tearing down the service manager. The following error has occurred: [Errno 32] Broken pipe
2024-04-27 17:19:14,144 - DEBUG - Starting new HTTPS connection (1): o151352.ingest.sentry.io:443
2024-04-27 17:19:16,144 - DEBUG - Attempting to acquire lock 140175619989824 on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-04-27 17:19:16,144 - DEBUG - Lock 140175619989824 acquired on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-04-27 17:19:16,145 - DEBUG - Attempting to release lock 140175619989824 on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-04-27 17:19:16,145 - DEBUG - Lock 140175619989824 released on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-04-27 17:19:16,145 - DEBUG - Attempting to acquire lock 140175619989824 on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
2024-04-27 17:19:16,145 - DEBUG - Lock 140175619989824 acquired on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
2024-04-27 17:19:16,146 - DEBUG - Attempting to release lock 140175619989824 on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
2024-04-27 17:19:16,146 - DEBUG - Lock 140175619989824 released on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
I hope the wandb team can resolve this issue as soon as possible.
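A workaround sometimes used for failures like the one in the traceback above is to wrap the logging call so that a dead wandb service doesn't kill training. `safe_log` below is a hypothetical helper, not part of accelerate or wandb:

```python
def safe_log(tracker, values, step):
    """Log metrics, but don't let an unreachable wandb service kill training.

    `tracker` is anything with a .log(values, step=...) method, e.g. an
    accelerate Accelerator; BrokenPipeError is what the socket client raises
    once the wandb service process is gone.
    """
    try:
        tracker.log(values, step=step)
        return True
    except BrokenPipeError:
        print(f"wandb service unreachable at step {step}; metrics dropped")
        return False
```

For example, `safe_log(accelerator, {"train_loss": loss.item()}, step=batch_idx)` in place of the direct `accelerator.log` call from the traceback. This only skips lost metrics; it does not revive the service.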
wandb.finish()
will close the last active run, and hopefully close the service if it is the only run in the service; but if the service is left behind, the current process is not aware of it. We have a fix that should handle this, but it is not merged yet. Hopefully it lands in one of the upcoming releases.
@endNone from your description it seems like a bad interaction between your integration and wandb. If you could provide a reproduction script, it would help us actually debug this and find a fix. Thanks for all the information above. A broken pipe just means that the service is not reachable; with a repro we could figure out what went wrong.