[CLI]: BrokenPipeError: [Errno 32] Broken pipe

Open celsofranssa opened this issue 9 months ago • 40 comments

Bug description

Training and logging run fine; however, at the end of the process, wandb prints the error message below.

wandb: Waiting for W&B process to finish... (success).
wandb: \ 0.014 MB of 0.014 MB uploaded (0.000 MB deduped)
wandb: Run history:
wandb:               epoch ▁▁▁▁▁▁▁▁▃▃▃▃▃▃▃▃▅▅▅▅▅▅▅▅▆▆▆▆▆▆▆▆████████
wandb:          train_loss █▄▃▂▂▃▄▃▁▂▂▄▄▂▃▂▁▂▁▁▁▁▁▁▃▁▁▁▁▂▁▁▁▁▁▅▁▁▁▃
wandb: trainer/global_step ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:          val_Mac-F1 ▁▄▆▅█
wandb:          val_Mic-F1 ▁▄▇▇█
wandb:          val_Wei-F1 ▁▅▇▇█
wandb:            val_loss █▂▁▄▆
wandb: 
wandb: Run summary:
wandb:               epoch 4
wandb:          train_loss 0.47728
wandb: trainer/global_step 5534
wandb:          val_Mac-F1 0.70413
wandb:          val_Mic-F1 0.88889
wandb:          val_Wei-F1 0.93459
wandb:            val_loss 0.46428
wandb: 
wandb: 🚀 View run BERT_WEBKB_0_exp at: https://wandb.ai/celsofranca/lightning_logs/runs/1qq5guxx
wandb: Synced 5 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: /tmp/wandb/run-20231012_134638-1qq5guxx/logs
Exception in thread Exception in thread IntMsgThrNetStatThr:
:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 299, in check_internal_messages
    self._target(*self._args, **self._kwargs)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 267, in check_network_status
    self._loop_check_status(
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 223, in _loop_check_status
    self._loop_check_status(
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/wandb_run.py", line 223, in _loop_check_status
    local_handle = request()
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 743, in deliver_internal_messages
    return self._deliver_internal_messages(internal_message)
    local_handle = request()
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 481, in _deliver_internal_messages
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/interface/interface.py", line 735, in deliver_network_status
    return self._deliver_record(record)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 428, in _deliver_record
    return self._deliver_network_status(status)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 475, in _deliver_network_status
    handle = mailbox._deliver_record(record, interface=self)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
    return self._deliver_record(record)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/interface/interface_shared.py", line 428, in _deliver_record
    interface._publish(record)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    handle = mailbox._deliver_record(record, interface=self)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
    self._sock_client.send_record_publish(record)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    interface._publish(record)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self.send_server_request(server_req)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._sock_client.send_record_publish(record)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self._send_message(msg)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self.send_server_request(server_req)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._sendall_with_error_handle(header + data)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    self._send_message(msg)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
    self._sendall_with_error_handle(header + data)
  File "/home/celso/projects/venvs/LightningPrototype/lib/python3.8/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe



Additional Files

No response

Environment

  • WandB version: 0.15.12
  • OS: Ubuntu 20.04
  • Python version: Python 3.8.10
  • Versions of relevant libraries: pytorch-lightning==2.0.9

Additional Context

No response

celsofranssa avatar Oct 12 '23 21:10 celsofranssa

Hi @celsofranssa,

I'll be happy to assist you with this inquiry. We have received it and will investigate and get back to you with updates.

Regards, Carlo Argel

Carlo-Argel avatar Oct 13 '23 00:10 Carlo-Argel

I'm running into the same issue; waiting for a fix.

callanwu avatar Oct 14 '23 14:10 callanwu

Hi @celsofranssa

Reaching back from the support team. The error you are encountering is a bit troubling. Could you please provide the following?

  1. Code snippet of how you are setting the job type
  2. Link to your run workspace if available
  3. The debug.log and debug-internal.log files of the failing run. These are located in your wandb working directory under wandb//logs

Regards, Carlo Argel

Carlo-Argel avatar Oct 19 '23 00:10 Carlo-Argel

Hi @celsofranssa

Reaching back from the support team; I just want to follow up on the items requested in the thread above.

Thank you, Carlo Argel

Carlo-Argel avatar Oct 24 '23 05:10 Carlo-Argel

Hi @celsofranssa, since we have not heard back from you, we are going to close this request. If you would like to re-open the conversation, please let us know!

Carlo-Argel avatar Oct 25 '23 01:10 Carlo-Argel

Hello, I've run into the "Broken Pipe" issue this week as well, training with Ludwig. I'll rerun my training with wandb enabled today, and paste the backtrace once I have it.

karmi avatar Nov 15 '23 07:11 karmi

Sorry for the delay, @Carlo-Argel. I've run the code again and have the full backtrace.

My use case is finetuning a Mistral 7B model with the ludwig package. I'm using the built-in callback for Ludwig like this:

import logging

# import paths inferred from the traceback below
from ludwig.api import LudwigModel
from ludwig.contribs.wandb import WandbCallback

model = LudwigModel(
    config=fine_tuning_config,
    logging_level=logging.INFO,
    callbacks=[WandbCallback()],  # FIXME: This fails with "socket closed"
)

Training initializes correctly, and so does W&B. But after a couple of steps, the process crashes with wandb.sdk.lib.mailbox.MailboxError: transport failed, raised from this line: https://github.com/wandb/wandb/blob/57d16d88197378c4803e63a7bcd5debe74bc8f33/wandb/sdk/lib/mailbox.py#L281 The initial call comes from the Ludwig codebase (the wandb.init call in ludwig/contribs/wandb.py, visible in the backtrace). The full backtrace is below.

Full backtrace
wandb.on_train_init() called...
Finishing last run (ID:e41q9vzv) before initializing another...
Problem at: /usr/local/lib/python3.10/dist-packages/ludwig/contribs/wandb.py 41 on_train_init
Training:   4%|▍         | 62/1625 [04:33<1:54:53,  4.41s/it, loss=nan]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 2211, in _atexit_cleanup
    self._on_finish()
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 2451, in _on_finish
    _ = exit_handle.wait(timeout=-1, on_progress=self._on_progress_exit)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/mailbox.py", line 298, in wait
    on_probe(probe_handle)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 2415, in _on_probe_exit
    result = handle.wait(timeout=0, release=False)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/mailbox.py", line 281, in wait
    raise MailboxError("transport failed")
wandb.sdk.lib.mailbox.MailboxError: transport failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 1166, in init
    run = wi.init()
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py", line 599, in init
    latest_run.finish()
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 420, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 361, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 1953, in finish
    return self._finish(exit_code, quiet)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 1968, in _finish
    self._atexit_cleanup(exit_code=exit_code)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 2222, in _atexit_cleanup
    self._backend.cleanup()
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/backend/backend.py", line 232, in cleanup
    self.interface.join()
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py", line 531, in join
    super().join()
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface.py", line 703, in join
    _ = self._communicate_shutdown()
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py", line 428, in _communicate_shutdown
    _ = self._communicate(record)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py", line 294, in _communicate
    return self._communicate_async(rec, local=local).get(timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_sock.py", line 60, in _communicate_async
    future = self._router.send_and_receive(rec, local=local)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/router.py", line 94, in send_and_receive
    self._send_message(rec)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/router_sock.py", line 36, in _send_message
    self._sock_client.send_record_communicate(record)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 216, in send_record_communicate
    self.send_server_request(server_req)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
---------------------------------------------------------------------------
MailboxError                              Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py in _atexit_cleanup(self, exit_code)
  2210         try:
-> 2211             self._on_finish()
  2212         except KeyboardInterrupt as ki:

25 frames
/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py in _on_finish(self)
  2450 
-> 2451         _ = exit_handle.wait(timeout=-1, on_progress=self._on_progress_exit)
  2452 

/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/mailbox.py in wait(self, timeout, on_probe, on_progress, release, cancel)
    297             if on_probe and probe_handle:
--> 298                 on_probe(probe_handle)
    299             if on_progress and progress_handle:

/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py in _on_probe_exit(self, probe_handle)
  2414         if handle:
-> 2415             result = handle.wait(timeout=0, release=False)
  2416             if not result:

/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/mailbox.py in wait(self, timeout, on_probe, on_progress, release, cancel)
    280                 if self._interface._transport_keepalive_failed():
--> 281                     raise MailboxError("transport failed")
    282 

MailboxError: transport failed

During handling of the above exception, another exception occurred:

BrokenPipeError                           Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py in init(job_type, dir, config, project, entity, reinit, tags, group, name, notes, magic, config_exclude_keys, config_include_keys, anonymous, mode, allow_val_change, resume, force, tensorboard, sync_tensorboard, monitor_gym, save_code, id, settings)
  1165         try:
-> 1166             run = wi.init()
  1167             except_exit = wi.settings._except_exit

/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py in init(self)
    598 
--> 599                 latest_run.finish()
    600 

/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py in wrapper(self, *args, **kwargs)
    419 
--> 420             return func(self, *args, **kwargs)
    421 

/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py in wrapper(self, *args, **kwargs)
    360                 cls._is_attaching = ""
--> 361             return func(self, *args, **kwargs)
    362 

/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py in finish(self, exit_code, quiet)
  1952         """
-> 1953         return self._finish(exit_code, quiet)
  1954 

/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py in _finish(self, exit_code, quiet)
  1967 
-> 1968         self._atexit_cleanup(exit_code=exit_code)
  1969         if self._wl and len(self._wl._global_run_stack) > 0:

/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py in _atexit_cleanup(self, exit_code)
  2221             self._console_stop()
-> 2222             self._backend.cleanup()
  2223             logger.error("Problem finishing run", exc_info=e)

/usr/local/lib/python3.10/dist-packages/wandb/sdk/backend/backend.py in cleanup(self)
    231         if self.interface:
--> 232             self.interface.join()
    233         if self.wandb_process:

/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py in join(self)
    530     def join(self) -> None:
--> 531         super().join()
    532 

/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface.py in join(self)
    702             return
--> 703         _ = self._communicate_shutdown()
    704 

/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py in _communicate_shutdown(self)
    427         record = self._make_record(request=request)
--> 428         _ = self._communicate(record)
    429 

/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py in _communicate(self, rec, timeout, local)
    293     ) -> Optional[pb.Result]:
--> 294         return self._communicate_async(rec, local=local).get(timeout=timeout)
    295 

/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_sock.py in _communicate_async(self, rec, local)
    59             raise Exception("The wandb backend process has shutdown")
---> 60         future = self._router.send_and_receive(rec, local=local)
    61         return future

/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/router.py in send_and_receive(self, rec, local)
    93 
---> 94         self._send_message(rec)
    95 

/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/router_sock.py in _send_message(self, record)
    35     def _send_message(self, record: "pb.Record") -> None:
---> 36         self._sock_client.send_record_communicate(record)

/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py in send_record_communicate(self, record)
    215         server_req.record_communicate.CopyFrom(record)
--> 216         self.send_server_request(server_req)
    217 

/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py in send_server_request(self, msg)
    154     def send_server_request(self, msg: Any) -> None:
--> 155         self._send_message(msg)
    156 

/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py in _send_message(self, msg)
    151         with self._lock:
--> 152             self._sendall_with_error_handle(header + data)
    153 

/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py in _sendall_with_error_handle(self, data)
    129             try:
--> 130                 sent = self._sock.send(data)
    131                 # sent equal to 0 indicates a closed socket

BrokenPipeError: [Errno 32] Broken pipe

The above exception was the direct cause of the following exception:

Error                                     Traceback (most recent call last)
<timed exec> in <module>

/usr/local/lib/python3.10/dist-packages/ludwig/api.py in train(self, dataset, training_set, validation_set, test_set, training_set_metadata, data_format, experiment_name, model_name, model_resume_path, skip_save_training_description, skip_save_training_statistics, skip_save_model, skip_save_progress, skip_save_log, skip_save_processed_input, output_directory, random_seed, **kwargs)
    590 
    591             for callback in self.callbacks:
--> 592                 callback.on_train_init(
    593                     base_config=self._user_config,
    594                     experiment_directory=output_directory,

/usr/local/lib/python3.10/dist-packages/ludwig/contribs/wandb.py in on_train_init(self, base_config, experiment_directory, experiment_name, model_name, output_directory, resume_directory)
    39     ):
    40         logger.info("wandb.on_train_init() called...")
---> 41         wandb.init(
    42             project=os.getenv("WANDB_PROJECT", experiment_name),
    43             name=model_name,

/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_init.py in init(job_type, dir, config, project, entity, reinit, tags, group, name, notes, magic, config_exclude_keys, config_include_keys, anonymous, mode, allow_val_change, resume, force, tensorboard, sync_tensorboard, monitor_gym, save_code, id, settings)
  1202                 wandb.termerror("Abnormal program exit")
  1203                 os._exit(1)
-> 1204             raise Error("An unexpected error occurred") from error_seen
  1205     return run

Error: An unexpected error occurred

The wandb package version is the latest at the time of writing, 0.16.0.

Is there some other detail I can provide? I can enable access to the run at wandb.ai.

karmi avatar Nov 16 '23 13:11 karmi

Because of this error, my progress of nearly 3 days was lost mid-run, and now I have to start again. Is there any workaround or handler for this, or should I just store the progress locally?

JainitBITW avatar Nov 29 '23 13:11 JainitBITW

@Carlo-Argel
Hi, I have the same issue "BrokenPipeError: [Errno 32] Broken pipe"

Obrepal avatar Dec 03 '23 20:12 Obrepal

Similar error. Current SDK version is 0.16.1

2023-12-30 21:07:18,143 INFO MainThread:3025339 [wandb_init.py:init():614] starting backend
2023-12-30 21:07:18,143 INFO MainThread:3025339 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2023-12-30 21:07:18,153 INFO MainThread:3025339 [backend.py:ensure_launched():206] starting backend process...
2023-12-30 21:07:18,156 INFO MainThread:3025339 [backend.py:ensure_launched():211] started backend process with pid: 3027702
2023-12-30 21:07:18,157 INFO MainThread:3025339 [wandb_init.py:init():624] backend started and connected
2023-12-30 21:07:18,163 INFO MainThread:3025339 [wandb_init.py:init():716] updated telemetry
2023-12-30 21:07:18,165 INFO MainThread:3025339 [wandb_init.py:init():749] communicating run to backend with 90.0 second timeout
2023-12-30 21:07:23,329 ERROR MainThread:3025339 [wandb_init.py:init():1188] transport failed
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1170, in init
    run = wi.init()
  File "/home/user/.local/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 752, in init
    result = run_init_handle.wait(
  File "/home/user/.local/lib/python3.10/site-packages/wandb/sdk/lib/mailbox.py", line 281, in wait
    raise MailboxError("transport failed")
wandb.sdk.lib.mailbox.MailboxError: transport failed

thohag avatar Dec 30 '23 21:12 thohag

I randomly get this error every now and then during training too; I assume it is related to networking issues. It would be great if internal W&B issues didn't result in the run crashing.

{'loss': 55842.3375, 'learning_rate': 0.00019748020497041964, 'epoch': 1.32}
{'loss': 55757.2188, 'learning_rate': 0.0001974556426587668, 'epoch': 1.33}
  9%|██████████▋                                                                                                             | 2587/29100 [1:22:29<526:27:27, 71.48s/it]
Exception in thread NetStatThr:
Traceback (most recent call last):
  File "/home/user/mambaforge/envs/slt/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/home/user/mambaforge/envs/slt/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/mambaforge/envs/slt/lib/python3.11/site-packages/wandb/sdk/wandb_run.py", line 268, in check_network_status
    self._loop_check_status(
  File "/home/user/mambaforge/envs/slt/lib/python3.11/site-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status
    local_handle = request()
                   ^^^^^^^^^
  File "/home/user/mambaforge/envs/slt/lib/python3.11/site-packages/wandb/sdk/interface/interface.py", line 792, in deliver_network_status
    return self._deliver_network_status(status)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/mambaforge/envs/slt/lib/python3.11/site-packages/wandb/sdk/interface/interface_shared.py", line 500, in _deliver_network_status
    return self._deliver_record(record)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/mambaforge/envs/slt/lib/python3.11/site-packages/wandb/sdk/interface/interface_shared.py", line 449, in _deliver_record
    handle = mailbox._deliver_record(record, interface=self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/mambaforge/envs/slt/lib/python3.11/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
    interface._publish(record)
  File "/home/user/mambaforge/envs/slt/lib/python3.11/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/home/user/mambaforge/envs/slt/lib/python3.11/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/home/user/mambaforge/envs/slt/lib/python3.11/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/home/user/mambaforge/envs/slt/lib/python3.11/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/home/user/mambaforge/envs/slt/lib/python3.11/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
           ^^^^^^^^^^^^^^^^^^^^^
BrokenPipeError: [Errno 32] Broken pipe
Killed
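
A possible workaround when the network is flaky (just a suggestion, not an official fix from the W&B team): log in offline mode and upload afterwards with the wandb sync CLI, so a dropped connection cannot take the training process down. A minimal sketch, with a placeholder project name:

import wandb

# Sketch: log locally only; nothing is sent over the network during training.
# "my-project" is a placeholder name.
run = wandb.init(project="my-project", mode="offline")
run.log({"loss": 0.1})
run.finish()
# Later, upload the offline run from the CLI:
#   wandb sync wandb/offline-run-*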

nicholasdehnen avatar Jan 28 '24 20:01 nicholasdehnen

Getting the same issue exactly as directly above

gil2rok avatar Feb 27 '24 01:02 gil2rok

I'm also getting the same issue as above; however, the run in wandb is finished (no errors) and all the data is in there too. I'm not sure how to interpret this error. It somehow also hangs the program indefinitely (although I'm not sure yet whether this is a wandb issue or my queuing script).

I guess something important to note is that the queuing script I'm using makes a copy of my workspace in a temporary folder to be able to do multiprocessing. I'm not sure if this has any interaction with wandb, especially given that 95% of the runs finish normally.

Edit: I've noticed that calling run.finish() helps, and so far I've had no more errors like that. The hanging also does not seem to be related to wandb. I'm not sure yet.
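
For reference, the explicit-finish pattern looks roughly like this (a minimal sketch; the project name and logged values are placeholders, not from this thread):

import wandb

# Sketch: explicitly finish the run so the wandb service knows the run is done
# even if the script exits early. The project name is a placeholder.
run = wandb.init(project="my-project")
try:
    run.log({"queued_job_status": "running"})  # ... your actual work goes here ...
finally:
    run.finish()  # mark the run finished and let the service flush and shut down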

henrypickler avatar Feb 27 '24 12:02 henrypickler

Similar issue at the end of the process, but it does not affect anything else.

EhanW avatar Mar 07 '24 18:03 EhanW

Any updates, @Carlo-Argel? This issue is killing the joy of using wandb, and it is bizarre that it is taking so long to fix.

Obrepal avatar Mar 10 '24 14:03 Obrepal

same issue here

bolak92 avatar Mar 19 '24 15:03 bolak92

I think one should check whether there is enough disk space for the wandb logs, and also check the internet connection. In my case it was a disk space issue.
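
If you want to rule out disk space quickly, something like this prints the free space on the filesystem where wandb writes its local files (a sketch; by default wandb writes under ./wandb in the working directory, so adjust the path if you set WANDB_DIR):

import shutil

# Sketch: check free space on the filesystem holding the local wandb files.
usage = shutil.disk_usage(".")
print(f"free: {usage.free / 1e9:.1f} GB of {usage.total / 1e9:.1f} GB")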

JainitBITW avatar Mar 19 '24 15:03 JainitBITW

There is enough space; it only happens with sweeps.

bolak92 avatar Mar 21 '24 15:03 bolak92

Same issue. I will no longer use wandb.

iFe1er avatar Apr 07 '24 06:04 iFe1er

There is enough space; it only happens with sweeps.

@bolak92 could you please provide a small reproduction so we can help fix it? Thanks, and sorry that you are experiencing this issue.

kptkin avatar Apr 07 '24 18:04 kptkin

I think I might have found the reason. This error occurs when the script is modified while the program is running.

EhanW avatar Apr 08 '24 13:04 EhanW

I think I might have found the reason. This error occurs when the script is modified while the program is running.

Oh interesting, yeah, that could put the system in a bad state. In any case, if you want us to look into it further, providing a reproduction will be the best way to help here.

kptkin avatar Apr 08 '24 18:04 kptkin

Hi guys,

I think I found a temporary solution.

For me it wasn't a disk space issue.

But indeed, I believe it was the fact that the processes did not stop after the script had finished running. wandb seems to have had a way of finishing those processes automatically, but now that doesn't work properly.

What helped me was killing the processes both on the CPU, obtained from top (straightforward killing by PID),

but also the not-so-obvious GPU processes (something I only learned because of this issue :) ).

  1. List the processes on the GPU: lsof /dev/nvidia*

  2. Make sure that all the processes are yours and not some other user's.

  3. If they are all yours and you don't need them (i.e. you want to kill them all): lsof /dev/nvidia* | awk '{print $2}' | xargs -I {} kill -9 {}

Now rerunning the script no longer produces the error for me. I hope that helps.

bolak92 avatar Apr 08 '24 19:04 bolak92

[quoting bolak92's workaround from the comment above]

Thanks for sharing your experience. The service is supposed to finish all active runs when the main script completes (we use an atexit hook to trigger this); if that is not happening, it is a bug. To make sure a run is always marked as completed, add run.finish() at the end of the run's usage.

Do you think you could provide a reproduction of your script? I'm interested to learn why the NVIDIA processes are still running and how we can handle these cases better.

kptkin avatar Apr 08 '24 21:04 kptkin

Also running into this error mid-training -- any ideas on how to solve it?

wandb: Find logs at: ./wandb/run-20240414_095858-2t5b2bol/logs
Exception in thread NetStatThr:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
Exception in thread ChkStopThr:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 268, in check_network_status
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._loop_check_status(
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status
    self._target(*self._args, **self._kwargs)
    local_handle = request()
    local_handle = request()
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 286, in check_stop_status
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface.py", line 848, in deliver_network_status
    self._loop_check_status(
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status
    local_handle = request()
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface.py", line 840, in deliver_stop_status
    return self._deliver_network_status(status)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py", line 510, in _deliver_network_status
    return self._deliver_stop_status(status)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py", line 494, in _deliver_stop_status
    return self._deliver_record(record)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py", line 459, in _deliver_record
    return self._deliver_record(record)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py", line 459, in _deliver_record
    handle = mailbox._deliver_record(record, interface=self)
    handle = mailbox._deliver_record(record, interface=self)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
    interface._publish(record)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    interface._publish(record)
    self._sock_client.send_record_publish(record)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self._sock_client.send_record_publish(record)
    self.send_server_request(server_req)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self.send_server_request(server_req)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
    self._send_message(msg)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
    self._sendall_with_error_handle(header + data)
BrokenPipeError: [Errno 32] Broken pipe
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe

pminervini avatar Apr 15 '24 06:04 pminervini

@pminervini could you make sure that you don't have the WANDB_SERVICE environment variable set, and kill any leftover wandb-service processes? You can run ps -ef | grep wandb-service to get a list of these processes. Once you clean your environment, everything should work.
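
A quick way to run that check from Python (a sketch, assuming a Linux machine with pgrep available; inspect the list before killing anything):

import os
import subprocess

# Clear the variable for this process so a new run does not reuse a stale service.
os.environ.pop("WANDB_SERVICE", None)

# List any leftover wandb-service processes.
result = subprocess.run(["pgrep", "-af", "wandb-service"],
                        capture_output=True, text=True)
print(result.stdout or "no wandb-service processes found")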

kptkin avatar Apr 17 '24 05:04 kptkin

@pminervini could you make sure that you don't have the WANDB_SERVICE environment variable set, and kill any leftover wandb-service processes? You can run ps -ef | grep wandb-service to get a list of these processes. Once you clean your environment, everything should work.

@kptkin how does just adding wandb.finish() at the end of the script look to you? That would take care of dangling processes, no?

pminervini avatar Apr 17 '24 06:04 pminervini

I found that when I launch a Weights & Biases (wandb) service with simulated data alone, there are no issues with the service communication. However, when I simultaneously load a model on the GPU, the wandb service immediately stops (with the same error as mentioned above). If I restart the wandb service at this point, I notice that it will automatically stop after a fixed period (about 1 minute). Could this be related to the load balancer?

Training /chenhui/zhangwuhan/stage2/trained_model/qwen1.5_7b_5_5e-5_2_1k_plugin 0
  0%|                                                                                                 | 0/14272 [00:06<?, ?it/s, train_loss=4.85]2024-04-27 17:18:41,774 - DEBUG - Successfully logged to WandB
  0%|                                                                                       | 1/14272 [00:11<25:10:01,  6.35s/it, train_loss=2.9]2024-04-27 17:18:47,054 - DEBUG - Successfully logged to WandB
  0%|                                                                                      | 2/14272 [00:16<22:40:19,  5.72s/it, train_loss=2.25]2024-04-27 17:18:52,202 - DEBUG - Successfully logged to WandB
  0%|                                                                                      | 3/14272 [00:22<21:38:07,  5.46s/it, train_loss=2.09]2024-04-27 17:18:57,456 - DEBUG - Successfully logged to WandB
  0%|                                                                                      | 4/14272 [00:27<21:18:49,  5.38s/it, train_loss=2.02]2024-04-27 17:19:02,601 - DEBUG - Successfully logged to WandB
  0%|                                                                                      | 5/14272 [00:32<20:58:47,  5.29s/it, train_loss=1.88]2024-04-27 17:19:07,853 - DEBUG - Successfully logged to WandB
  0%|                                                                                      | 6/14272 [00:38<25:34:01,  6.45s/it, train_loss=1.87]
Traceback (most recent call last):
  File "/chenhui/zhangwuhan/stage2/FastChat/fastchat/train/qwen1.5_7b_5_5e-5_2_1k_plugin.py", line 290, in <module>
    main()
  File "/chenhui/zhangwuhan/stage2/FastChat/fastchat/train/qwen1.5_7b_5_5e-5_2_1k_plugin.py", line 283, in main
    accelerator.log({"train_loss":  loss.item()}, step=batch_idx)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 602, in _inner
    return PartialState().on_main_process(function)(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2267, in log
    tracker.log(values, step=step, **log_kwargs.get(tracker.name, {}))
  File "/usr/local/lib/python3.10/dist-packages/accelerate/tracking.py", line 86, in execute_on_main_process
    return function(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/tracking.py", line 333, in log
    self.run.log(values, step=step, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 420, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 371, in wrapper_fn
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 361, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 1838, in log
    self._log(data=data, step=step, commit=commit)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 1602, in _log
    self._partial_history_callback(data, step, commit)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/wandb_run.py", line 1474, in _partial_history_callback
    self._backend.interface.publish_partial_history(
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface.py", line 602, in publish_partial_history
    self._publish_partial_history(partial_history)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_shared.py", line 89, in _publish_partial_history
    self._publish(rec)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/usr/local/lib/python3.10/dist-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
BrokenPipeError: [Errno 32] Broken pipe
wandb: While tearing down the service manager. The following error has occurred: [Errno 32] Broken pipe
2024-04-27 17:19:14,144 - DEBUG - Starting new HTTPS connection (1): o151352.ingest.sentry.io:443
2024-04-27 17:19:16,144 - DEBUG - Attempting to acquire lock 140175619989824 on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-04-27 17:19:16,144 - DEBUG - Lock 140175619989824 acquired on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-04-27 17:19:16,145 - DEBUG - Attempting to release lock 140175619989824 on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-04-27 17:19:16,145 - DEBUG - Lock 140175619989824 released on /root/.triton/autotune/Fp16Matmul_2d_kernel.pickle.lock
2024-04-27 17:19:16,145 - DEBUG - Attempting to acquire lock 140175619989824 on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
2024-04-27 17:19:16,145 - DEBUG - Lock 140175619989824 acquired on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
2024-04-27 17:19:16,146 - DEBUG - Attempting to release lock 140175619989824 on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock
2024-04-27 17:19:16,146 - DEBUG - Lock 140175619989824 released on /root/.triton/autotune/Fp16Matmul_4d_kernel.pickle.lock

I hope the wandb team can resolve this issue as soon as possible.

endNone avatar Apr 27 '24 17:04 endNone

@pminervini could you make sure that you don't have the WANDB_SERVICE environment variable set, and kill any leftover wandb-service processes? You can run ps -ef | grep wandb-service to get a list of these processes. Once you clean your environment, everything should work.

@kptkin how does just adding wandb.finish() at the end of the script look to you? That would take care of dangling processes, no?

wandb.finish() will close the last active run, and it should also close the service if that was the only run using it; but if the service is left behind, the current process is not aware of it. We have a fix that should handle this, but it is not merged yet; hopefully it will land in one of the upcoming releases.
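
For reference, wandb.init() can also be used as a context manager, which calls finish() automatically when the block exits; a minimal sketch with placeholder names, equivalent to calling run.finish() in a finally block:

import wandb

# Sketch: leaving the with-block calls run.finish(), even if the loop raises,
# so the background wandb service can shut down cleanly.
# "my-project" and the logged values are placeholders.
with wandb.init(project="my-project") as run:
    for step in range(100):
        run.log({"loss": 1.0 / (step + 1)})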

kptkin avatar Apr 29 '24 16:04 kptkin

[quoting the training log and traceback from @endNone's comment above]

I hope the official can resolve this issue as soon as possible.

@endNone from your description it seems like a bad interaction between your integration and wandb. If you could provide a reproduction script, it would help us actually debug it and find a fix. Thanks for all the information above. A broken pipe just means that the service is not reachable; if you provide a repro, we can figure out what went wrong.

kptkin avatar Apr 29 '24 16:04 kptkin