gRPC broken pipe when using multiple loggers on PL with remote Aim server
🐛 Bug
My trainer:

```python
# log locally:
tb_logger = pl.loggers.TensorBoardLogger(
    save_dir=str(EXP_ROOT),
    name='my_model',
    default_hp_metric=False,
)
# log remotely on the Aim server:
aim_logger = AimLogger(
    repo='aim://127.0.0.1:53800',
    experiment='my_model',
    train_metric_prefix='train_',
    test_metric_prefix='test_',
    val_metric_prefix='val_',
)
trainer = pl.Trainer(
    logger=[tb_logger, aim_logger],
)
```
My training and validation steps:

```python
def training_step(self, batch, batch_idx):
    t0, t1, mask_gt = batch
    mask_pred = self.forward(t0, t1)
    loss = self.criterion(mask_pred, mask_gt.long())
    metrics = {
        'train_loss': loss,
    }
    self.log_dict(metrics, logger=False, on_step=False, on_epoch=True,
                  prog_bar=True, rank_zero_only=True)
    # log directly to the Aim logger (second entry in the logger list)
    self.logger[1].log_metrics({'train_loss1': loss})
    return loss

def validation_step(self, batch, batch_idx):
    t0, t1, mask_gt = batch
    mask_pred = self.forward(t0, t1)
    loss = self.criterion(mask_pred, mask_gt.long())
    metrics = {
        'val_loss': loss,
    }
    self.log_dict(metrics, logger=False, on_step=False, on_epoch=True,
                  prog_bar=True, rank_zero_only=True)
    self.logger[1].log_metrics({'val_loss1': loss})
    return loss
```
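Note that under DDP every worker process runs these steps, so each process calls `self.logger[1].log_metrics(...)` and can end up with its own connection to the remote Aim server. For reference, such direct logger calls can be guarded so that only rank 0 talks to the remote server. This is a minimal, self-contained sketch: the `track_remote` helper, the `FakeLogger` class, and the `LOCAL_RANK` check are illustrative stand-ins (PL ships a real `pytorch_lightning.utilities.rank_zero_only` decorator), not confirmed parts of the fix:

```python
import os

def rank_zero_only(fn):
    # Minimal stand-in for pytorch_lightning.utilities.rank_zero_only,
    # included so the sketch runs on its own.
    def wrapped(*args, **kwargs):
        # DDP launchers typically export LOCAL_RANK for each worker.
        if int(os.environ.get("LOCAL_RANK", "0")) == 0:
            return fn(*args, **kwargs)
        return None
    return wrapped

@rank_zero_only
def track_remote(logger, metrics):
    # Forwards metrics to the (remote) logger only on rank 0.
    logger.log_metrics(metrics)

class FakeLogger:
    # Tiny fake used for demonstration; records what gets logged.
    def __init__(self):
        self.logged = []
    def log_metrics(self, metrics):
        self.logged.append(metrics)

logger = FakeLogger()
os.environ["LOCAL_RANK"] = "0"
track_remote(logger, {"val_loss": 0.5})   # goes through
os.environ["LOCAL_RANK"] = "1"
track_remote(logger, {"val_loss": 0.5})   # skipped on non-zero rank
print(len(logger.logged))  # → 1
```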
It crashes with:

```
0207 14:42:36.177987824 17936 chttp2_transport.cc:1111] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings" | 4/6 [00:01<00:00, 4.07it/s]
E0207 14:42:36.178162323 17936 client_channel.cc:647] chand=0x8fc9e40: Illegal keepalive throttling value 9223372036854775807
Traceback (most recent call last):
  File "/.../lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/.../lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/.../lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/.../lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/.../lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/.../lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/.../lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
    self.fit_loop.run()
  File "/.../lib/python3.6/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/.../lib/python3.6/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
    self.epoch_loop.run(data_fetcher)
  File "/.../python3.6/site-packages/pytorch_lightning/loops/base.py", line 146, in run
    self.on_advance_end()
  File "/.../python3.6/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 242, in on_advance_end
    self._run_validation()
  File "/.../lib/python3.6/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 337, in _run_validation
    self.val_loop.run()
  File "/.../lib/python3.6/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/.../lib/python3.6/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
    dl_outputs = self.epoch_loop.run(dataloader, dataloader_idx, dl_max_batches, self.num_dataloaders)
  File "/.../lib/python3.6/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/.../lib/python3.6/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 122, in advance
    output = self._evaluation_step(batch, batch_idx, dataloader_idx)
  File "/.../lib/python3.6/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 217, in _evaluation_step
    output = self.trainer.accelerator.validation_step(step_kwargs)
  File "/.../lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 236, in validation_step
    return self.training_type_plugin.validation_step(*step_kwargs.values())
  File "/.../lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 444, in validation_step
    return self.model(*args, **kwargs)
  File "/.../lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/.../lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/.../lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/..../lib/python3.6/site-packages/pytorch_lightning/overrides/base.py", line 92, in forward
    output = self.module.validation_step(*inputs, **kwargs)
  File "/.../architectures/tanet_arch/tanet.py", line 135, in validation_step
    self.logger[1].log_metrics(metrics, self.current_epoch)
  File "/.../lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py", line 50, in wrapped_fn
    return fn(*args, **kwargs)
  File "/.../lib/python3.6/site-packages/aim/sdk/adapters/pytorch_lightning.py", line 99, in log_metrics
    self.experiment.track(v, name=name, context=context)
  File "/.../lib/python3.6/site-packages/aim/sdk/run.py", line 412, in track
    self._track_impl(value, track_time, name, step, epoch, context=context)
  File "/.../lib/python3.6/site-packages/aim/sdk/run.py", line 445, in _track_impl
    self.meta_run_tree['traces', ctx.idx, name, 'last'] = val
  File "/.../lib/python3.6/site-packages/aim/storage/treeviewproxy.py", line 187, in __setitem__
    self.tree[self.absolute_path(path)] = value
  File "/.../lib/python3.6/site-packages/aim/storage/treeviewproxy.py", line 83, in __setitem__
    self._rpc_client.run_instruction(self._handler, '__setitem__', (path, value))
  File "/.../lib/python3.6/site-packages/aim/ext/transport/client.py", line 66, in run_instruction
    status_msg = next(resp)
  File "/.../lib/python3.6/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/.../lib/python3.6/site-packages/grpc/_channel.py", line 826, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Broken pipe"
	debug_error_string = "{"created":"@1644244955.352788922","description":"Error received from peer ipv4:127.0.0.1:53800","file":"src/core/lib/surface/call.cc","file_line":1063,"grpc_message":"Broken pipe","grpc_status":14}"
>
```
During handling of the above exception, another exception occurred:
...
Environment
- Aim Version: 3.5.1
- Python version: 3.6.9
- pip version: 21.3.0
- OS: Ubuntu 18.04
Additional context
Hi @mikel-brostrom. Looks like this is a blocker, I'm going to give it a high priority. Could you confirm, please?
Yes, it is a blocker. Could you reproduce the error? Let me know if you need more information
Unfortunately, I couldn't reproduce it. Does this happen only when using multiple loggers, or does it happen even when using Aim as the only logger?
I just tried it out, it does happen with Aim as the only logger as well when logging remotely.
I tried this example: https://github.com/aimhubio/aim/blob/main/examples/pytorch_lightning_track.py
Adding the repo argument in AimLogger as repo='aim://xx.xx.xx.xx:53800', where xx.xx.xx.xx is the machine where the Aim server is running.
Strange. I tried the same setup with the same example, and it works fine for me. Can I ask whether you have any other clients logging at the same time, or is it just one client?
Hi @mikel-brostrom. I still cannot reproduce this issue. Would you mind joining the Aim community Slack so we can debug this more interactively?
Hi,
We encountered the same problem while integrating AimLogger into our PyTorch Lightning based codebase. We could "fix" the problem by calling `torch.multiprocessing.set_start_method('spawn', force=True)` before creating the DataLoader instances, to prevent PL from using fork when multiple workers are used. So this is probably related to forking in Python.
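The workaround described above can be sketched as follows. The try/except is only there to keep the snippet self-contained: `torch.multiprocessing` mirrors the stdlib `multiprocessing` API, so the stdlib module demonstrates the same mechanism when torch is unavailable.

```python
# Force the 'spawn' start method before any DataLoader is created, so
# worker processes are not fork()ed from a parent that already holds
# live gRPC channels to the Aim server.
try:
    import torch.multiprocessing as mp  # preferred inside a PyTorch codebase
except ImportError:
    import multiprocessing as mp  # same set_start_method() API

mp.set_start_method('spawn', force=True)
print(mp.get_start_method())  # → spawn
```

With 'spawn', each worker starts from a fresh interpreter instead of inheriting the parent's open sockets and gRPC state.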
Hi @dallmann-uniwue! Thanks for your input; this might help us finally find the root cause of the issue. Would you mind sharing a minimal example where the above-mentioned setting fixes the problem? That would be a huge help 🙏 We would also like to reproduce it with the exact same environment as yours, so having the package versions would be helpful as well.
Hi @alberttorosyan, I'm not sure when I will find the time to design a minimal working example for this issue, but I can supply some specifics about the environment right away:
- OS: Fedora 35
- Python: 3.9.9 (managed by pyenv)
- torch==1.9.1
- pytorch-lightning==1.4.7
- aim==3.6.3
I hope this already helps. If you need any more specifics, please let me know.
Hi @dallmann-uniwue @mikel-brostrom
We would like to move forward with this, but we still cannot reproduce it or understand where the issue is. Could you set the gRPC environment variable below so that gRPC logs every step it goes through? We might then be able to track down the issue on our end.
```shell
export GRPC_TRACE=all
```
Will wait for your logs, Thanks!
Hi, I will hopefully find some time tonight to give it a try!
Hi, I ran my code without the above fix but with the mentioned environment variable set. Unfortunately, it does not output any more information than before (I've pruned my own logs from the output): grpc-output.txt
I noticed that I was not able to consistently reproduce the issue. Today it failed one out of three times with the above message. I also tried setting `export GRPC_VERBOSITY=INFO`, which resulted in large amounts of logging data.
grpc-output-debug.zip
I couldn't make sense of the output, but I hope this helps.
Hi, did you have time to take a look at the logs and could you identify the issue?
@dallmann-uniwue sorry for the late response - we have been checking lots of things to reproduce this. What we have tried so far:
- Simulating packet loss / high ping on the remote server
- Experimenting with `keepalive` gRPC properties
- Running processes in `fork` mode <- we believe this can really be the issue, but we still couldn't reproduce it

We came across this gRPC guide: https://github.com/grpc/grpc/blob/master/doc/fork_support.md. Apparently users can face issues when using gRPC resources in a forked process, so it's worth trying to set the env variables below and seeing whether or not the issue goes away. We would appreciate your feedback on this.
```shell
export GRPC_ENABLE_FORK_SUPPORT=true
export GRPC_POLL_STRATEGY=poll
```
Besides this, we would appreciate any sample test case that fails for you, for our further investigation. Thanks!
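For anyone who prefers setting these from code rather than the shell: the variables must be present in the environment before `grpc` is imported (directly or indirectly, e.g. by Aim's remote client). A hedged sketch, assuming the entry-point script is the first thing to run:

```python
import os

# Set at the very top of the training entry point, before `import grpc`
# or any library that pulls it in; gRPC reads these once at startup.
os.environ["GRPC_ENABLE_FORK_SUPPORT"] = "true"
os.environ["GRPC_POLL_STRATEGY"] = "poll"

print(os.environ["GRPC_ENABLE_FORK_SUPPORT"])  # → true
```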
@devfox-se thank you for still investigating this issue!
I've tried the environment variables you suggested with fork enabled and can report that it fixes this issue in my case.
However, there seems to be a drawback: training gets stuck while spinning up the DataLoader workers if parallelism is set to `num_workers > 6`. Below that, I did not encounter any issues. I tried `num_workers = 0, 1, 2, 4, 6, 8`.
I'm having the same issue. I tried the suggested env vars and got the following errors:

```
[rank: 0] Received SIGTERM: 15
pid 714713 killing 714318 with 15
pid 714713 killing 714319 with 15
pid 714713 killing 714320 with 15
```
I am also facing the same issue as @lminer when using the suggested environment variables. I can also confirm that this issue happens only when we set `num_workers > 0` in the DataLoader.
@mikel-brostrom, with Aim 3.19 we've completely changed the tracking server implementation. It is not using gRPC anymore. Could you please try it out and see whether the new tracking server works fine in this case?