
gRPC broken pipe when using multiple loggers in PL with remote Aim server

Open · mikel-brostrom opened this issue 4 years ago · 19 comments

🐛 Bug

My trainer setup:


# imports assumed for this snippet
import pytorch_lightning as pl
from aim.pytorch_lightning import AimLogger

# log locally:
tb_logger = pl.loggers.TensorBoardLogger(
    save_dir=str(EXP_ROOT),  # EXP_ROOT is an experiments directory defined elsewhere
    name='my_model',
    default_hp_metric=False,
)

# log remotely on the Aim server
aim_logger = AimLogger(
    repo='aim://127.0.0.1:53800',
    experiment='my_model',
    train_metric_prefix='train_',
    test_metric_prefix='test_',
    val_metric_prefix='val_',
)

trainer = pl.Trainer(
    logger=[tb_logger, aim_logger],
)

My training and validation step:

def training_step(self, batch, batch_idx):
    t0, t1, mask_gt = batch
    mask_pred = self.forward(t0, t1)

    loss = self.criterion(mask_pred, mask_gt.long())
    metrics = {
        'train_loss': loss,
    }

    self.log_dict(metrics, logger=False, on_step=False, on_epoch=True, prog_bar=True, rank_zero_only=True)
    # self.logger[1] is the AimLogger from the trainer setup above
    self.logger[1].log_metrics({'train_loss1': loss, 'train_loss1': loss})

    return loss


def validation_step(self, batch, batch_idx):
    t0, t1, mask_gt = batch
    mask_pred = self.forward(t0, t1)

    loss = self.criterion(mask_pred, mask_gt.long())
    metrics = {
        'val_loss': loss,
    }

    self.log_dict(metrics, logger=False, on_step=False, on_epoch=True, prog_bar=True, rank_zero_only=True)
    # direct call to the Aim logger again
    self.logger[1].log_metrics({'train_loss1': loss, 'train_loss1': loss})

    return loss

crashes:

0207 14:42:36.177987824   17936 chttp2_transport.cc:1111]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"                                        | 4/6 [00:01<00:00,  4.07it/s]
E0207 14:42:36.178162323   17936 client_channel.cc:647]      chand=0x8fc9e40: Illegal keepalive throttling value 9223372036854775807
Traceback (most recent call last):
  File "/.../lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 685, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/.../lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 777, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/.../lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1199, in _run
    self._dispatch()
  File "/.../lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1279, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/.../lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/.../lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
    return self._run_train()
  File "/.../lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 1319, in _run_train
    self.fit_loop.run()
  File "/.../lib/python3.6/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/.../lib/python3.6/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
    self.epoch_loop.run(data_fetcher)
  File "/.../python3.6/site-packages/pytorch_lightning/loops/base.py", line 146, in run
    self.on_advance_end()
  File "/.../python3.6/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 242, in on_advance_end
    self._run_validation()
  File "/.../lib/python3.6/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 337, in _run_validation
    self.val_loop.run()
  File "/.../lib/python3.6/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/.../lib/python3.6/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 110, in advance
    dl_outputs = self.epoch_loop.run(dataloader, dataloader_idx, dl_max_batches, self.num_dataloaders)
  File "/.../lib/python3.6/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/.../lib/python3.6/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 122, in advance
    output = self._evaluation_step(batch, batch_idx, dataloader_idx)
  File "/.../lib/python3.6/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 217, in _evaluation_step
    output = self.trainer.accelerator.validation_step(step_kwargs)
  File "/.../lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 236, in validation_step
    return self.training_type_plugin.validation_step(*step_kwargs.values())
  File "/.../lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 444, in validation_step
    return self.model(*args, **kwargs)
  File "/.../lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/.../lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/.../lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/..../lib/python3.6/site-packages/pytorch_lightning/overrides/base.py", line 92, in forward
    output = self.module.validation_step(*inputs, **kwargs)
  File "/.../architectures/tanet_arch/tanet.py", line 135, in validation_step
    self.logger[1].log_metrics(metrics, self.current_epoch)
  File "/.../lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py", line 50, in wrapped_fn
    return fn(*args, **kwargs)
  File "/.../lib/python3.6/site-packages/aim/sdk/adapters/pytorch_lightning.py", line 99, in log_metrics
    self.experiment.track(v, name=name, context=context)
  File "/.../lib/python3.6/site-packages/aim/sdk/run.py", line 412, in track
    self._track_impl(value, track_time, name, step, epoch, context=context)
  File "/.../lib/python3.6/site-packages/aim/sdk/run.py", line 445, in _track_impl
    self.meta_run_tree['traces', ctx.idx, name, 'last'] = val
  File "/.../lib/python3.6/site-packages/aim/storage/treeviewproxy.py", line 187, in __setitem__
    self.tree[self.absolute_path(path)] = value
  File "/.../lib/python3.6/site-packages/aim/storage/treeviewproxy.py", line 83, in __setitem__
    self._rpc_client.run_instruction(self._handler, '__setitem__', (path, value))
  File "/.../lib/python3.6/site-packages/aim/ext/transport/client.py", line 66, in run_instruction
    status_msg = next(resp)
  File "/.../lib/python3.6/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/.../lib/python3.6/site-packages/grpc/_channel.py", line 826, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNAVAILABLE
        details = "Broken pipe"
        debug_error_string = "{"created":"@1644244955.352788922","description":"Error received from peer ipv4:127.0.0.1:53800","file":"src/core/lib/surface/call.cc","file_line":1063,"grpc_message":"Broken pipe","grpc_status":14}"
>

During handling of the above exception, another exception occurred:
...

Environment

  • Aim Version: 3.5.1
  • Python version: 3.6.9
  • pip version: 21.3.0
  • OS: Linux 18.04

Additional context

mikel-brostrom avatar Feb 07 '22 14:02 mikel-brostrom

Hi @mikel-brostrom. Looks like this is a blocker; I'm going to give it high priority. Could you confirm, please?

mihran113 avatar Feb 07 '22 15:02 mihran113

Yes, it is a blocker. Could you reproduce the error? Let me know if you need more information

mikel-brostrom avatar Feb 07 '22 15:02 mikel-brostrom

> Yes, it is a blocker. Could you reproduce the error? Let me know if you need more information

Unfortunately I couldn't reproduce it. Does this happen only when using multiple loggers, or does it happen even when using Aim as the only logger?

mihran113 avatar Feb 07 '22 16:02 mihran113

I just tried it out; it happens with Aim as the only logger as well when logging remotely.

mikel-brostrom avatar Feb 08 '22 13:02 mikel-brostrom

I tried this example: https://github.com/aimhubio/aim/blob/main/examples/pytorch_lightning_track.py, adding the repo argument to AimLogger as repo='aim://xx.xx.xx.xx:53800', where xx.xx.xx.xx is the machine where the Aim server is running.

mikel-brostrom avatar Feb 08 '22 13:02 mikel-brostrom

Strange. I tried the same setup with the same example, and it works fine for me. Can I ask whether you have any other clients logging at the same time, or is it just one client?

mihran113 avatar Feb 08 '22 14:02 mihran113

Hi @mikel-brostrom. I still cannot reproduce this issue. Would you mind joining the Aim community Slack so we can debug this more interactively?

mihran113 avatar Feb 15 '22 11:02 mihran113

Hi, we encountered the same problem during the integration of AimLogger into our PyTorch Lightning based codebase. We could "fix" the problem by setting torch.multiprocessing.set_start_method('spawn', force=True) before creating the DataLoader instances, to prevent PL from using fork when multiple workers are used. So this is probably related to forking in Python.
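
For reference, a minimal sketch of where that call would go in a typical PL setup. train_dataset, val_dataset, model, and trainer below are placeholders standing in for objects from our codebase, not code from the original report:

import torch.multiprocessing
from torch.utils.data import DataLoader

# Force the 'spawn' start method before any DataLoader is created, so worker
# processes are spawned instead of forked and the gRPC channel held by the
# remote Aim client is never inherited by a forked child.
torch.multiprocessing.set_start_method('spawn', force=True)

# train_dataset, val_dataset, model and trainer are defined elsewhere (placeholders here)
train_loader = DataLoader(train_dataset, batch_size=8, num_workers=4)
val_loader = DataLoader(val_dataset, batch_size=8, num_workers=4)

trainer.fit(model, train_loader, val_loader)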

dallmann-uniwue avatar Mar 08 '22 16:03 dallmann-uniwue

Hi @dallmann-uniwue! Thanks for your input; this might help us finally find the root cause of the issue. Would you mind sharing a minimal example where the above-mentioned setting fixes the problem? That would be a huge help 🙏 Also, we would like to reproduce this with the exact same environment as you have, so having the package versions would be helpful as well.

alberttorosyan avatar Mar 08 '22 18:03 alberttorosyan

Hi @alberttorosyan, I'm not sure when I will find the time to design a minimal working example for this issue, but I can supply some specifics about the environment right away:

OS: Fedora 35
Python: 3.9.9 (managed by pyenv)

torch==1.9.1
pytorch-lightning==1.4.7
aim==3.6.3

I hope this already helps. If you need any more specifics, please let me know.

dallmann-uniwue avatar Mar 08 '22 20:03 dallmann-uniwue

Hi @dallmann-uniwue @mikel-brostrom

We would like to move forward with this, yet we still cannot reproduce it nor understand where the issue is. Could you please set the gRPC environment variable below so that it logs every step it goes through? We might then be able to track down the issue on our end.

export GRPC_TRACE=all
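
For example (train.py stands in for the actual training entry point; depending on the gRPC build, GRPC_VERBOSITY may also need to be raised for the traces to actually show up):

# gRPC traces are written to stderr, so redirect it to a file
GRPC_TRACE=all GRPC_VERBOSITY=DEBUG python train.py 2> grpc_trace.log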

We will wait for your logs. Thanks!

devfox-se avatar Mar 22 '22 11:03 devfox-se

Hi, I will hopefully find some time tonight to give it a try!

dallmann-uniwue avatar Mar 23 '22 10:03 dallmann-uniwue

Hi, I ran my code without the above fix, with the mentioned environment variable set. Unfortunately, it does not output any more info than before (I've pruned my own logs from the output): grpc-output.txt

I noticed that I was not able to consistently reproduce the issue. Today, it failed one out of three times with the above message. I also tried setting export GRPC_VERBOSITY=INFO, which resulted in a large amount of logging data: grpc-output-debug.zip

I couldn't make sense of the output, but I hope this helps.

dallmann-uniwue avatar Mar 23 '22 20:03 dallmann-uniwue

Hi, did you have time to take a look at the logs, and were you able to identify the issue?

dallmann-uniwue avatar Mar 31 '22 12:03 dallmann-uniwue

@dallmann-uniwue sorry for the late response; we have been checking lots of things to reproduce this. What we have tried so far:

  • Simulating packet loss / high ping on the remote server
  • Experimenting with gRPC keepalive properties
  • Running processes with fork <- we believe this could really be the issue, but we still couldn't reproduce it yet

We came across this gRPC guide on fork support: https://github.com/grpc/grpc/blob/master/doc/fork_support.md
Apparently users can face issues when using gRPC resources in a forked process, so it's worth trying to set the environment variables below and seeing whether or not the issue is gone. We would appreciate your feedback on this.

export GRPC_ENABLE_FORK_SUPPORT=true
export GRPC_POLL_STRATEGY=poll

Besides this, we would appreciate any sample test case that fails for you, for our further investigation. Thanks!

devfox-se avatar Apr 14 '22 10:04 devfox-se

@devfox-se thank you for still investigating this issue! I've tried the environment variables you suggested (with fork support enabled) and can report that they fix this issue in my case.

However, there seems to be a drawback: the training gets stuck when spinning up the DataLoader workers if num_workers > 6. Below that I did not encounter any issues (I tried num_workers = 0, 1, 2, 4, 6, 8).

dallmann-uniwue avatar Apr 19 '22 08:04 dallmann-uniwue

I'm having the same issue. I tried the suggested env vars and got the following errors:

[rank: 0] Received SIGTERM: 15
pid 714713 killing 714318 with 15
pid 714713 killing 714319 with 15
pid 714713 killing 714320 with 15

lminer avatar Apr 02 '23 21:04 lminer

I am also facing the same issue as @lminer when using the suggested environment variables. I can also confirm that this issue happens only when we set num_workers > 0 in the dataloader.

vikigenius avatar Jun 22 '23 03:06 vikigenius

@mikel-brostrom, with Aim 3.19 we've completely changed the tracking server implementation. It is not using gRPC anymore. Could you please try it out and see if the new tracking server works fine in this case?
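
For anyone retesting, a minimal sketch, assuming the aim server CLI command and the aim:// repo URI scheme are unchanged in 3.19 (the host below is a placeholder; 53800 is the port from the original report):

# On the machine hosting the repository (shell), assuming the CLI is unchanged:
#   aim server --host 0.0.0.0 --port 53800

from aim.pytorch_lightning import AimLogger

# Point the Lightning logger at the new tracking server as before.
aim_logger = AimLogger(
    repo='aim://<server-ip>:53800',  # placeholder host, same URI scheme as before
    experiment='my_model',
)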

alberttorosyan avatar Apr 22 '24 08:04 alberttorosyan