
Error running on ddp (can't pickle local object 'SummaryTopic') with comet logger (pytorch lightning)

Open dvirginz opened this issue 5 years ago • 7 comments

I have the following problem running in ddp mode with CometLogger. When I detach the logger from the trainer (i.e. deleting logger=comet_logger), the code runs.

Exception has occurred: AttributeError
Can't pickle local object 'SummaryTopic.__init__.<locals>.default'
  File "/path/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "/path/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/path/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/path/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/path/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/path/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/path/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn
    process.start()
  File "/path/site-packages/pytorch_lightning/trainer/trainer.py", line 751, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
  File "/repo_path/train.py", line 158, in main_train
    trainer.fit(model)
  File "/repo_path/train.py", line 72, in main
    main_train(model_class_pointer, hyperparams, logger)
  File "/repo_path/train.py", line 167, in <module>
    main()
  File "/path/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/path/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/path/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/path/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/path/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)

dvirginz avatar May 03 '20 21:05 dvirginz

@dvirginz This is likely just the first of a few errors that you would get if you pickle an Experiment object. Now the question is: why is ddp attempting to pickle the Experiment when the logger is enabled?

We're looking into this.

dsblank avatar May 04 '20 19:05 dsblank

The way PL works in a multi-node environment is to pickle the whole model and send it to each process. I assume that when you run in ddp environments the Experiment object stays on the main process? If so, how can you log per-process events (for example, specific samples that happened to end up on GPU x that we would like to log)?

Anyhow, thanks for looking into that 🙂

dvirginz avatar May 04 '20 19:05 dvirginz

@dvirginz Experiments don't all have to be on the main process. For example, we have our own runner which can be used with or without our Optimizer, and you can coordinate which GPU a process runs on. For more details on that see: https://www.comet.ml/docs/command-line/#comet-optimize

In general, the Experiment object wasn't designed to be pickled, as it holds live connections to the server. You can work around that, though: for example, there is ExistingExperiment, but you can also simply delay creating the Experiment until you are inside the worker process or thread.
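
To illustrate the "delay creation" idea outside of Lightning, here is a minimal sketch using plain torch.multiprocessing (the worker function and its arguments are just illustrative): because the Experiment is constructed inside the spawned process, it never needs to be pickled.

import torch.multiprocessing as mp
from comet_ml import Experiment

def worker(rank, api_key):
    # Created inside the child process, so mp.spawn never has to pickle it
    experiment = Experiment(api_key=api_key, project_name="proj")
    experiment.log_metric("rank", rank)
    experiment.end()

if __name__ == "__main__":
    mp.spawn(worker, nprocs=3, args=("YOUR_API_KEY",))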

Let me know if you would like more information on any of the above.

dsblank avatar May 07 '20 16:05 dsblank

So if I understand correctly, you suggest not using the Comet/PyTorch Lightning logging infrastructure, but instead, for example, passing the ExistingExperiment key as a hyperparameter, creating the logging object inside the process, and taking care of the logging myself (instead of the automatic logging via the CometLogger object)?
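
Something along these lines, roughly (the hparams fields, compute_loss, and the choice of hook are placeholders, not working code yet):

from comet_ml import ExistingExperiment
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def __init__(self, hparams):
        super().__init__()
        self.hparams = hparams      # carries hparams.api_key and hparams.experiment_key
        self._experiment = None     # created lazily, so it is never pickled with the model

    @property
    def experiment(self):
        if self._experiment is None:
            # Re-attach, from inside each ddp process, to the experiment created up front
            self._experiment = ExistingExperiment(
                api_key=self.hparams.api_key,
                previous_experiment=self.hparams.experiment_key,
            )
        return self._experiment

    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # placeholder for the real loss computation
        self.experiment.log_metric("train_loss", loss.item(), step=self.global_step)
        return {"loss": loss}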

I'll try that and update.

dvirginz avatar May 08 '20 04:05 dvirginz

@dvirginz Could you post your Trainer parameters?

dsblank avatar Jun 09 '20 17:06 dsblank

Hi @dsblank, we talked about it on Slack too. It happens even with the simplest configuration, as each distributed process creates a new experiment.

trainer = pl.Trainer(
    logger=CometLogger(
        api_key="ID",
        workspace="User",
        project_name="proj",
        experiment_key=experiment_id,
    ),
    auto_select_gpus=True,
    gpus=3,
    distributed_backend="ddp",
)

(I'm still on 3.1.8)

dvirginz avatar Jun 10 '20 04:06 dvirginz

@dvirginz Update: we are preparing some documentation to help wrestle with this issue. Here is one example with some hints that may help: https://github.com/comet-ml/comet-examples/tree/master/pytorch#using-cometml-with-pytorch-parallel-data-training
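
In the meantime, one common pattern for data-parallel training (not necessarily the exact one the guide above uses) is to let only rank 0 talk to Comet, assuming the launcher exposes the process rank through the LOCAL_RANK environment variable as torch.distributed.launch does; get_experiment here is just an illustrative helper:

import os
from comet_ml import Experiment

def get_experiment(api_key, project_name="proj"):
    # Only the rank-0 process creates a live Experiment; other ranks skip logging
    rank = int(os.environ.get("LOCAL_RANK", 0))
    if rank == 0:
        return Experiment(api_key=api_key, project_name=project_name)
    return None

experiment = get_experiment(api_key="YOUR_API_KEY")
if experiment is not None:
    experiment.log_parameter("world_size", 3)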

dsblank avatar Aug 28 '20 15:08 dsblank

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Nov 09 '23 21:11 github-actions[bot]

This issue was closed because it has been stalled for 5 days with no activity.

github-actions[bot] avatar Nov 14 '23 21:11 github-actions[bot]