Error running on ddp (can't pickle local object 'SummaryTopic') with Comet logger (PyTorch Lightning)
I have the following problem when running in ddp mode with CometLogger.
When I detach the logger from the trainer (i.e., delete `logger=comet_logger`), the code runs.
Exception has occurred: AttributeError
Can't pickle local object 'SummaryTopic.__init__.<locals>.default'
File "/path/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
File "/path/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/path/multiprocessing/popen_fork.py", line 20, in __init__
self._launch(process_obj)
File "/path/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/path/multiprocessing/context.py", line 284, in _Popen
return Popen(process_obj)
File "/path/multiprocessing/process.py", line 112, in start
self._popen = self._Popen(self)
File "/path/site-packages/torch/multiprocessing/spawn.py", line 162, in spawn
process.start()
File "/path/site-packages/pytorch_lightning/trainer/trainer.py", line 751, in fit
mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))
File "/repo_path/train.py", line 158, in main_train
trainer.fit(model)
File "/repo_path/train.py", line 72, in main
main_train(model_class_pointer, hyperparams, logger)
File "/repo_path/train.py", line 167, in <module>
main()
File "/path/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/path/runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "/path/runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "/path/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/path/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
@dvirginz This is likely just the first of a few errors that you would get if you pickle an Experiment object. Now the question is: why is ddp attempting to pickle the Experiment when the logger is enabled?
We're looking into this.
The way PL works in a multi-node environment is to pickle and send the whole model. I assume that when you run on ddp the Experiment object stays on the main process? If so, how can you log per-process events (for example, specific samples that happened to end up on GPU x that we would like to log)?
Anyhow, thanks for looking into that 🙂
@dvirginz Experiments don't all have to be on the main process. For example, we have our own runner which can be used with or without our Optimizer, and you can coordinate which GPU a process runs on. For more details on that see: https://www.comet.ml/docs/command-line/#comet-optimize
In general, the Experiment object wasn't designed to be pickled, as it has live connections to the server. You can work around that, though: for example, there is ExistingExperiment, and you can also just delay creating the Experiment until you are inside the process or thread.
Let me know if you would like more information on any of the above.
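To make the second suggestion concrete, here is a rough sketch (not a tested recipe) of delaying the Experiment until you are inside the spawned process, using torch.multiprocessing directly rather than Lightning:

```python
import comet_ml
import torch.multiprocessing as mp

def worker(rank, experiment_key):
    # Only the key (a plain string) crossed the process boundary; the live
    # server connection is opened here, inside the spawned process.
    exp = comet_ml.ExistingExperiment(previous_experiment=experiment_key)
    exp.log_metric("rank_seen", rank)
    exp.end()

if __name__ == "__main__":
    # Create the experiment once in the launching process and pass only its key on.
    # (api_key is picked up from the COMET_API_KEY env var or .comet.config)
    parent = comet_ml.Experiment(project_name="proj")
    key = parent.get_key()
    parent.end()
    mp.spawn(worker, nprocs=2, args=(key,))
```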
So if I understand correctly, you suggest not using the CometLogger integration for PyTorch Lightning, but instead, for example, passing the ExistingExperiment id as a hyperparameter, creating the logging object inside the thread, and taking care of the logging myself (instead of the automatic logging via the CometLogger object)?
I'll try that and update.
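Roughly what I have in mind, so we're on the same page: the experiment key travels as an ordinary hyperparameter, and each DDP process builds its own ExistingExperiment lazily inside the module, so nothing unpicklable is attached to the model when it gets spawned (just a sketch; compute_loss is a placeholder of mine):

```python
import comet_ml
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    def __init__(self, experiment_key):
        super().__init__()
        self.experiment_key = experiment_key
        self._experiment = None  # built per process, never pickled with the model

    @property
    def experiment(self):
        # Lazily attach to the shared run the first time this process logs something.
        if self._experiment is None:
            self._experiment = comet_ml.ExistingExperiment(
                previous_experiment=self.experiment_key
            )
        return self._experiment

    def training_step(self, batch, batch_idx):
        loss = self.compute_loss(batch)  # placeholder helper, defined elsewhere
        self.experiment.log_metric("train_loss", loss.item(), step=self.global_step)
        return loss
```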
@dvirginz Could you post your Trainer parameters?
Hi @dsblank, we talked about it on Slack too. It happens even with the simplest configuration, as each distributed process creates a new experiment.
```python
trainer = pl.Trainer(
    logger=CometLogger(
        api_key="ID",
        workspace="User",
        project_name="proj",
        experiment_key=experiment_id,
    ),
    auto_select_gpus=True,
    gpus=3,
    distributed_backend="ddp",
)
```
(I'm still on 3.1.8)
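A tiny callback like the one below (just a diagnostic sketch of mine, nothing official) should show whether every rank ends up with its own experiment key:

```python
import os
from pytorch_lightning.callbacks import Callback

class ExperimentKeyAudit(Callback):
    """Print which Comet experiment key each DDP process ended up with."""

    def on_train_start(self, trainer, pl_module):
        if trainer.logger is not None:
            rank = getattr(trainer, "global_rank", "?")  # attribute name varies by PL version
            print(f"pid={os.getpid()} rank={rank} "
                  f"experiment_key={trainer.logger.experiment.get_key()}")
```

Passing callbacks=[ExperimentKeyAudit()] to the Trainer prints one line per process, so duplicate experiments are easy to spot.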
@dvirginz Update: we are preparing some documentation to help wrestle with this issue. Here is one example with some hints that may help: https://github.com/comet-ml/comet-examples/tree/master/pytorch#using-cometml-with-pytorch-parallel-data-training
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stalled for 5 days with no activity.