taming-transformers
Can't train with multiple GPUs
Thank you for making this code public.
I want to train VQGAN from scratch, but I get an error like the one below:
python3 main.py --base configs/imagenet_vqgan.yaml -t True --gpus 0,1,2,3,4,5,6,7
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 208, in _wrapped_function
result = function(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/plugins/training_type/ddp_spawn.py", line 236, in new_process
results = trainer.run_stage()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1289, in run_stage
return self._run_train()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1306, in _run_train
self._pre_training_routine()
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1301, in _pre_training_routine
self.call_hook("on_pretrain_routine_start")
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/trainer.py", line 1495, in call_hook
callback_fx(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/pytorch_lightning/trainer/callback_hook.py", line 148, in on_pretrain_routine_start
callback.on_pretrain_routine_start(self, self.lightning_module)
File "/****/Projects/taming-transformers/main.py", line 200, in on_pretrain_routine_start
OmegaConf.save(self.config,
File "/usr/local/lib/python3.8/dist-packages/omegaconf/omegaconf.py", line 220, in save
with io.open(os.path.abspath(f), "w", encoding="utf-8") as file:
FileNotFoundError: [Errno 2] No such file or directory: '/***/Projects/taming-transformers/logs/2022-10-18T19-25-09_imagenet_vqgan/configs/2022-10-18T19-25-09-project.yaml'
I had this trouble too. I fixed it by deleting that file in the logs directory — or better, delete everything in the logs directory — and trying again.
I had the same issue on Slurm. I guess the problem is caused by the following code: https://github.com/CompVis/taming-transformers/blob/3ba01b241669f5ade541ce990f7650a3b8f65318/main.py#L206-L215
After the log directory is created, another process moves it into "child_runs", so OmegaConf cannot create a new file in a directory that no longer exists.
I just removed that code, and training seems to run.
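An alternative to removing the move altogether might be to re-create the parent directory right before saving, so the save survives another process relocating the logs. A minimal sketch (the `safe_save` helper and its callback argument are my own names, not part of main.py; in the repo you would wrap the `OmegaConf.save` call in `on_pretrain_routine_start` this way):

```python
import os

def safe_save(save_fn, path):
    # Re-create the parent directory if another process has moved or
    # removed it (e.g. the "child_runs" relocation in main.py), then
    # delegate the actual write to the supplied callback.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    save_fn(path)

# Hypothetical usage inside on_pretrain_routine_start, replacing the
# bare OmegaConf.save(self.config, path) call:
#   safe_save(lambda p: OmegaConf.save(self.config, p), path)
```

This only papers over the race between ranks; restricting the directory move to rank zero would address the root cause, but the one-liner above is enough to avoid the `FileNotFoundError`.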