CompVis/latent-diffusion, Issue #85

Something error with the training on imagenet

Open GuoxingY opened this issue 2 years ago • 6 comments

There seems to be an error in the training code. When I train the model on ImageNet with cin-ldm-vq-f8.yaml, the following error occurs.

Traceback (most recent call last):
  File "main.py", line 722, in <module>
    trainer.fit(model, data)
  File "/root/miniforge3/envs/guoxing/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/root/miniforge3/envs/guoxing/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 719, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/root/miniforge3/envs/guoxing/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/root/miniforge3/envs/guoxing/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/root/miniforge3/envs/guoxing/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1234, in _run
    results = self._run_stage()
  File "/root/miniforge3/envs/guoxing/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1321, in _run_stage
    return self._run_train()
  File "/root/miniforge3/envs/guoxing/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1351, in _run_train
    self.fit_loop.run()
  File "/root/miniforge3/envs/guoxing/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 205, in run
    self.on_advance_end()
  File "/root/miniforge3/envs/guoxing/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 297, in on_advance_end
    self.trainer._call_callback_hooks("on_train_epoch_end")
  File "/root/miniforge3/envs/guoxing/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1634, in _call_callback_hooks
    fn(self, self.lightning_module, *args, **kwargs)
TypeError: on_train_epoch_end() missing 1 required positional argument: 'outputs'

I have not modified the code and ran it with the default settings. Could you help me solve this problem?

GuoxingY, Jun 08 '22 16:06

Hi, have you solved this problem? Could you tell me the specific steps? Thank you very much!

RachelWang122, Jun 10 '22 03:06

I have no idea about this. I simply skipped this function by modifying the pytorch_lightning code so that the model can be trained for more than one epoch. Skipping it seems to have no influence on the training results (though I am not sure, because I only trained for 5 epochs).

GuoxingY, Jun 10 '22 04:06

Hello, would it be convenient to connect on WeChat to discuss this pytorch-lightning code?

RachelWang122, Jun 10 '22 08:06

Look at line 1634 of lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py, which can be found in the root of your Anaconda environment. It is quite easy to add a check there, something like: if hook_name != 'on_train_epoch_end': fn(self, self.lightning_module, *args, **kwargs)
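
To make the effect of that guard concrete, here is a minimal, dependency-free sketch. It does not use the real pytorch_lightning code: `call_callback_hooks`, `TimingCallback`, `CheckpointCallback`, and the `skipped_hooks` parameter are all hypothetical stand-ins, and the dispatcher only imitates the call shape seen in the traceback (the hook receives the trainer and the module, nothing else).

```python
class TimingCallback:
    """Toy callback with the outdated hook signature (requires `outputs`)."""

    def on_train_epoch_end(self, trainer, pl_module, outputs):
        return "timing"


class CheckpointCallback:
    """Toy callback whose hook matches what the dispatcher passes."""

    def on_train_epoch_end(self, trainer, pl_module):
        return "checkpoint"


def call_callback_hooks(callbacks, hook_name, trainer=None, pl_module=None,
                        skipped_hooks=("on_train_epoch_end",)):
    """Call `hook_name` on every callback unless it is listed in `skipped_hooks`."""
    results = []
    for callback in callbacks:
        fn = getattr(callback, hook_name, None)
        # The guard from the comment above: skip the hook by name.
        if callable(fn) and hook_name not in skipped_hooks:
            results.append(fn(trainer, pl_module))
    return results


callbacks = [TimingCallback(), CheckpointCallback()]

# With the guard, the TypeError disappears, but no callback's hook runs at all.
skipped = call_callback_hooks(callbacks, "on_train_epoch_end")
```

In this toy model the guard avoids the `TypeError`, but it also silences `CheckpointCallback`, whose hook was fine. Under the same assumption, a guard inside the library would skip the hook for every callback, not only the one with the mismatched signature.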

GuoxingY, Jun 10 '22 09:06

OK, thanks a lot.

RachelWang122, Jun 10 '22 12:06

I ran into the same issue, but after modifying the code it worked. In the main.py file there is a class CUDACallback whose on_train_epoch_end method raises the error. The method declares an outputs parameter but never uses it, so I deleted it from the argument list. After that it worked, and my models are training.
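
The signature change described above can be sketched without installing anything. The classes below are hypothetical stand-ins, not the actual `CUDACallback` from main.py, and `call_hook` only imitates the call in the traceback, which passes the trainer and the module but no `outputs`. The `outputs=None` variant is an assumption added here for illustration; it is not something proposed in the thread.

```python
class OldCallback:
    """Stand-in for the hook as described: `outputs` is required but unused."""

    def on_train_epoch_end(self, trainer, pl_module, outputs):
        return "epoch ended"


class FixedCallback:
    """The fix from the comment above: the unused `outputs` parameter is removed."""

    def on_train_epoch_end(self, trainer, pl_module):
        return "epoch ended"


class CompatCallback:
    """Assumed variant: `outputs` is optional, so both call styles are accepted."""

    def on_train_epoch_end(self, trainer, pl_module, outputs=None):
        return "epoch ended"


def call_hook(callback, trainer=None, pl_module=None):
    # Imitates the failing call: only the trainer and the module are passed.
    return callback.on_train_epoch_end(trainer, pl_module)


def hook_works(callback):
    """Return True if the hook can be called without `outputs`."""
    try:
        call_hook(callback)
        return True
    except TypeError:
        return False
```

Here `hook_works(OldCallback())` is False because the call raises the same kind of `TypeError` as in the report, while `FixedCallback` and `CompatCallback` both accept the call. Unlike patching site-packages, this change stays inside the project's own main.py and leaves the hooks of other callbacks untouched.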

naveedunjum, Jun 23 '22 17:06