DiffBIR Failure during training

I started training on an A100 GPU with about 2,000 training images. It completed about 900 epochs, then the process ended abruptly without any error message. I can see several checkpoint step files. I also tried to restart training by setting the resume path to the folder containing the step files, but that fails with the error "{folder} is a directory". Any help would be highly appreciated.

Thanks

Sep 20 '23 04:09 kirrukirru

Hello! For the first issue, training stops once it reaches the maximum number of training steps; you can change that limit by setting max_steps in your training configuration file. For the second issue, the resume parameter should be set to the path of a single step file, not to the folder containing the step files.
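If it helps, here is a rough Python sketch of both points. It is not part of DiffBIR: the helper names, the `*.ckpt` pattern, and the idea that each step file has its step number in the file name are assumptions for illustration, so adjust them to whatever your checkpoint folder actually contains. The first helper estimates how many steps a given number of epochs corresponds to (ignoring gradient accumulation), which you can compare against max_steps; the second picks the step file with the largest step number so you can pass a file path, not a folder, to resume.

```python
import math
import re
from pathlib import Path


def steps_for_epochs(num_images: int, batch_size: int, epochs: int) -> int:
    """Steps taken after `epochs` full passes, assuming one step per batch."""
    return math.ceil(num_images / batch_size) * epochs


def latest_step_file(folder: str, pattern: str = "*.ckpt") -> Path:
    """Return the file in `folder` whose name carries the largest step number.

    Assumes (hypothetically) that the last integer in the file name is the step.
    """
    def step_of(path: Path) -> int:
        numbers = re.findall(r"\d+", path.stem)
        return int(numbers[-1]) if numbers else -1

    candidates = [p for p in Path(folder).glob(pattern) if p.is_file()]
    if not candidates:
        raise FileNotFoundError(f"no files matching {pattern!r} in {folder}")
    return max(candidates, key=step_of)
```

For example, with 2,000 images and a hypothetical batch size of 16, `steps_for_epochs(2000, 16, 900)` gives 112,500 steps. If that is close to the max_steps in your configuration, the run most likely finished normally. The path returned by `latest_step_file(...)` is the value to put in resume.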

Sep 21 '23 11:09 0x3f3f3f3fun

Hello, is there a way to train the model on a GPU with 10GB of VRAM, for example with something like the tiling used during inference? I am unable to complete Stage 1 training due to insufficient memory.

Oct 09 '23 21:10 AldenBoby

> Hello, is there a way to train the model on a GPU with 10GB of VRAM, for example with something like the tiling used during inference? I am unable to complete Stage 1 training due to insufficient memory.

Have you done it?

May 22 '24 12:05 Chantec