DiffBIR

Failure during training

Open kirrukirru opened this issue 2 years ago • 3 comments

I started training on an A100 GPU with about 2000 training images. It completed about 900 epochs, then the process ended abruptly without any errors. I can see several checkpoint step files. I also tried to restart the training by setting the resume path to the folder containing the step files, but that gives the error "{folder} is a directory". Any help would be highly appreciated.

Thanks

kirrukirru • Sep 20 '23 04:09

Hello! For the first issue, training stops once it reaches the maximum number of training steps, which is why the process ended without an error. You can change this limit by setting max_steps in your training configuration file. For the second issue, the resume parameter should be set to the path of a single step file rather than the folder containing the step files.
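As a rough illustration of both points (a sketch only, not code from the DiffBIR repository): the first helper picks one step file out of a folder so that its path, rather than the folder, can be given to resume; the second estimates how many epochs a given max_steps covers. The names latest_step_file and epochs_covered, the ".ckpt" suffix, and the idea that the step number is the last integer in the file name are all assumptions, so adjust them to whatever your step files actually look like.

```python
import math
import re
from pathlib import Path
from typing import Optional


def latest_step_file(folder: str, suffix: str = ".ckpt") -> Optional[Path]:
    """Return the file in `folder` whose name carries the largest step number.

    Assumption: the step number is the last integer in the file name
    (for example a name such as "step=1999.ckpt"). Returns None if no
    matching file is found.
    """
    best, best_step = None, -1
    for path in Path(folder).iterdir():
        # Skip sub-directories and files with a different extension.
        if not path.is_file() or path.suffix != suffix:
            continue
        numbers = re.findall(r"\d+", path.stem)
        if not numbers:
            continue
        step = int(numbers[-1])
        if step > best_step:
            best, best_step = path, step
    return best


def epochs_covered(max_steps: int, num_images: int, batch_size: int) -> float:
    """Estimate how many epochs are completed before max_steps is reached.

    Assumption: one step processes one batch, and gradient accumulation
    and multi-GPU sharding are ignored.
    """
    steps_per_epoch = math.ceil(num_images / batch_size)
    return max_steps / steps_per_epoch
```

The path returned by latest_step_file is what would go into resume. If epochs_covered, called with your own max_steps and batch size, comes out near the number of epochs you observed, the run most likely ended because it hit max_steps rather than because it crashed.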

0x3f3f3f3fun • Sep 21 '23 11:09

Hello, is there a way to train the model on a GPU with 10GB of VRAM, similar to how tiling is used during inference? I am unable to complete Stage 1 training due to insufficient memory.

AldenBoby • Oct 09 '23 21:10

> Hello, is there a way to train the model on a GPU with 10GB of VRAM, similar to how tiling is used during inference? I am unable to complete Stage 1 training due to insufficient memory.

Did you manage to get this working?

Chantec • May 22 '24 12:05