DiffBIR

Failure during training

Open kirrukirru opened this issue 1 year ago • 3 comments

I started training on an A100 GPU with about 2000 training images. It completed about 900 epochs, then the process ended abruptly without any errors. I can see several checkpoint step files. I also tried to restart the training by setting the resume path to the folder containing the step files, but that gives the error "{folder} is a directory". Any help would be highly appreciated.

Thanks

kirrukirru Sep 20 '23 04:09
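
The "{folder} is a directory" message suggests the resume option may expect the path to a single checkpoint file rather than the folder that holds them. That is an inference from the error text, not something confirmed for this repo. A minimal sketch for picking the newest step file out of such a folder is below. The helper name `latest_checkpoint`, the `.ckpt` extension, and the assumption that the step number is the last integer in the file name are all hypothetical and should be adjusted to whatever the step files are actually called.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(folder: str, suffix: str = ".ckpt") -> Optional[Path]:
    """Return the file in `folder` whose name carries the highest step number.

    Assumptions (not taken from the project): checkpoints end in `suffix`
    and the step count is the last run of digits in the file name.
    Returns None when no matching file is found.
    """
    best: Optional[Path] = None
    best_step = -1
    for path in Path(folder).iterdir():
        # Skip sub-folders and unrelated files.
        if not path.is_file() or path.suffix != suffix:
            continue
        numbers = re.findall(r"\d+", path.stem)
        if not numbers:
            continue
        step = int(numbers[-1])
        if step > best_step:
            best, best_step = path, step
    return best
```

The returned path (a file, not a directory) is what would then be set as the resume path, assuming the training config does accept a checkpoint file there.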