Failure during training
I started training on an A100 GPU with about 2000 training images. It completed about 900 epochs, then the process ended abruptly without any errors. I can see several checkpoint step files.
I also tried to restart the training by setting the resume path to the folder containing the step files, but that gives the error: {folder} is a directory.
Any help would be highly appreciated.
Thanks
Hello! For the first issue: training stops once it reaches the maximum number of training steps, so the run ending without an error is expected. You can change the limit by setting max_steps in your training configuration file. For the second issue: the resume parameter should be set to the path of a single step file, not to the folder containing the step files.
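In case it helps, here is a rough Python sketch of both points. It is not part of the project: the helper names `total_steps` and `latest_step_file` are made up for illustration, the batch size of 4 is only an example, and it assumes each step file carries its step number in the file name (something like `step_2000.ckpt`, which is a hypothetical name). It also assumes one optimizer step per batch, so gradient accumulation would change the count.

```python
import math
import re
from pathlib import Path
from typing import Optional


def total_steps(num_images: int, batch_size: int, epochs: int) -> int:
    """Steps needed for `epochs` passes over the data, assuming one step per batch."""
    return math.ceil(num_images / batch_size) * epochs


def latest_step_file(folder: str) -> Optional[Path]:
    """Return the file in `folder` whose name ends with the largest step number.

    Assumes (hypothetically) that the step number is the last run of digits
    in the file name, e.g. step_2000.ckpt. Returns None if nothing matches.
    """
    best, best_step = None, -1
    for path in Path(folder).iterdir():
        if not path.is_file():
            continue  # skip sub-directories; resume needs a file, not a folder
        numbers = re.findall(r"\d+", path.stem)
        if not numbers:
            continue  # file name has no step number
        step = int(numbers[-1])
        if step > best_step:
            best, best_step = path, step
    return best


if __name__ == "__main__":
    # Example: 2000 images, hypothetical batch size 4, 900 epochs.
    print(total_steps(2000, 4, 900))
```

Comparing the result of `total_steps` with your current max_steps should show whether the run simply reached its limit. The path returned by `latest_step_file` (a file, not the folder) is what you would then put in resume.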
Hello, is there a way to train the model on a GPU with 10GB of VRAM, similar to how tiling is used during inference? I am unable to complete Stage 1 training due to insufficient memory.
Have you done it?