Disable validation during training?
I just started working with RF-DETR, and while training is going smoothly, validation crashes on most (but not all) epochs, not necessarily with a consistent error or at a consistent point, typically with a torch.distributed.elastic.multiprocessing.errors.ChildFailedError (I am training on two GPUs).
Rather than trying to debug this, I'd like to disable validation entirely and run val on every checkpoint at the end. Is that possible? I don't see any equivalent of the "run_test" argument (e.g. "run_val" or "disable_val"). If there's no functionality for this, is there a recommended workaround? For example, is an empty "valid" folder allowed, or is a single image in the "valid" folder the closest I can get?
Thanks!
@agentmorris Every checkpoint is around 450 MB. Run for 20 epochs, and that takes up 20 * 450 MB. That is quite an expensive way to train.
When you run val on all 20 checkpoints afterwards, it's not going to save you any time, AFAIK. It is the same computation, just with extra steps.
A one-image val split is a quick way to defer running validation on all checkpoints. But again, you will end up spending the same amount of time when you run validation on the real val split.
I ran into similar problems and was able to speed up evaluation about 7x in my use case. Not sure if that will also hold true for you, but I hope this helps: https://github.com/roboflow/rf-detr/issues/416#issuecomment-3454074446
Thanks for the suggestions! I'm not trying to save time in this case, though; I'm just trying to make it through training without crashing during validation, which for whatever reason is less stable than training on my setup (you would think the opposite, but no: I've seen zero crashes during training and roughly a 50% chance of crashing during each validation pass). The size of the checkpoints is negligible compared to my training data, so I save every epoch's checkpoint anyway.
If I don't stumble on a way to disable validation that I'm missing, I'll cheat with the single-image validation set and hope that reduces the probability of crashing during validation to epsilon.
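If I go that route, something like the sketch below would probably do it. This assumes the usual Roboflow-style COCO export layout with an _annotations.coco.json file per split; the paths are placeholders, so adjust to your dataset.

from rfdetr import RFDETRNano  # not needed for this step; shown only for context

import json
import shutil
from pathlib import Path

src = Path("dataset/valid")          # original valid split (placeholder path)
dst = Path("dataset_oneimg/valid")   # new single-image valid split
dst.mkdir(parents=True, exist_ok=True)

coco = json.loads((src / "_annotations.coco.json").read_text())

# Keep only the first image and the annotations that reference it.
keep_img = coco["images"][0]
coco["images"] = [keep_img]
coco["annotations"] = [a for a in coco["annotations"] if a["image_id"] == keep_img["id"]]

(dst / "_annotations.coco.json").write_text(json.dumps(coco))
shutil.copy(src / keep_img["file_name"], dst / keep_img["file_name"])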
Which model are you using? What is your dataset image resolution?
While using the segmentation model, roughly 13,000 output masks per image are resized to their original dataset resolution before COCO evaluation is called. That used to cause OOM failures for me during eval. This resizing doesn't happen in training, so training is always smooth while validation used to fail.
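As a rough back-of-the-envelope illustration (the 1500 x 1500 resolution below is only an example, not a measurement from my runs), holding that many full-resolution masks in memory adds up quickly:

# Rough memory estimate for keeping all predicted masks at dataset resolution.
# The resolution here is just an example; plug in your own image size.
num_masks = 13_000           # masks per image produced at eval time
height, width = 1500, 1500   # example original image resolution

bytes_per_mask = height * width          # 1 byte per pixel (uint8 mask)
total_gb = num_masks * bytes_per_mask / 1024**3
print(f"~{total_gb:.1f} GB just for the resized masks of one image")
# => ~27 GB, before counting any other eval buffers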
I'm currently testing with RF-DETR Nano, training at an image size of 640px; the images are 1600px on the long edge. This is a test pass to get everything set up before a larger training run at a higher resolution with a larger model. My sense is that the crashing is not data-specific, given that it doesn't happen 100% of the time; it's more likely some combination of CUDA configuration, Linux version, PyTorch version, and other system properties that will be a hassle to debug, hence my inclination to just defer validation.
@agentmorris, PR to disable/control validation: https://github.com/roboflow/rf-detr/pull/452
The PR, combined with the following code snippet, skips the evaluation process altogether.
In the example below, epochs is set to 5 and validation_interval is set to 6; since validation_interval is greater than the number of epochs, validation never runs.
from rfdetr import RFDETRNano

model = RFDETRNano()
model.train(
    dataset_dir="/home/abdul/projects/rf-detr/datasets_downloads/basketball-player-detection-2",
    epochs=5,
    batch_size=2,
    grad_accum_steps=1,
    lr=1e-4,
    num_workers=2,
    device='cuda',
    checkpoint_interval=1,   # still save a checkpoint every epoch
    validation_interval=6,   # greater than epochs, so validation never runs
)