pytorch-YOLOv4

Has anyone encountered the situation where training runs on the CPU but gets stuck in the first epoch on the GPU?

Open EternalEvan opened this issue 3 years ago • 5 comments

Has anyone encountered the situation where training runs on the CPU but gets stuck in the first epoch on the GPU? I get results when training with the CPU, but when I train on my own data with the GPU, it gets stuck. Can someone help me?

CPU:

```
2020-10-29 23:42:12,062 train.py[line:611] INFO: Using device cpu
2020-10-29 23:42:13,583 train.py[line:327] INFO: Starting training:
        Epochs:          5
        Batch size:      4
        Subdivisions:    1
        Learning rate:   0.001
        Training size:   21
        Validation size: 4
        Checkpoints:     True
        Device:          cpu
        Images size:     608
        Optimizer:       adam
        Dataset classes: 3
        Train label path:train.txt
        Pretrained:
```

```
Epoch 1/5:  95%|▉| 20/21 [10:25<00:31, 31.65s/img]
in function convert_to_coco_api...
You could also create your own 'get_image_id' function.
You could also create your own 'get_image_id' function.
You could also create your own 'get_image_id' function.
You could also create your own 'get_image_id' function.
creating index...
index created!
You could also create your own 'get_image_id' function.
You could also create your own 'get_image_id' function.
You could also create your own 'get_image_id' function.
You could also create your own 'get_image_id' function.
Accumulating evaluation results...
DONE (t=0.13s).
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] =  0.000
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] =  0.000
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] =  0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] =  0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] =  0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] =  0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] =  0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] =  0.000
```

GPU:

```
2020-10-30 13:54:17,456 train.py[line:611] INFO: Using device cuda
2020-10-30 13:54:20,094 train.py[line:327] INFO: Starting training:
        Epochs:          5
        Batch size:      4
        Subdivisions:    1
        Learning rate:   0.001
        Training size:   21
        Validation size: 4
        Checkpoints:     True
        Device:          cuda
        Images size:     608
        Optimizer:       adam
        Dataset classes: 3
        Train label path:train.txt
        Pretrained:
```

```
Epoch 1/5:  95%|▉| 20/21 [00:17<00:01, 1.01s/img]
in function convert_to_coco_api...
You could also create your own 'get_image_id' function.
You could also create your own 'get_image_id' function.
You could also create your own 'get_image_id' function.
You could also create your own 'get_image_id' function.
creating index...
index created!
```

It hangs here and never prints the evaluation results.
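For anyone debugging the same hang, here is a minimal diagnostic sketch (my own addition, not code from this repo) that periodically dumps every thread's stack, so you can see which call never returns. Add it near the top of train.py:

```python
import sys
import faulthandler

# Dump all thread stacks to stderr every 60 seconds.
# If the run is hung, the repeated dumps show exactly where it is
# stuck (e.g. a DataLoader worker join or a CUDA synchronization).
faulthandler.dump_traceback_later(60, repeat=True, file=sys.stderr)
```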

EternalEvan · Oct 30 '20

I'm hitting the same problem. Did you find anything?

uniyushu · Nov 10 '20

This might be because your evaluation dataset is large. It appears the evaluation runs on the CPU, though I'm not sure; that would be one explanation for the slowness.
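One quick way to check (a sketch; `model` here stands for whatever network evaluate() in train.py receives, which is an assumption on my part):

```python
import torch

# Where do the eval model's weights actually live?
print(next(model.parameters()).device)  # expect cuda:0 if eval uses the GPU

# Is CUDA visible at all inside the evaluation code path?
print(torch.cuda.is_available())
```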

gytdau · Jan 20 '21

Try setting the DataLoader's num_workers=0:

```python
val_loader = DataLoader(val_dataset,
                        batch_size=config.batch // config.subdivisions,
                        shuffle=True,
                        num_workers=0,
                        pin_memory=True,
                        drop_last=True,
                        collate_fn=val_collate)
```

It seems like a bug in the PyTorch DataLoader.
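If you'd rather keep multiple workers, a hedged alternative sketch: fork-started workers can deadlock when CUDA is already initialized in the parent process, and switching the multiprocessing start method to 'spawn' avoids inheriting the CUDA context (general PyTorch behavior, not something specific to this repo):

```python
import torch.multiprocessing as mp

if __name__ == '__main__':
    # 'spawn' starts workers in fresh interpreters instead of fork(),
    # so they don't inherit a half-initialized CUDA context.
    mp.set_start_method('spawn', force=True)
```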

swxu · Feb 01 '21

You need to change the PyTorch version. I changed it to 1.5.0, and train.py ran successfully on the GPU.
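To confirm which build you are actually running before and after the downgrade, e.g.:

```python
import torch

print(torch.__version__)   # e.g. '1.5.0'
print(torch.version.cuda)  # CUDA toolkit version the wheel was built with
```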

asebaq · Mar 06 '21

@swxu @asebaq I was going crazy debugging. Thanks.

jcmayoral · Nov 11 '21