pytorch-YOLOv4

Has anyone encountered the situation where training runs on the CPU but gets stuck in the first epoch on the GPU?

Open EternalEvan opened this issue 3 years ago • 5 comments

Has anyone encountered the situation where training runs on the CPU but gets stuck in the first epoch on the GPU? I get results when training with the CPU, but when I train on my own data with the GPU, it gets stuck. Can someone help me?

CPU:

```
2020-10-29 23:42:12,062 train.py[line:611] INFO: Using device cpu
2020-10-29 23:42:13,583 train.py[line:327] INFO: Starting training:
        Epochs:          5
        Batch size:      4
        Subdivisions:    1
        Learning rate:   0.001
        Training size:   21
        Validation size: 4
        Checkpoints:     True
        Device:          cpu
        Images size:     608
        Optimizer:       adam
        Dataset classes: 3
        Train label path:train.txt
        Pretrained:
```

```
Epoch 1/5:  95%|▉| 20/21 [10:25<00:31, 31.65s/img]
in function convert_to_coco_api...
You could also create your own 'get_image_id' function.
You could also create your own 'get_image_id' function.
You could also create your own 'get_image_id' function.
You could also create your own 'get_image_id' function.
creating index...
index created!
You could also create your own 'get_image_id' function.
You could also create your own 'get_image_id' function.
You could also create your own 'get_image_id' function.
You could also create your own 'get_image_id' function.
Accumulating evaluation results...
DONE (t=0.13s).
IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] =  0.000
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] =  0.000
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] =  0.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] =  0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] =  0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] =  0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] =  0.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] =  0.000
```

GPU:

```
2020-10-30 13:54:17,456 train.py[line:611] INFO: Using device cuda
2020-10-30 13:54:20,094 train.py[line:327] INFO: Starting training:
        Epochs:          5
        Batch size:      4
        Subdivisions:    1
        Learning rate:   0.001
        Training size:   21
        Validation size: 4
        Checkpoints:     True
        Device:          cuda
        Images size:     608
        Optimizer:       adam
        Dataset classes: 3
        Train label path:train.txt
        Pretrained:
```

```
Epoch 1/5:  95%|▉| 20/21 [00:17<00:01, 1.01s/img]
in function convert_to_coco_api...
You could also create your own 'get_image_id' function.
You could also create your own 'get_image_id' function.
You could also create your own 'get_image_id' function.
You could also create your own 'get_image_id' function.
creating index...
index created!
```

It hangs here and never prints the evaluation results.
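For anyone debugging the same hang, here is a minimal diagnostic sketch (my own addition, not code from this repo) that periodically dumps every thread's stack, so you can see which call never returns. Add it near the top of train.py:

```python
import sys
import faulthandler

# Dump all thread stacks to stderr every 60 seconds.
# If the run is hung, the repeated dumps show exactly where it is
# stuck (e.g. a DataLoader worker join or a CUDA synchronization).
faulthandler.dump_traceback_later(60, repeat=True, file=sys.stderr)
```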

EternalEvan · Oct 30 '20

I'm hitting the same problem. Did you find anything?

uniyushu · Nov 10 '20

This might be because your evaluation dataset is large. It appears the evaluation runs on the CPU, though I'm not sure; that would be one explanation for the slowness.
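One quick way to check (a sketch; `model` here stands for whatever network evaluate() in train.py receives, which is an assumption on my part):

```python
import torch

# Where do the eval model's weights actually live?
print(next(model.parameters()).device)  # expect cuda:0 if eval uses the GPU

# Is CUDA visible at all inside the evaluation code path?
print(torch.cuda.is_available())
```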

gytdau · Jan 20 '21

Try setting the DataLoader's num_workers=0:

```python
val_loader = DataLoader(val_dataset,
                        batch_size=config.batch // config.subdivisions,
                        shuffle=True,
                        num_workers=0,
                        pin_memory=True,
                        drop_last=True,
                        collate_fn=val_collate)
```

It seems like a bug in the PyTorch DataLoader.
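If you'd rather keep multiple workers, a hedged alternative sketch: fork-started workers can deadlock when CUDA is already initialized in the parent process, and switching the multiprocessing start method to 'spawn' avoids inheriting the CUDA context (general PyTorch behavior, not something specific to this repo):

```python
import torch.multiprocessing as mp

if __name__ == '__main__':
    # 'spawn' starts workers in fresh interpreters instead of fork(),
    # so they don't inherit a half-initialized CUDA context.
    mp.set_start_method('spawn', force=True)
```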

swxu · Feb 01 '21

You need to change the PyTorch version. I changed it to 1.5.0, and train.py ran successfully on the GPU.
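To confirm which build you are actually running before and after the downgrade, e.g.:

```python
import torch

print(torch.__version__)   # e.g. '1.5.0'
print(torch.version.cuda)  # CUDA toolkit version the wheel was built with
```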

asebaq · Mar 06 '21

@swxu @asebaq I was going crazy debugging. Thanks.

jcmayoral · Nov 11 '21