
Training is slow on Google Cloud Nvidia Tesla P100

[Open] Muthu2093 opened this issue 6 years ago · 7 comments

I am running my model on Google Cloud with 8 vCPUs, 52 GB RAM, and one GPU (an Nvidia Tesla P100), with a batch size of 16. But it is taking around 8 hours per epoch, from what I calculated.

I am running the training on the COCO dataset.

Note: torch.cuda.is_available() prints True when I check it in the console.
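For reference, the check is roughly this (a minimal sketch; device index 0 assumed):

```python
import torch

# Confirm CUDA is visible and which GPU PyTorch will use
print(torch.cuda.is_available())      # prints True on this instance
print(torch.cuda.get_device_name(0))  # should report the Tesla P100
```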

Could someone tell me where I might be going wrong?

Muthu2093 avatar Oct 20 '18 20:10 Muthu2093

Your GCP hard drive is IO-constrained. Switch to an SSD and/or increase its size.
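One way to confirm that the data pipeline, not the GPU, is the bottleneck is to time the dataloader with no model in the loop. A rough sketch, assuming the repo's dataset object is already constructed as in train.py:

```python
import time
from torch.utils.data import DataLoader

# Iterate the loader alone; if this is nearly as slow as a full training
# step, the bottleneck is disk IO / preprocessing rather than compute.
# (collate_fn omitted for brevity; reuse the repo's if it defines one.)
loader = DataLoader(dataset, batch_size=16, num_workers=8, pin_memory=True)

start = time.time()
for i, batch in enumerate(loader):
    if i == 50:
        break
print(f"{(time.time() - start) / 50:.3f} s per batch (data only)")
```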

glenn-jocher avatar Oct 21 '18 12:10 glenn-jocher

My instance already has a 50 GB SSD persistent disk. Do I need more?

I am mounting the resources (code + dataset) from a Google Cloud bucket. Will that affect training speed?
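A crude way to measure what the bucket mount costs is to compare per-image read times on the mount against the local SSD (both paths below are placeholders):

```python
import glob
import time

# Read a few hundred images from each location and compare latency.
# Point both placeholder paths at copies of the same files.
for root in ["/mnt/gcs-bucket/coco/images", "/home/user/data/coco/images"]:
    files = glob.glob(root + "/*.jpg")[:200]
    start = time.time()
    for path in files:
        with open(path, "rb") as f:
            f.read()
    print(root, f"{(time.time() - start) / len(files) * 1000:.1f} ms/image")
```

If the mounted reads are much slower, copying the dataset onto the local disk once before training (e.g. with gsutil -m cp -r) usually helps.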

Muthu2093 avatar Oct 22 '18 03:10 Muthu2093

Same issue!
Machine: 8 × 2080 Ti, 96 GB DDR4, 1 TB SSD, Xeon 6134 (3.2 GHz, 4 cores × 2)
Training dataset: VOC, 10k images
Batch size: 12 × 8
One epoch takes 14 minutes! It seems to create threads and load data every step; the GPUs work for 0.5 s and then wait for 3 s.

Please share speed-up tips if you have any. Thanks.
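One way to quantify the "GPUs work 0.5 s, wait 3 s" pattern is to time the data wait and the compute separately inside the training loop. A sketch; loader, model, and optimizer are assumed to be set up as in the repo's train.py, and the loop/loss signature is an assumption:

```python
import time
import torch

# Split each step into time waiting on data vs. time spent computing.
# 'loader', 'model', and 'optimizer' are assumed from the repo's train.py.
data_time = compute_time = 0.0
end = time.time()
for imgs, targets in loader:
    t_data = time.time()
    data_time += t_data - end             # time spent waiting on the loader
    imgs, targets = imgs.cuda(), targets.cuda()
    loss, outputs = model(imgs, targets)  # signature assumed from train.py
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()              # flush queued GPU work for timing
    end = time.time()
    compute_time += end - t_data
print(f"data wait: {data_time:.1f} s, compute: {compute_time:.1f} s")
```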

tensmyo avatar Jan 28 '19 01:01 tensmyo

Hi, there have been improvements to the dataloader during the past week or so, and I have measured a significant speedup. One epoch now takes me approximately one hour (on a 2080 Ti) with an 8-sample batch size.

eriklindernoren avatar Apr 27 '19 20:04 eriklindernoren

> Hi, there have been improvements to the dataloader during the past week or so, and I have measured a significant speedup. One epoch now takes me approximately one hour (on a 2080 Ti) with an 8-sample batch size.

It still seems to create threads and load data every step; the GPUs work for 0.5 s and then wait for 3 s. Volatile GPU-Util fluctuates constantly, and with n_cpu set to 16 all the CPUs sit at 100% or above.

CuiHaoran98 avatar Dec 19 '19 11:12 CuiHaoran98

> Hi, there have been improvements to the dataloader during the past week or so, and I have measured a significant speedup. One epoch now takes me approximately one hour (on a 2080 Ti) with an 8-sample batch size.

> It still seems to create threads and load data every step; the GPUs work for 0.5 s and then wait for 3 s. Volatile GPU-Util fluctuates constantly, and with n_cpu set to 16 all the CPUs sit at 100% or above.

Hi, has your problem been solved? My training speed is also too slow. My GPU is a Tesla T4, but it appears to be about as fast as a GTX 1050.
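On newer PyTorch releases (1.7+), one mitigation for the per-epoch worker respawn described above is persistent_workers. A sketch, with the dataset construction assumed from the repo:

```python
from torch.utils.data import DataLoader

# Keep worker processes alive across epochs instead of re-spawning them.
# persistent_workers requires PyTorch >= 1.7 and num_workers > 0.
loader = DataLoader(
    dataset,                  # the repo's dataset object (assumed)
    batch_size=16,
    num_workers=8,
    pin_memory=True,          # faster host-to-GPU copies
    persistent_workers=True,  # avoid worker startup cost every epoch
)
```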

948024326 avatar Jan 28 '21 13:01 948024326

Is this issue still relevant/occurring?

Flova avatar Sep 14 '21 09:09 Flova