
PP-OCRv3 recognition training on two V100s is taking too long

Open bely66 opened this issue 2 years ago • 2 comments

I'm training on my dataset of 9M images on a 2xV100 machine, with the number of epochs set to 500. The time estimated for training is more than 200 days, which is too long given that this is close to your setup with a 4xV100 machine.

I'm using the Docker version of Paddle to train the model. CUDA: 10.2, cuDNN: 7.6

Also, I'm noticing that GPU utilization is very low during training: [image]
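For reference, the throughput implied by the 200-day estimate can be worked out from the numbers above. This is a rough sketch of the arithmetic only; the 20-epoch figure is a hypothetical value chosen for illustration, not a recommendation from this thread, and since the estimate is "more than 200 days" the computed rate is an upper bound.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def implied_throughput(num_images, epochs, days):
    """Images per second needed to finish `epochs` passes in `days` days."""
    return num_images * epochs / (days * SECONDS_PER_DAY)

def days_needed(num_images, epochs, images_per_second):
    """Wall-clock days for `epochs` passes at a fixed throughput."""
    return num_images * epochs / images_per_second / SECONDS_PER_DAY

# Numbers from the issue: 9M images, 500 epochs, about 200 days.
rate = implied_throughput(9_000_000, 500, 200)
print(f"{rate:.0f} images/s")

# Hypothetical: the same throughput with only 20 epochs.
print(f"{days_needed(9_000_000, 20, rate):.1f} days for 20 epochs")
```

At a fixed throughput the training time scales linearly with the epoch count, so reducing epochs and raising images per second both shorten the run.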

bely66, Oct 28 '22 10:10

@WenmuZhou would it be hard to shorten the training time and optimize GPU utilization?

bely66, Oct 30 '22 07:10

> The time estimated for training is more than 200 days

If the dataset is very large and you start from the pretrained model, there is no need to train for that many epochs. Once eval acc meets your requirements, you can stop training.
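The idea of stopping once the evaluation accuracy is good enough can be sketched as a plain loop. This is not PaddleOCR's training loop: `train_one_epoch`, `evaluate`, and the 0.90 target are hypothetical stand-ins used only to illustrate the stopping rule.

```python
def train_until_target(train_one_epoch, evaluate, target_acc, max_epochs):
    """Train epoch by epoch and stop as soon as eval accuracy reaches the target.

    Returns (last_epoch_run, best_accuracy_seen).
    """
    best_acc = 0.0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(epoch)
        acc = evaluate()
        best_acc = max(best_acc, acc)
        if acc >= target_acc:
            return epoch, best_acc  # target met, skip the remaining epochs
    return max_epochs, best_acc

# Stand-in training and evaluation functions with made-up accuracies.
fake_accs = iter([0.62, 0.78, 0.86, 0.91, 0.93])
stopped_at, acc = train_until_target(
    train_one_epoch=lambda epoch: None,
    evaluate=lambda: next(fake_accs),
    target_acc=0.90,
    max_epochs=500,
)
print(stopped_at, acc)  # 4 0.91
```

With 500 configured epochs, the loop above ends after the fourth one because the made-up accuracy crosses the target there.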

> would it be hard to shorten the training time and optimize GPU utilization?

GPU utilization should not be this low. Try increasing num_workers in the Train.loader section of the config. When we train on V100s, average utilization can reach more than 80%. Also, which Paddle version are you using?
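One way to pick a starting value for the loader worker count is to split the available CPU cores across the GPUs. The helper below is a hypothetical heuristic, not something PaddleOCR provides; the `reserve` and `cap` defaults are assumptions, and the printed string assumes the config key is `Train.loader.num_workers` and that overrides are passed with `-o`.

```python
import os

def suggest_num_workers(num_gpus, cpu_count=None, reserve=2, cap=16):
    """Heuristic: share CPU cores evenly between the per-GPU data loaders.

    `reserve` cores are left for the training processes themselves and
    `cap` bounds the result; both defaults are assumptions.
    """
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    usable = max(1, cpus - reserve)
    return max(1, min(cap, usable // num_gpus))

# Two GPUs, as in the setup described in this issue.
workers = suggest_num_workers(num_gpus=2)
print(f"-o Train.loader.num_workers={workers}")
```

Treat the result as a first guess: raise it while GPU utilization keeps improving and host memory and CPU are not saturated.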

tink2123, Nov 04 '22 07:11