PaddleOCR PPOCRv3-recognition on two v100 is taking too long

Open bely66 opened this issue 2 years ago • 2 comments

I'm training on my own dataset of 9M images on a 2xV100 machine, with the number of epochs set to 500. The estimated training time is more than 200 days, which is too long given that this is close to your setup with a 4xV100 machine.

I'm using the Docker version of Paddle to train the model. CUDA: 10.2, cuDNN: 7.6

Also, I'm noticing that GPU utilization is very low during training.

Oct 28 '22 10:10 bely66

@WenmuZhou would it be hard to shorten the training time and optimize GPU utilization?

Oct 30 '22 07:10 bely66

> The time estimated for training is more than 200 days

If the dataset is very large and you are loading the pretrained model, there is no need to train for so many epochs. Once eval acc meets your requirements, you can stop training.
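To put that advice in numbers, here is a rough sketch using only the figures quoted in this issue (9M images, 500 epochs, roughly 200 days). It assumes the estimate scales linearly with the epoch count, which ignores evaluation and checkpointing overhead. Since the reported estimate was "more than 200 days", treat the results as a lower bound on time and an upper bound on throughput. The helper name `estimate_days` is made up for illustration.

```python
def estimate_days(ref_days: float, ref_epochs: int, epochs: int) -> float:
    """Scale a training-time estimate linearly with the number of epochs."""
    return ref_days / ref_epochs * epochs


# Figures quoted in the issue.
REF_DAYS = 200.0        # estimated total training time, in days
REF_EPOCHS = 500        # configured number of epochs
NUM_IMAGES = 9_000_000  # dataset size

days_per_epoch = estimate_days(REF_DAYS, REF_EPOCHS, 1)
hours_per_epoch = days_per_epoch * 24
# Implied throughput summed over both GPUs.
images_per_sec = NUM_IMAGES / (days_per_epoch * 24 * 3600)

print(f"one epoch: about {hours_per_epoch:.1f} hours")
print(f"implied throughput: about {images_per_sec:.0f} images/s")

# Hypothetical shorter schedules when fine-tuning from a pretrained model.
for epochs in (10, 20, 50):
    days = estimate_days(REF_DAYS, REF_EPOCHS, epochs)
    print(f"{epochs} epochs: about {days:.0f} days")
```

At these figures a single pass over the data takes most of half a day, so checking eval acc after each of the first few epochs is a cheap way to decide when to stop.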

> would it be hard to shorten the training time and optimize GPU utilization?

GPU utilization should not be this low; try increasing num_workers. When we train on V100s, average utilization can reach more than 80%. Also, which Paddle version are you using?
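As a starting point for choosing a worker count, the sketch below splits the machine's CPU cores across the GPU training processes. The heuristic itself (one core reserved per process) is an assumption for illustration, not a PaddleOCR recommendation, and the function name `suggest_num_workers` is hypothetical. The printed `Train.loader.num_workers` key follows the layout of the stock PaddleOCR YAML configs; check it against your own config before passing it as an override.

```python
import os
from typing import Optional


def suggest_num_workers(num_gpus: int,
                        cpu_count: Optional[int] = None,
                        reserve: int = 1) -> int:
    """Suggest data-loader workers per GPU process.

    Heuristic (an assumption): divide the CPU cores evenly among the
    GPU processes and keep `reserve` cores per process free for the
    training loop itself. Always returns at least 1.
    """
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    per_process = cores // max(num_gpus, 1) - reserve
    return max(1, per_process)


# Two GPUs, as in this issue; the core count is read from the machine.
workers = suggest_num_workers(num_gpus=2)
print(f"Train.loader.num_workers={workers}")
```

Raise the value gradually while watching GPU utilization; once utilization stops improving, the bottleneck has likely moved elsewhere, for example to disk reads or image decoding.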

Nov 04 '22 07:11 tink2123
