
Training is too slow

Open itachi1232gg opened this issue 10 months ago • 16 comments

I would like to compare yolov6-L6 and yolov9-e on my own dataset. Training yolov6-L6 takes around 20 minutes per epoch, while training yolov9-e takes 120 minutes. I launch yolov9-e training with this shell command: python -m torch.distributed.launch --nproc_per_node 4 --master_port 9527 train_dual.py --workers 8 --device 0,1,2,3 --sync-bn --batch 64

itachi1232gg avatar Apr 17 '24 09:04 itachi1232gg

Try lowering --workers and --batch. For example, use --workers 1 --batch 2. This works on my laptop (with a single RTX 2070 card).

FirokOtaku avatar Apr 18 '24 08:04 FirokOtaku

Try lowering --workers and --batch. For example, use --workers 1 --batch 2. This works on my laptop (with a single RTX 2070 card).

Thanks for the advice, but this does not work on my setup with 4× 80 GB A100 GPUs and dual Intel Xeon Gold 6248R CPUs.

itachi1232gg avatar Apr 18 '24 08:04 itachi1232gg

Did you solve it? The same issue happens to me too! Mine is an RTX 4080.

TakeshiGianlee avatar Apr 18 '24 18:04 TakeshiGianlee

I just tried lowering my batch size from 16 to 8 and my num_workers from 20 to 16, and it looks faster than before.

TakeshiGianlee avatar Apr 18 '24 19:04 TakeshiGianlee

@itachi1232gg torch.distributed.launch is deprecated; maybe you should try this command instead: torchrun --standalone --nnodes=1 --nproc-per-node=4 train_dual.py --workers 8 --device 0,1,2,3 --sync-bn --batch 64. You can check torchrun's documentation for the meaning of the arguments. I think the best practice is to set:

  • the nproc-per-node argument equal to the number of your GPUs
  • the workers argument greater than or equal to the number of your CPU cores (YOLOv9's source code will automatically limit the number of dataloader workers to the number of available CPU cores)
  • the batch argument to a value divisible by the number of your GPUs

I hope it will help you.
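
Putting the suggestions above together for the original 4-GPU setup, a launch command might look like the following sketch. The flag spellings follow the yolov5-style argparse in train_dual.py; dataset, config, and weight arguments from your own setup still need to be added.

```bash
# Sketch of a torchrun launch for 4 GPUs; adjust --data, --cfg, --weights
# and other arguments to match your own training setup.
torchrun --standalone --nnodes=1 --nproc-per-node=4 train_dual.py \
  --workers 8 \
  --device 0,1,2,3 \
  --sync-bn \
  --batch 64   # batch size divisible by the number of GPUs (4)
```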

minhnhathcmus avatar Apr 19 '24 04:04 minhnhathcmus

I just tried lowering my batch size from 16 to 8 and my num_workers from 20 to 16, and it looks faster than before.

tried it, not working

itachi1232gg avatar Apr 20 '24 10:04 itachi1232gg

@itachi1232gg torch.distributed.launch is deprecated; maybe you should try this command instead: torchrun --standalone --nnodes=1 --nproc-per-node=4 train_dual.py --workers 8 --device 0,1,2,3 --sync-bn --batch 64. You can check torchrun's documentation for the meaning of the arguments. I think the best practice is to set:

  • the nproc-per-node argument equal to the number of your GPUs
  • the workers argument greater than or equal to the number of your CPU cores (YOLOv9's source code will automatically limit the number of dataloader workers to the number of available CPU cores)
  • the batch argument to a value divisible by the number of your GPUs

I hope it will help you.

Thanks for the advice, but to my knowledge torchrun does not make training faster; it is only a newer launcher for DDP, and the training still uses DDP to split the data and model across multiple GPUs.

itachi1232gg avatar Apr 20 '24 10:04 itachi1232gg

  1. Lowering the batch size (from 64 to 48): not working.
  2. Using torchrun to start the training: not working.
  3. Using --cache ram to cache the data into memory: consumes around 600 GB of RAM, but does not accelerate the training. It looks like the bottleneck is the computation, not the data loading or anything else (see the monitoring sketch below).
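
For anyone debugging a similar case, one way to check whether a run is compute-bound or dataloader-bound is to watch GPU utilization while training. A minimal sketch using the standard nvidia-smi query interface:

```bash
# Log per-GPU utilization and memory once per second while training runs.
# Sustained high utilization suggests the GPUs (compute) are the bottleneck;
# long idle gaps usually point at the dataloader / augmentation pipeline.
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 1
```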

itachi1232gg avatar Apr 20 '24 10:04 itachi1232gg

I've experienced similar slow training speeds, and I noticed the GPU usage fluctuates (between 0 and 90%) during training, indicating that the GPU is not fully utilized.

It turns out there is a bottleneck in preprocessing (especially Mosaic augmentation), which is CPU-bound. I then tried tuning the hyper-parameter down from mosaic: 1.0 to mosaic: 0.5, or even mosaic: 0.0; the training speed increased, but at the cost of some mAP.
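
A rough sketch of how mosaic can be turned down in this repo, assuming the yolov5-style hyperparameter YAML under data/hyps/ and the --hyp flag of train_dual.py (file names may differ in your checkout):

```bash
# Copy the default hyperparameter file and lower the mosaic probability.
# Paths assume the repo's yolov5-style data/hyps/ layout.
cp data/hyps/hyp.scratch-high.yaml data/hyps/hyp.low-mosaic.yaml
sed -i 's/^mosaic:.*/mosaic: 0.5/' data/hyps/hyp.low-mosaic.yaml

# Point training at the modified hyperparameters.
python train_dual.py --hyp data/hyps/hyp.low-mosaic.yaml --workers 8 --device 0 --batch 16
```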

ethanlee928 avatar Apr 22 '24 04:04 ethanlee928

I've experienced similar slow training speeds, and I noticed the GPU usage fluctuates (between 0 and 90%) during training, indicating that the GPU is not fully utilized.

It turns out there is a bottleneck in preprocessing (especially Mosaic augmentation), which is CPU-bound. I then tried tuning the hyper-parameter down from mosaic: 1.0 to mosaic: 0.5, or even mosaic: 0.0; the training speed increased, but at the cost of some mAP.

How much does the training speed increase when you disable mosaic?
Mosaic does slow down training, but YOLOv5 through v9 use the same code for mosaic augmentation, and training YOLOv5 to v8 is not as slow as YOLOv9. Training yolov6-L6 (the largest yolov6 model, larger than yolov9-e) takes 17 minutes per epoch, while training yolov9-e takes 3 hours; that is more than 10x slower.

itachi1232gg avatar Apr 22 '24 05:04 itachi1232gg

I found that decreasing the image size also helps speed things up. When I changed --img from 800 to 640, the training time decreased from 52 min/epoch to 22 min/epoch. But I'm not sure whether accuracy was compromised.
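
For reference, a sketch of the same idea with the flags used earlier in this thread; --img is the yolov5-style alias for --imgsz in train_dual.py, and the exact value is a speed/accuracy trade-off:

```bash
# Same torchrun launch as above, but training at 640x640 instead of 800x800.
torchrun --standalone --nnodes=1 --nproc-per-node=4 train_dual.py \
  --workers 8 --device 0,1,2,3 --sync-bn --batch 64 --img 640
```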

Ying5775 avatar Apr 26 '24 14:04 Ying5775

Same question.

I am trying to compare yolovx and yolov9 on my dataset.

My GPUs are four V100 32G.
[yolovx - DP] 10 mins per epoch
[yolov9 - DP] 40 mins per epoch

After that, I used DDP and a larger batch size (increased from 256 to 416 to use all 32 GB of GPU memory); each epoch then takes about 19 minutes, which is still very slow.

Then I changed validation from every epoch to every 5 epochs, which saves a little time.

QuarTerll avatar May 07 '24 11:05 QuarTerll

@itachi1232gg What about now? Did you solve it?

@WongKinYiu Did you ever meet this issue when you guys ran experiments?

QuarTerll avatar May 08 '24 07:05 QuarTerll

@QuarTerll You may want to train yolov9 using the Ultralytics implementation; see https://docs.ultralytics.com/models/yolov9/. I trained yolov9e using Ultralytics, and each epoch took around 40 minutes, while this repo took 3 hours. The final eval results are almost the same.
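
A minimal sketch of that route using the Ultralytics CLI, following the yolo train syntax from their docs; data.yaml here is a placeholder for your own dataset config:

```bash
# Install the Ultralytics package and launch YOLOv9-E training on 4 GPUs.
pip install ultralytics
yolo detect train model=yolov9e.pt data=data.yaml epochs=100 imgsz=640 \
  batch=64 device=0,1,2,3
```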

itachi1232gg avatar May 08 '24 07:05 itachi1232gg

@QuarTerll You may want to train yolov9 using the Ultralytics implementation; see https://docs.ultralytics.com/models/yolov9/. I trained yolov9e using Ultralytics, and each epoch took around 40 minutes, while this repo took 3 hours. The final eval results are almost the same.

OK fine. Thanks.

I thought the training code was almost the same; I will try it.

QuarTerll avatar May 08 '24 08:05 QuarTerll

I find that YOLOv9 is slower than any other version, even when the number of parameters is similar.

magic524 avatar Jun 14 '24 08:06 magic524