yolov9
Training is too slow
I would like to compare yolov6-L6 and yolov9-e on my own dataset. Training yolov6-L6 takes around 20 minutes per epoch, while training yolov9-e takes 120 minutes. I launch yolov9-e training with this shell command:
python -m torch.distributed.launch --nproc_per_node 4 --master_port 9527 train_dual.py --workers 8 --device 0,1,2,3 --sync-bn --batch 64
Try lowering --workers and --batch size. For example, use --workers 1 --batch 2. This works on my laptop (with a single RTX 2070 card).
Thanks for the advice, but this does not work on my setup with 4x 80GB A100 GPUs and dual Intel Xeon Gold 6248R CPUs.
Did you solve that? The same issue happens to me too! Mine is an RTX 4080.
I just tried lowering my batch size from 16 to 8 and my num_workers from 20 to 16, and it looks faster than before.
@itachi1232gg torch.distributed.launch is deprecated; maybe you should try this command instead: torchrun --standalone --nnodes=1 --nproc-per-node=4 train_dual.py --workers 8 --device 0,1,2,3 --sync-bn --batch 64. You can check torchrun's documentation for the meaning of the arguments. I think the best practice is to set:
- the nproc-per-node argument equal to the number of your GPUs
- the workers argument greater than or equal to the number of your CPU cores; YOLOv9's source code will automatically limit the number of dataloader workers based on your CPU count (see the sketch below)
- the batch argument to be divisible by the number of your GPUs
I hope it helps.
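For reference, here is a minimal sketch of how that worker limit seems to be computed, reconstructed from the YOLOv5-style dataloader this repo inherits. The function name and exact formula are my own approximation; check utils/dataloaders.py for the real logic.

```python
import os
import torch

def effective_num_workers(requested_workers: int, per_gpu_batch: int) -> int:
    """Approximate reconstruction of the dataloader worker cap (illustrative only)."""
    num_gpus = max(torch.cuda.device_count(), 1)
    cpu_cores = os.cpu_count() or 1
    # Each DDP process gets at most (CPU cores / GPUs) workers, and never more
    # than the per-GPU batch size or the value passed via --workers.
    return min(cpu_cores // num_gpus, per_gpu_batch if per_gpu_batch > 1 else 0, requested_workers)

# Example: 4 GPUs, --batch 64 (16 per GPU), --workers 8 on a 96-core machine -> 8 workers per process
print(effective_num_workers(requested_workers=8, per_gpu_batch=16))
```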
Tried lowering the batch size and the number of workers; it is not working.
Thanks for the advice, but to my knowledge torchrun does not make training faster; it is only a newer way to launch DDP, and training still uses DDP to split the data and model across multiple GPUs.
- Lowering the batch size (from 64 to 48): not working.
- Using torchrun to start the training: not working.
- Using --cache ram to cache the data into memory: consumes around 600 GB of RAM but does not accelerate the training.
It looks like the bottleneck is in the compute, not in data loading or anything else.
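In case it helps anyone debugging the same thing, here is a quick, self-contained profiling sketch to see whether CUDA kernels or CPU-side work dominate. The model and data below are stand-ins, not the YOLOv9 pipeline; swap in the real dataloader and model to profile an actual training step.

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in model and fake batches (replace with the real YOLOv9 model/dataloader).
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(64, 3, 3, padding=1),
).cuda()
batches = [torch.randn(16, 3, 640, 640) for _ in range(10)]

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for imgs in batches:
        out = model(imgs.cuda(non_blocking=True))
        out.sum().backward()

# If CUDA kernels dominate, the bottleneck is compute; if CPU ops (data loading,
# augmentation) dominate, it is preprocessing.
print(prof.key_averages().table(sort_by="self_cuda_time_total", row_limit=15))
```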
I've experienced similar situations with slow training, where I noticed the GPU usage fluctuating (between 0 and 90%) during training, indicating that the GPU is not fully utilized.
It turns out there is a bottleneck in preprocessing (especially Mosaic augmentation), which is CPU bound. I then tried tuning down the hyper-parameter mosaic: 1.0 -> mosaic: 0.5, or even mosaic: 0.0; the training speed increased, but with a compromise in mAP.
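If anyone wants to try the same change, this is roughly how to override it. The file path below is an assumption based on this repo's layout; whichever file you pass to --hyp is the one that matters, and the key is simply mosaic.

```python
import yaml  # PyYAML

hyp_path = "data/hyps/hyp.scratch-high.yaml"  # assumption: use whichever file you pass to --hyp
with open(hyp_path) as f:
    hyp = yaml.safe_load(f)

hyp["mosaic"] = 0.5  # 1.0 -> 0.5, or 0.0 to disable mosaic entirely (faster, but lower mAP)

with open(hyp_path, "w") as f:
    yaml.safe_dump(hyp, f, sort_keys=False)
```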
How much does the training speed increase when you turn off mosaic?
Mosaic does slow the training down, but YOLOv5 through v9 use the same code for Mosaic, and training YOLOv5 to v8 is not as slow as YOLOv9.
Training yolov6-L6 (the largest yolov6 model, which is larger than yolov9-e) takes 17 minutes per epoch, while training yolov9-e takes 3 hours; that is more than 10x slower.
I found that decreasing the image size also helps speed things up. When I changed the image size from 800 to 640, the training time decreased from 52 min/epoch to 22 min/epoch. But I'm not sure whether any accuracy was compromised.
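A back-of-envelope check (my own reasoning, not measured in the repo): per-image convolution cost scales roughly with the square of the input side, so 800 -> 640 alone only accounts for about a 1.56x reduction; the larger observed speedup suggests CPU-side preprocessing also gets cheaper at the smaller resolution.

```python
# Rough FLOP ratio from the resolution change alone.
flop_ratio = (800 / 640) ** 2   # ~1.56x more per-image compute at 800 vs 640
observed_speedup = 52 / 22      # ~2.4x from the reported 52 -> 22 min/epoch
print(f"resolution accounts for ~{flop_ratio:.2f}x, observed ~{observed_speedup:.1f}x")
```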
Same question.
I am trying to compare yolovx and yolov9 on my dataset.
My GPUs are four V100 32G: [yolovx - DP] 10 mins per epoch, [yolov9 - DP] 40 mins per epoch.
After that, I used DDP and a larger batch size (256 to 416, to use all 32 GB per GPU); then it is about 19 mins per epoch, which is still very slow.
Then I changed validation from every epoch to every 5 epochs to save a little time.
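For anyone wondering how to validate less often, this is an illustrative sketch of the condition to change, not the repo's exact code; the real check lives in the training loop of train_dual.py, and the names below are assumptions.

```python
VAL_INTERVAL = 5  # validate every 5th epoch instead of every epoch

def should_validate(epoch: int, final_epoch: bool, noval: bool = False) -> bool:
    """Always validate on the last epoch; otherwise only every VAL_INTERVAL epochs."""
    if final_epoch:
        return True
    if noval:
        return False
    return epoch % VAL_INTERVAL == 0
```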
@itachi1232gg Any update? Did you solve it?
@WongKinYiu Have you ever run into this issue in your own experiments?
@QuarTerll You may want to train yolov9 using the Ultralytics implementation; see https://docs.ultralytics.com/models/yolov9/. I trained yolov9e using Ultralytics: each epoch took around 40 mins, while this repo took 3 hours. The final eval results are almost the same.
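For reference, this is more or less the minimal Ultralytics training call; the dataset file and training arguments below are placeholders, so adjust them for your own data.

```python
# pip install ultralytics
from ultralytics import YOLO

model = YOLO("yolov9e.pt")        # pretrained YOLOv9-E weights from Ultralytics
model.train(
    data="my_dataset.yaml",       # placeholder: your dataset config
    epochs=100,                   # placeholder values
    imgsz=640,
    batch=64,
    device=[0, 1, 2, 3],          # multi-GPU; Ultralytics handles DDP internally
)
```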
OK fine. Thanks.
I thought the training code was almost the same. I will try it.
I find that YOLOv9 is slower than any other version, even when the number of parameters is similar.