yolov5 How can I increase my training speed when training a large training set with Yolov5?

Search before asking

[x] I have searched the YOLOv5 issues and discussions and found no similar questions.

Question

I'm using YOLOv5x to train on images with small defects. The training set has more than 2000 images, and it is usually augmented to 3 to 5 times that size. I want to keep detecting small defects while making training as fast as possible without changing imgsz. Is there a good way to do this? My graphics card is an RTX 3090. #13559

Additional

No response

Apr 16 '25 09:04 Eliana-23

👋 Hello @Eliana-23, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide a minimum reproducible example to help us debug it.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset image examples and training logs, and verify you are following our Tips for Best Training Results.

Requirements

Python>=3.8.0 with all requirements.txt installed including PyTorch>=1.8. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Notebooks with free GPU:
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

Status

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

This is an automated response. An Ultralytics engineer will assist you here soon! 😊

Apr 16 '25 09:04 UltralyticsAssistant

Hi @Eliana-23,

To improve training speed without changing image size, you can try:

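Use mixed precision training to leverage your 3090's Tensor Cores. Note that train.py already enables automatic mixed precision (AMP) by default on CUDA devices; the --half flag belongs to val.py and detect.py, not train.py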
Increase batch size (try 16, 32 or 64) for better GPU utilization
Enable image caching with --cache to reduce I/O bottlenecks
Consider using a slightly smaller model like YOLOv5l instead of YOLOv5x for a better speed/accuracy trade-off
Enable multi-scale training (--multi-scale) which can help with small defect detection while maintaining speed
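To make the batch-size suggestion concrete, here is a minimal sketch of how the number of optimizer steps per epoch shrinks as the batch grows. The dataset size of 2000 is an assumption taken from the "more than 2000" in the question, and `steps_per_epoch` is a hypothetical helper, not part of YOLOv5. Fewer steps do not guarantee a proportional speedup, since each step processes more images.

```python
import math


def steps_per_epoch(num_images: int, batch_size: int) -> int:
    """Number of batches needed to see every image once."""
    return math.ceil(num_images / batch_size)


# Assumed dataset size: the question only says "more than 2000" images.
NUM_IMAGES = 2000

for batch_size in (8, 16, 32, 64):
    steps = steps_per_epoch(NUM_IMAGES, batch_size)
    print(f"batch {batch_size:>2}: {steps} steps per epoch")
```

Whether a given batch size fits depends on GPU memory for YOLOv5x at the chosen imgsz, so it is worth stepping up gradually.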

For even more performance, you could explore Neural Magic's DeepSparse which provides GPU-class performance acceleration.

Let me know if you have questions about implementing any of these optimizations.

Apr 16 '25 20:04 pderrenger

Hello, and thank you for your reply. Following your suggestion, I enabled multi-scale training and mixed-precision training. However, training did not get any faster and the accuracy did not change significantly. Why is that? Is my training set too small? Here are the results of the two training runs.

1. Mixed-precision and multi-scale training turned off:

300 epochs completed in 0.839 hours.
Optimizer stripped from runs\train\exp49\weights\last.pt, 173.5MB
Optimizer stripped from runs\train\exp49\weights\best.pt, 173.5MB

Validating runs\train\exp49\weights\best.pt...
Fusing layers...
YOLOv5x summary: 322 layers, 86388742 parameters, 0 gradients, 204.4 GFLOPs
Class   Images  Instances  P      R      mAP50  mAP50-95: 100%|██████████| 3/3 00:00
all     35      49         0.852  0.908  0.92   0.577
YSA     35      8          0.95   1      0.995  0.298
MX      35      4          0.603  0.405  0.413  0.297
FS      35      1          0.932  1      0.995  0.697
SJYS    35      1          0.718  1      0.995  0.895
NQYS    35      2          0.823  1      0.995  0.696
NQLD    35      4          0.91   1      0.995  0.732
PTYS    35      1          0.763  1      0.995  0.597
PTHX    35      1          0.916  1      0.995  0.597
PTZW    35      10         1      0.9    0.962  0.414
CRYS    35      12         0.925  0.5    0.629  0.305
STZZD   35      2          0.858  1      0.995  0.25
BY      35      1          0.772  1      0.995  0.895
YJQCJ   35      2          0.906  1      0.995  0.821
Results saved to runs\train\exp49

2. Mixed-precision and multi-scale training turned on:

300 epochs completed in 1.046 hours.
Optimizer stripped from runs\train\exp50\weights\last.pt, 173.5MB
Optimizer stripped from runs\train\exp50\weights\best.pt, 173.5MB

Validating runs\train\exp50\weights\best.pt...
Fusing layers...
YOLOv5x summary: 322 layers, 86388742 parameters, 0 gradients, 204.4 GFLOPs
Class   Images  Instances  P      R      mAP50  mAP50-95: 100%|██████████| 3/3 00:00
all     35      49         0.736  0.804  0.886  0.455
YSA     35      8          0.918  1      0.995  0.303
MX      35      4          0.426  0.25   0.262  0.127
FS      35      1          1      0      0.995  0.398
SJYS    35      1          0.645  1      0.995  0.895
NQYS    35      2          0.311  0.703  0.695  0.526
NQLD    35      4          0.756  1      0.995  0.697
PTYS    35      1          0.903  1      0.995  0.129
PTHX    35      1          0.784  1      0.995  0.0995
PTZW    35      10         0.873  1      0.995  0.314
CRYS    35      12         0.789  0.5    0.613  0.235
STZZD   35      2          0.816  1      0.995  0.448
BY      35      1          0.624  1      0.995  0.895
YJQCJ   35      2          0.728  1      0.995  0.846
Results saved to runs\train\exp50
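For reference, the two reported wall-clock times can be turned into per-epoch figures with a short calculation. This is only a sketch using the numbers from the logs above (300 epochs, 0.839 h and 1.046 h); `seconds_per_epoch` is a hypothetical helper.

```python
EPOCHS = 300
HOURS_EXP49 = 0.839  # run 1: multi-scale and mixed precision off (as reported)
HOURS_EXP50 = 1.046  # run 2: multi-scale and mixed precision on (as reported)


def seconds_per_epoch(total_hours: float, epochs: int) -> float:
    """Average wall-clock seconds spent per epoch."""
    return total_hours * 3600 / epochs


slowdown = HOURS_EXP50 / HOURS_EXP49 - 1

print(f"exp49: {seconds_per_epoch(HOURS_EXP49, EPOCHS):.1f} s/epoch")
print(f"exp50: {seconds_per_epoch(HOURS_EXP50, EPOCHS):.1f} s/epoch")
print(f"exp50 is {slowdown:.1%} slower")
```

This works out to roughly 10.1 s versus 12.6 s per epoch, so the second run is about 25% slower overall.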

Apr 18 '25 09:04 Eliana-23

Here are my relevant parameter settings.

parser = argparse.ArgumentParser()
parser.add_argument('--weights', type=str, default=R"D:\YoloV5\yolov5-7.0\yolov5x.pt", help='initial weights path')
parser.add_argument('--cfg', type=str, default=ROOT / 'models/yolov5x.yaml', help='model.yaml path')
parser.add_argument('--data', type=str, default=r"D:\YoloV5\yolov5-7.0\data\gjh417.yaml", help='dataset.yaml path')
parser.add_argument('--hyp', type=str, default=ROOT / 'data/hyps/hyp.scratch-high.yaml', help='hyperparameters path')
parser.add_argument('--epochs', type=int, default=300, help='total training epochs')
parser.add_argument('--batch-size', type=int, default=8, help='total batch size for all GPUs, -1 for autobatch')
parser.add_argument('--imgsz', '--img', '--img-size', type=int, default=640, help='train, val image size (pixels)')
parser.add_argument('--rect', action='store_true', help='rectangular training')
parser.add_argument('--resume', nargs='?', const=True, default=False, help='resume most recent training')
parser.add_argument('--nosave', action='store_true', help='only save final checkpoint')
parser.add_argument('--noval', action='store_true', help='only validate final epoch')
parser.add_argument('--noautoanchor', action='store_true', help='disable AutoAnchor')
parser.add_argument('--noplots', action='store_true', help='save no plot files')
parser.add_argument('--evolve', type=int, nargs='?', const=300, help='evolve hyperparameters for x generations')
parser.add_argument('--bucket', type=str, default='', help='gsutil bucket')
parser.add_argument('--cache', type=str, nargs='?', default='ram', help='image --cache ram/disk')
parser.add_argument('--image-weights', action='store_true', help='use weighted image selection for training')
parser.add_argument('--device', default='0', help='cuda device, i.e. 0 or 0,1,2,3 or cpu')
parser.add_argument('--multi-scale', action='store_true', default=True, help='vary img-size +/- 50%%')
parser.add_argument('--single-cls', action='store_true', help='train multi-class data as single-class')
parser.add_argument('--optimizer', type=str, choices=['SGD', 'Adam', 'AdamW'], default='AdamW', help='optimizer')
parser.add_argument('--sync-bn', action='store_true', help='use SyncBatchNorm, only available in DDP mode')
parser.add_argument('--workers', type=int, default=4, help='max dataloader workers (per RANK in DDP mode)')
parser.add_argument('--project', default=ROOT / 'runs/train', help='save to project/name')
parser.add_argument('--name', default='exp', help='save to project/name')
parser.add_argument('--exist-ok', action='store_true', help='existing project/name ok, do not increment')
parser.add_argument('--quad', action='store_true', help='quad dataloader')
parser.add_argument('--cos-lr', action='store_true', help='cosine LR scheduler')
parser.add_argument('--label-smoothing', type=float, default=0.0, help='Label smoothing epsilon')
parser.add_argument('--patience', type=int, default=100, help='EarlyStopping patience (epochs without improvement)')
parser.add_argument('--freeze', nargs='+', type=int, default=[0], help='Freeze layers: backbone=10, first3=0 1 2')
parser.add_argument('--save-period', type=int, default=-1, help='Save checkpoint every x epochs (disabled if < 1)')
parser.add_argument('--seed', type=int, default=0, help='Global training seed')
parser.add_argument('--local_rank', type=int, default=-1, help='Automatic DDP Multi-GPU argument, do not modify')
parser.add_argument('--half', action='store_true', help='use mixed precision training (FP16)')

Apr 18 '25 09:04 Eliana-23

Hi @Eliana-23,

Looking at your results, I see the issue - multi-scale and mixed precision actually increased your training time rather than decreasing it. Here's why that might be happening:

Your batch size of 8 is too small for your RTX 3090. Try increasing it to at least 16 or 32 to better utilize your GPU.
Enable image caching with --cache ram (I see it in your arguments, but it is not clear whether it is actually being used).
Multi-scale training typically adds processing overhead, which can slow things down unless you're utilizing the GPU efficiently with larger batches.
Your validation set is small (35 images and 49 instances across the 13 classes in your table, several of them with a single instance), which is likely contributing to the variability in results.
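To see where the variability shows up, the per-class mAP50-95 values from the two validation tables can be compared directly. The sketch below only re-uses numbers posted in this thread; `map_changes` is a hypothetical helper name.

```python
# mAP50-95 per class, copied from the exp49 and exp50 validation tables.
MAP_EXP49 = {"YSA": 0.298, "MX": 0.297, "FS": 0.697, "SJYS": 0.895,
             "NQYS": 0.696, "NQLD": 0.732, "PTYS": 0.597, "PTHX": 0.597,
             "PTZW": 0.414, "CRYS": 0.305, "STZZD": 0.25, "BY": 0.895,
             "YJQCJ": 0.821}
MAP_EXP50 = {"YSA": 0.303, "MX": 0.127, "FS": 0.398, "SJYS": 0.895,
             "NQYS": 0.526, "NQLD": 0.697, "PTYS": 0.129, "PTHX": 0.0995,
             "PTZW": 0.314, "CRYS": 0.235, "STZZD": 0.448, "BY": 0.895,
             "YJQCJ": 0.846}
# Validation instances per class, from the same tables.
INSTANCES = {"YSA": 8, "MX": 4, "FS": 1, "SJYS": 1, "NQYS": 2, "NQLD": 4,
             "PTYS": 1, "PTHX": 1, "PTZW": 10, "CRYS": 12, "STZZD": 2,
             "BY": 1, "YJQCJ": 2}


def map_changes(before, after):
    """Return (class, delta) pairs sorted from largest drop to largest gain."""
    deltas = {name: after[name] - before[name] for name in before}
    return sorted(deltas.items(), key=lambda item: item[1])


for name, delta in map_changes(MAP_EXP49, MAP_EXP50):
    print(f"{name:<6} instances={INSTANCES[name]:>2}  delta={delta:+.3f}")
```

In these numbers the largest drops are on PTHX, PTYS and FS, each of which has a single validation instance, so one box can move the metric a lot. A larger validation set would make the comparison between runs more trustworthy.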

For better performance, I recommend:

Increase batch size to 16 or 32
Enable cache with --cache ram
Consider switching to YOLOv5l for training if YOLOv5x is not absolutely necessary
Try using --workers 8 to improve data loading speed
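As a rough illustration of how these recommendations combine, the sketch below assembles a train.py command line using only flags that appear in the argument list posted earlier in this thread. The dataset path data/defects.yaml and the helper `build_train_command` are placeholders of mine, and the values are the ones suggested here, not tuned settings.

```python
import shlex


def build_train_command(data, weights="yolov5l.pt", batch_size=32,
                        workers=8, cache="ram", imgsz=640, epochs=300):
    """Assemble a YOLOv5 train.py invocation as an argument list."""
    return [
        "python", "train.py",
        "--data", data,
        "--weights", weights,
        "--imgsz", str(imgsz),
        "--epochs", str(epochs),
        "--batch-size", str(batch_size),
        "--workers", str(workers),
        "--cache", cache,
    ]


# Placeholder dataset path; replace with your own dataset.yaml.
cmd = build_train_command("data/defects.yaml")
print(shlex.join(cmd))
```

Passing the values on the command line, rather than editing the defaults in train.py, makes it easier to see which settings each run actually used.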

Let me know if these adjustments help improve your training speed!

Apr 18 '25 12:04 pderrenger

I would like to ask: can I use the freeze option to freeze specific layers of my model that do not need further training, without affecting how well the model trains? Would this be effective in shortening my training time?

Apr 21 '25 09:04 Eliana-23

Hi @Eliana-23,

Yes, freezing specific layers can significantly reduce training time while maintaining detection performance, especially when fine-tuning a pre-trained model for your specific defect detection task.

For YOLOv5x, you can freeze the backbone with --freeze 10, which keeps the feature extraction layers (layers 0 to 9) fixed while allowing the detection head to adapt to your specific defects. This can cut training time because no gradients are computed or applied for the frozen parameters, although the forward pass still runs through those layers.

For best results with freezing:

Consider starting with --freeze 10 to freeze the backbone
You can also try --freeze 0 1 2 to freeze just the first few layers
If detection performance suffers, reduce the number of frozen layers
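To illustrate the difference between the two forms of the flag, here is a simplified sketch of how the --freeze values map to layer indices. It reflects my understanding of the YOLOv5 v7.0 train.py logic (a single value N freezes layers 0 to N-1, several values freeze exactly those layers); the helper names and the example parameter names are illustrative assumptions, and the prefix check is a simplification.

```python
def frozen_prefixes(freeze):
    """Map the --freeze argument to parameter-name prefixes.

    A single value N means "freeze layers 0..N-1";
    several values mean "freeze exactly these layer indices".
    """
    layers = freeze if len(freeze) > 1 else range(freeze[0])
    return [f"model.{i}." for i in layers]


def is_frozen(param_name, prefixes):
    """True if the parameter belongs to one of the frozen layers."""
    return any(param_name.startswith(prefix) for prefix in prefixes)


# Illustrative parameter names only; real names come from model.named_parameters().
example_params = ["model.0.conv.weight", "model.9.cv1.conv.weight",
                  "model.10.conv.weight", "model.24.m.0.weight"]

for freeze in ([10], [0, 1, 2]):
    prefixes = frozen_prefixes(freeze)
    frozen = [p for p in example_params if is_frozen(p, prefixes)]
    print(f"--freeze {' '.join(map(str, freeze))}: frozen {frozen}")
```

With the default of 0 nothing is frozen, since a single value of 0 selects an empty range of layers.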

Combined with your other optimizations (increased batch size to 16/32, --cache ram, --workers 8), freezing should provide a substantial speed improvement without significantly affecting detection quality for small defects.

If you try this approach, I'd be interested to hear your results!

Apr 21 '25 22:04 pderrenger

Attachment: 423.xlsx

Hello, the attached file contains the image scores predicted by the model trained with --freeze 10 and by the model trained without freezing. I find the scores with --freeze 10 unsatisfactory. What should I do? Should I reduce the number of frozen layers? Is there any other way to fix this?

Apr 28 '25 09:04 Eliana-23

Hi @Eliana-23,

I understand freezing 10 layers significantly reduced your model's performance for detecting small defects. This makes sense because small defects often require fine-grained feature extraction that the backbone needs to be trained for.

Instead of freezing the entire backbone, try these alternatives:

Freeze fewer layers - try --freeze 0 1 2 to only freeze the earliest layers while allowing most of the network to adapt to your defects
Focus on more effective optimizations that won't hurt accuracy:
- Increase batch size to 32 (this is likely your biggest potential speed gain)
- Ensure --cache ram is working properly
- Set --workers 8 for faster data loading
Consider model pruning for better speed/accuracy tradeoff - there are pruned YOLOv5 models that maintain accuracy while increasing speed
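One caveat on --workers: as far as I recall, the YOLOv5 dataloader caps the requested worker count by the number of CPU cores per GPU and by the batch size, so asking for 8 does not always yield 8. The sketch below mirrors that cap under this assumption; `effective_workers` is a hypothetical helper, not a YOLOv5 function.

```python
def effective_workers(requested, batch_size, cpu_count, num_gpus=1):
    """Worker count after capping by CPU cores per GPU and by batch size.

    Assumed to mirror the cap applied when YOLOv5 builds its dataloader.
    """
    cores_per_gpu = cpu_count // max(num_gpus, 1)
    batch_cap = batch_size if batch_size > 1 else 0
    return min(cores_per_gpu, batch_cap, requested)


# Example machines are hypothetical.
print(effective_workers(requested=8, batch_size=32, cpu_count=16))
print(effective_workers(requested=8, batch_size=8, cpu_count=4))
print(effective_workers(requested=8, batch_size=4, cpu_count=16))
```

If the machine has fewer cores than the requested workers, raising --workers further is unlikely to help, and data loading may remain the bottleneck.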

Let me know if you try any of these approaches and how they work for your specific defect detection task.

Apr 28 '25 18:04 pderrenger

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

For additional resources and information, please see the links below:

Docs: https://docs.ultralytics.com
HUB: https://hub.ultralytics.com
Community: https://community.ultralytics.com

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

Oct 10 '25 00:10 github-actions[bot]