
Pruning/Sparsity Tutorial

Open glenn-jocher opened this issue 5 years ago • 55 comments

📚 This guide explains how to apply pruning to YOLOv5 🚀 models. UPDATED 25 September 2022.

Before You Start

Clone repo and install requirements.txt in a Python>=3.7.0 environment, including PyTorch>=1.7. Models and datasets download automatically from the latest YOLOv5 release.

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Test Normally

Before pruning, we want to establish baseline performance to compare against. This command tests YOLOv5x on COCO val2017 at image size 640 pixels. yolov5x.pt is the largest and most accurate model available. Other options are yolov5s.pt, yolov5m.pt and yolov5l.pt, or your own checkpoint from training a custom dataset, i.e. ./weights/best.pt. For details on all available models please see our README table.

$ python val.py --weights yolov5x.pt --data coco.yaml --img 640 --half

Output:

val: data=/content/yolov5/data/coco.yaml, weights=['yolov5x.pt'], batch_size=32, imgsz=640, conf_thres=0.001, iou_thres=0.65, task=val, device=, workers=8, single_cls=False, augment=False, verbose=False, save_txt=False, save_hybrid=False, save_conf=False, save_json=True, project=runs/val, name=exp, exist_ok=False, half=True, dnn=False
YOLOv5 🚀 v6.0-224-g4c40933 torch 1.10.0+cu111 CUDA:0 (Tesla V100-SXM2-16GB, 16160MiB)

Fusing layers... 
Model Summary: 444 layers, 86705005 parameters, 0 gradients
val: Scanning '/content/datasets/coco/val2017.cache' images and labels... 4952 found, 48 missing, 0 empty, 0 corrupt: 100% 5000/5000 [00:00<?, ?it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100% 157/157 [01:12<00:00,  2.16it/s]
                 all       5000      36335      0.732      0.628      0.683      0.496
Speed: 0.1ms pre-process, 5.2ms inference, 1.7ms NMS per image at shape (32, 3, 640, 640)  # <--- base speed

Evaluating pycocotools mAP... saving runs/val/exp2/yolov5x_predictions.json...
...
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.507  # <--- base mAP
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.689
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.552
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.345
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.559
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.652
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.381
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.630
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.682
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.526
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.731
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.829
Results saved to runs/val/exp

Test YOLOv5x on COCO (0.30 sparsity)

We repeat the above test with a pruned model by using the torch_utils.prune() command. We update val.py to prune YOLOv5x to 0.3 sparsity:

[Screenshot: val.py modified to import and call prune(), giving 0.3 sparsity]
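The exact lines vary with the val.py version, but a minimal sketch of that change (assuming the prune() helper in utils/torch_utils.py, shown later in this thread) might look like:

from utils.torch_utils import prune  # prune() helper from utils/torch_utils.py

# ... after val.py has loaded the checkpoint into `model`, before validation runs ...
prune(model, 0.3)  # zero out 30% of nn.Conv2d weights (global L1 unstructured pruning)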

30% pruned output:

val: data=/content/yolov5/data/coco.yaml, weights=['yolov5x.pt'], batch_size=32, imgsz=640, conf_thres=0.001, iou_thres=0.65, task=val, device=, workers=8, single_cls=False, augment=False, verbose=False, save_txt=False, save_hybrid=False, save_conf=False, save_json=True, project=runs/val, name=exp, exist_ok=False, half=True, dnn=False
YOLOv5 🚀 v6.0-224-g4c40933 torch 1.10.0+cu111 CUDA:0 (Tesla V100-SXM2-16GB, 16160MiB)

Fusing layers... 
Model Summary: 444 layers, 86705005 parameters, 0 gradients
Pruning model...  0.3 global sparsity
val: Scanning '/content/datasets/coco/val2017.cache' images and labels... 4952 found, 48 missing, 0 empty, 0 corrupt: 100% 5000/5000 [00:00<?, ?it/s]
               Class     Images     Labels          P          R     mAP@.5 mAP@.5:.95: 100% 157/157 [01:11<00:00,  2.19it/s]
                 all       5000      36335      0.724      0.614      0.671      0.478
Speed: 0.1ms pre-process, 5.2ms inference, 1.7ms NMS per image at shape (32, 3, 640, 640)  # <--- prune speed

Evaluating pycocotools mAP... saving runs/val/exp3/yolov5x_predictions.json...
...
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.489  # <--- prune mAP
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.677
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.537
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.334
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.542
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.635
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.370
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.612
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.664
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.496
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.722
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.803
Results saved to runs/val/exp3

In the results we can observe that we have achieved a sparsity of 30% in our model after pruning, which means that 30% of the model's weight parameters in nn.Conv2d layers are equal to 0. Inference time is essentially unchanged, while the model's AP and AR scores are slightly reduced.
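To double-check the reported sparsity on your own model, a small sketch like this (assuming the pruned model is loaded as `model`) counts the zeroed nn.Conv2d weights:

import torch.nn as nn

zeros, total = 0, 0
for m in model.modules():
    if isinstance(m, nn.Conv2d):
        zeros += (m.weight == 0).sum().item()  # weights set exactly to zero by pruning
        total += m.weight.nelement()
print('Global Conv2d sparsity: {:.1f}%'.format(100 * zeros / total))  # ~30% after pruning above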

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

YOLOv5 CI

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on macOS, Windows, and Ubuntu every 24 hours and on every commit.

glenn-jocher avatar Jul 05 '20 20:07 glenn-jocher

@glenn-jocher why doesn't the speed change at all after pruning? Does it only zero out the conv weights without actually changing the structure? How do we save the pruned model and its architecture for retraining?

lucasjinreal avatar Jul 06 '20 06:07 lucasjinreal

Is there a guideline on how much we should prune by? What are the benefits of doing this?

NanoCode012 avatar Jul 06 '20 06:07 NanoCode012

@jinfagang yes, structure is not changed at all, and parameter count is the same, it's just that some of the weights are 0 instead of near zero as they were before.

I suppose this would allow for effective k-means quantization to lower bits (for smaller file sizes), but I'm not sure about any possible speed improvement. I think as long as the parameter count remains the same, the speed will remain the same.

@NanoCode012 no guidelines really, it's just an experiment to see how many of the weights you can remove and what effect that has on performance. Honestly I don't really see any great applications at the moment based on my results above, but it's there in case anyone would like to explore it further.

glenn-jocher avatar Jul 06 '20 19:07 glenn-jocher

@glenn-jocher Looks like prune has a remove method which can remove weights:

prune.remove(module, 'weight')

and all weights and params are saved in module.state_dict, which can be used for the new pruned model.

lucasjinreal avatar Jul 07 '20 02:07 lucasjinreal

@jinfagang yes, this .remove() method is deleting the original weights, as there is also a pruned copy in the model. So before applying remove the model/module will have 2X the normal parameters; after using it, it is back to its normal parameter count.

You have to consider the shapes of the operations in the forward pass. For a convolution from, say, shape(1,128,20,20) to shape(1,256,20,20) with a 1x1 kernel you must have a weight matrix of shape 128x256. It's not possible to remove elements from a normal matrix or tensor, as it will always need 128*256 weights inside it.
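A quick shape check for that example (a 1x1 convolution taking 128 channels to 256 channels) shows the weight tensor keeps all 128*256 entries regardless of pruning:

import torch.nn as nn

conv = nn.Conv2d(128, 256, kernel_size=1)
print(conv.weight.shape)  # torch.Size([256, 128, 1, 1]) -> 128*256 weights, pruned or not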

There are special cases of sparse matrices in some packages/languages; it may be that PyTorch is converting the original tensor to a sparse tensor with the same shape, though I'm not sure if this is the case. Even if it were, any exported models (i.e. ONNX, CoreML, TensorRT) using these sparse matrices would need special support for them, or they would be handled as normal matrices.

glenn-jocher avatar Jul 07 '20 02:07 glenn-jocher

The current pruning method incorporates the line of code you mention already as well: https://github.com/ultralytics/yolov5/blob/121d90b3f2ffe085176ca3a21bbbc87260667655/utils/torch_utils.py#L88-L97

glenn-jocher avatar Jul 07 '20 02:07 glenn-jocher

@glenn-jocher Nice. do u figure out how to obtain the pruned model architecture?

lucasjinreal avatar Jul 07 '20 02:07 lucasjinreal

@jinfagang well that's what I was saying, the architecture does not change. In my example above, the 128x256 convolution weights are still 128x256 weights, it's just that some of their values that were previously near-zero have been set equal to zero during pruning. The 128x256 matrix may or may not then be stored as a sparse matrix, which is a special type of matrix intended for use with data that contains mostly zeros, and saves memory (and may or may not also save processing time).

TLDR the architecture is exactly the same when pruning, no layers are removed as far as I know, and the input and output shapes (and shapes of all intermediate layers) remain the same.

glenn-jocher avatar Jul 07 '20 03:07 glenn-jocher

@glenn-jocher so the simplified model cannot get its new channel number and shape automatically; is there any way to make that happen?

lucasjinreal avatar Jul 07 '20 03:07 lucasjinreal

@glenn-jocher First, nice work! Let me ask: which paper or project is your pruning based on?

Lornatang avatar Jul 08 '20 04:07 Lornatang

@Lornatang I based this pruning implementation off the official PyTorch pruning tutorial at the link below, but the idea to apply pruning here originally came from @jinfagang. I don't actually have any experience pruning models. https://pytorch.org/tutorials/intermediate/pruning_tutorial.html

@jinfagang I modified detect.py to prune and save, and print updated model info:

    # requires: import torch, from models.experimental import attempt_load, from utils import torch_utils
    # Load model
    model = attempt_load(weights, map_location=device)  # load FP32 model
    torch_utils.model_info(model)
    torch.save({'model': model}, 'model_normal.pt')  # save unpruned model

    torch_utils.prune(model, 0.3)  # prune to 30% global sparsity
    torch_utils.model_info(model)
    torch.save({'model': model}, 'model_pruned.pt')  # save pruned model

Output:

Model Summary: 140 layers, 7.45958e+06 parameters, 7.45958e+06 gradients, 17.5 GFLOPS
Pruning model...  0.299 global sparsity
Model Summary: 140 layers, 7.45958e+06 parameters, 7.45958e+06 gradients, 17.5 GFLOPS

Model sizes are here (for both yolov5s in FP32): [screenshot: model_normal.pt and model_pruned.pt file sizes]
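Once saved this way, the pruned checkpoint loads like any other; a minimal sketch assuming the {'model': model} save format above:

import torch

ckpt = torch.load('model_pruned.pt', map_location='cpu')
model = ckpt['model'].float().eval()  # pruned weights are plain zeros, no special handling needed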

glenn-jocher avatar Jul 08 '20 04:07 glenn-jocher

So maybe layer pruning or channel-level sparsity works better since it changes the architecture of the network? I have seen a project like this: https://github.com/tanluren/yolov3-channel-and-layer-pruning

HenryWang628 avatar Jul 12 '20 08:07 HenryWang628

@HenryWang628 I see, thanks for the link. The tensorboard histograms are very nice. So it seems a more useful method would be: channel prune (mAP drops) > finetune for x epochs > recover some of the lost mAP.

This all raises the question though, if you are going to go through all of this effort on a large model like YOLOv5x, why not just train a smaller model like YOLOv5s? The training time will be much faster, and you don't need the extra pruning and finetuning steps.

glenn-jocher avatar Jul 12 '20 16:07 glenn-jocher

For anyone interested, there is a detailed discussion on this here https://github.com/pytorch/tutorials/issues/1054#issuecomment-657991827

The author there says this:

I'm not familiar with your architecture, so you'll have to decide which parameters it makes sense to pool together and compare via global magnitude-based pruning; but let's assume, just for the sake of this simple example, that you only want to consider the convolutional layers identified by the logic of my if-statement below [if those aren't the weights you care about, please feel free to modify that logic as you wish].

Now, those layers happen to come with two parameters: "weight" and "bias". Let's say you are interested in the weights [if you care about the biases too, feel free to add them in as well in the parameters_to_prune]. Alright, how do we tell global_unstructured to prune those weights in a global manner? We do so by constructing parameters_to_prune as requested by that function [again, see docs and tutorial linked above].

import torch
import torch.nn.utils.prune as prune  # needed for the calls below

parameters_to_prune = [
    (v, "weight")
    for k, v in dict(model.named_modules()).items()
    if ((len(list(v.children())) == 0) and (k.endswith('conv')))  # leaf modules whose name ends in 'conv'
]

# now you can use global_unstructured pruning
prune.global_unstructured(parameters_to_prune, pruning_method=prune.L1Unstructured, amount=0.3)

To check that that succeeded, you can now look at the global sparsity across those layers, which should be 30%, as well as the individual per-layer sparsity:

# global sparsity
nparams = 0
pruned = 0
for k, v in dict(model.named_modules()).items():
    if ((len(list(v.children())) == 0) and (k.endswith('conv'))):
        nparams += v.weight.nelement()
        pruned += torch.sum(v.weight == 0)
print('Global sparsity across the pruned layers: {:.2f}%'.format( 100. * pruned / float(nparams)))
# ^^ should be 30%

# local sparsity
for k, v in dict(model.named_modules()).items():
    if ((len(list(v.children())) == 0) and (k.endswith('conv'))):
        print(
            "Sparsity in {}: {:.2f}%".format(
                k,
                100. * float(torch.sum(v.weight == 0))
                / float(v.weight.nelement())
            )
        )
# ^^ will be different for each layer

Originally posted by @mickypaganini in https://github.com/pytorch/tutorials/issues/1054#issuecomment-657991827

glenn-jocher avatar Jul 14 '20 19:07 glenn-jocher

More info from https://github.com/pytorch/tutorials/pull/605#issuecomment-585994076

Hi @cranmer, Hopefully this tutorial will be included soon (cc: @soumith).

As is, this module is not intended (by itself) to help you with memory savings. All that pruning does is to replace some entries with zeroes. This itself doesn't buy you anything, unless you represent the sparse tensor in a smarter way (which this module itself doesn't handle for you). You can, however, rely on torch.sparse and other functionalities there to help you with that. To give you a concrete example:

import torch
import torch.nn.utils.prune as prune

t = torch.randn(100, 100)
torch.save(t, 'full.pth')

p = prune.L1Unstructured(amount=0.9)
pruned = p.prune(t)
torch.save(pruned, 'pruned.pth')

sparsified = pruned.to_sparse()
torch.save(sparsified, 'sparsified.pth')

When I ls, these are the sizes on disk:

21K sparsified.pth
40K pruned.pth
40K full.pth

By the way, before calling prune.remove, you can expect your memory footprint to be a lot higher than what you started out with, because for each pruned parameter you now have: the original parameter, the mask, and the pruned version of the tensor. Calling prune.remove brings you back to only having a single (now pruned) tensor per pruned parameter. Still, if you don't represent these pruned parameters smartly, the memory footprint at this point won't be any lower than what you started out with.

Originally posted by @mickypaganini in https://github.com/pytorch/tutorials/pull/605#issuecomment-585994076
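That bookkeeping is easy to see on a single layer; a small standalone sketch (not YOLOv5-specific):

import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(16, 32, 3)
prune.l1_unstructured(conv, name='weight', amount=0.3)
print(sorted(n for n, _ in conv.named_parameters()))  # ['bias', 'weight_orig']
print(sorted(n for n, _ in conv.named_buffers()))     # ['weight_mask']

prune.remove(conv, 'weight')                          # fold the mask into a single pruned 'weight'
print(sorted(n for n, _ in conv.named_parameters()))  # ['bias', 'weight']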

glenn-jocher avatar Jul 14 '20 23:07 glenn-jocher

@glenn-jocher I think you can refer to https://github.com/vainf/torch-pruning, he has implemented this function in detail.

Lornatang avatar Jul 29 '20 09:07 Lornatang

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Sep 04 '20 00:09 github-actions[bot]

Hi, thank you everyone for the informative comments. Thanks Glenn for this super-cool library. Not sure if there is a way to implement a line like "sparsified = pruned.to_sparse()" (https://github.com/pytorch/tutorials/pull/605#issuecomment-585994076) for nn.Conv2d?

I am trying to reduce the overall model weights. Eventually, I want to port this to a Jetson Nano. My understanding is that a smaller model yields faster speeds. Please correct me if my understanding is wrong. Thanks.
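One way to try the to_sparse() idea on Conv2d weights is at the state_dict level; a rough sketch (an assumption, not something built into YOLOv5), keeping in mind the tensors must be converted back to dense before normal inference:

import torch

sd = model.state_dict()
sparse_sd = {k: v.to_sparse() if (v.dim() == 4 and k.endswith('.weight')) else v for k, v in sd.items()}
torch.save(sparse_sd, 'sparse_state_dict.pt')  # smaller on disk if the weights are mostly zeros

# to restore a dense copy for inference:
dense_sd = {k: v.to_dense() if v.is_sparse else v for k, v in torch.load('sparse_state_dict.pt').items()}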

shoebNTU avatar Sep 17 '20 03:09 shoebNTU

@shoebNTU any speed benefits would depend on the capability of your hardware and drivers to exploit sparse matrices, so there is no single answer to your question.

glenn-jocher avatar Sep 18 '20 00:09 glenn-jocher

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Nov 16 '20 00:11 github-actions[bot]

I tried to add torch_utils.prune(model, 0.3) to test.py and ran the command.

It gives me this error:

NameError: name 'torch_utils' is not defined

joel5638 avatar May 15 '21 11:05 joel5638

@joel5638 torch_utils.py is a file in the utils directory. You can import it by running the code below. I've updated the tutorial above also showing the import now.

from utils import torch_utils
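With the import in place, the usage inside test.py mirrors the detect.py snippet earlier in this thread (a minimal sketch):

model = attempt_load(weights, map_location=device)  # existing model-loading line in test.py
torch_utils.prune(model, 0.3)  # prune to 30% global sparsity before evaluation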

glenn-jocher avatar May 16 '21 13:05 glenn-jocher

I just proposed this change, which allows for structured (kernel) pruning and thus changes the network's architecture. link
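For reference, PyTorch's built-in structured pruning can also zero whole output channels; a generic sketch (not the linked change itself):

import torch.nn as nn
import torch.nn.utils.prune as prune

for m in model.modules():
    if isinstance(m, nn.Conv2d):
        prune.ln_structured(m, name='weight', amount=0.3, n=2, dim=0)  # zero 30% of output channels by L2 norm
        prune.remove(m, 'weight')  # make the pruning permanent

Note this only zeroes the channels; physically removing them and shrinking the tensors requires rebuilding the layers, which is what a structured change to the architecture would do.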

Roulbac avatar Oct 05 '21 13:10 Roulbac

@glenn-jocher why doesn't the speed change at all after pruning? Does it only zero out the conv weights without actually changing the structure? How do we save the pruned model and its architecture for retraining?

I also want to know why the time does not change. Can you explain the pruning in more depth?

lzh1998-jansen avatar Dec 15 '21 11:12 lzh1998-jansen

@lzh1998-lzh this particular pruning does not remove any layers, it only sets some values to zero.

glenn-jocher avatar Dec 15 '21 12:12 glenn-jocher

I do not know.

------------------ Original message ------------------ From: "ultralytics/yolov5"; Sent: December 31, 2021 (Friday) 2:34 PM; Subject: Re: [ultralytics/yolov5] Pruning/Sparsity Tutorial (#304)

Can it be faster or better?

lzh1998-jansen avatar Dec 31 '21 06:12 lzh1998-jansen

Hi, I'm using YOLOv3-tiny to train a custom model, but I need to reduce inference time. Can I use the model pruning function for a v3 model? Since it is in the v5 repo, I am not sure whether it is useful for my case.

GulerEnes avatar Feb 26 '22 13:02 GulerEnes

@GulerEnes 👋 Hello! Thanks for asking about inference speed issues. Pruning is not recommended for speed improvements using this tutorial. YOLOv5 🚀 can be run on CPU (i.e. --device cpu, slow) or GPU if available (i.e. --device 0, faster). You can determine your inference device by viewing the YOLOv5 console output:

detect.py inference

python detect.py --weights yolov5s.pt --img 640 --conf 0.25 --source data/images/

YOLOv5 PyTorch Hub inference

import torch

# Model
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')

# Images
dir = 'https://ultralytics.com/images/'
imgs = [dir + f for f in ('zidane.jpg', 'bus.jpg')]  # batch of images

# Inference
results = model(imgs)
results.print()  # or .show(), .save()
# Speed: 631.5ms pre-process, 19.2ms inference, 1.6ms NMS per image at shape (2, 3, 640, 640)

Increase Speeds

If you would like to increase your inference speed, some options are:

  • Use batched inference with YOLOv5 PyTorch Hub
  • Reduce --img-size, i.e. 1280 -> 640 -> 320
  • Reduce model size, i.e. YOLOv5x -> YOLOv5l -> YOLOv5m -> YOLOv5s -> YOLOv5n
  • Use half precision FP16 inference with python detect.py --half and python val.py --half
  • Use a faster GPU, i.e.: P100 -> V100 -> A100
  • Export to ONNX or OpenVINO for up to 3x CPU speedup (CPU Benchmarks)
  • Export to TensorRT for up to 5x GPU speedup (see the example export commands after this list)
  • Use free GPU backends with up to 16GB of CUDA memory: Open In Colab, Open In Kaggle
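A rough example of the export commands referenced above (exact flags may vary by YOLOv5 version):

python export.py --weights yolov5s.pt --include onnx openvino             # CPU-oriented formats
python export.py --weights yolov5s.pt --include engine --device 0 --half  # TensorRT, requires a GPU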

Good luck ๐Ÿ€ and let us know if you have any other questions!

glenn-jocher avatar Feb 26 '22 13:02 glenn-jocher

@jinfagang yes, structure is not changed at all, and parameter count is the same, it's just that some of the weights are 0 instead of near zero as they were before.

I suppose this would allow for effective k-means quantization to lower bits (for smaller file sizes), but I'm not sure about any possible speed improvement. I think as long as the parameter count remains the same, the speed will remain the same.

@NanoCode012 no guidelines really, it's just an experiment to see how many of the weights you can remove and what effect that has on performance. Honestly I don't really see any great applications at the moment based on my results above, but it's there in case anyone would like to explore it further.

There is little real application value unless further steps can be done on top of pruning that also reduce the model weights and speed up inference, which seems to be the job of sparsify() and hardware specifically designed for sparse acceleration. @shoebNTU @jinfagang @glenn-jocher I liked the comments asking why not just train a smaller model directly. In some cases, though, we might want to simplify the workflow: download a slightly larger and better model and apply a simple method like prune() or sparsify(), or just pass an argument like --prune 0.3 or --sparsify 0.3, so that we can run the model on edge devices directly. @glenn-jocher do you see value in adding this kind of argument instead of having to modify the val.py script ourselves? I understand that the architecture and weights won't change in memory, which is unstructured pruning. (Update Y2022M06D22Wed)
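A hypothetical sketch of what such a flag could look like in val.py (the --prune name and placement are assumptions, not an existing option):

# hypothetical addition to val.py's argument parser
parser.add_argument('--prune', type=float, default=0.0, help='global sparsity to apply before validation (0 = off)')

# ... later, after the model has been loaded ...
if opt.prune > 0:
    from utils.torch_utils import prune
    prune(model, opt.prune)  # e.g. --prune 0.3 zeroes out 30% of Conv2d weights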

bryanbocao avatar Jun 21 '22 20:06 bryanbocao

@bryanbo-cao yes good comments. There may be smarter ways to implement pruning in trained models as this tutorial is a bit out of date by now. If you have any better methods using prune() or sparsify() please let us know as we have not been focusing efforts on pruning/sparsity recently.

glenn-jocher avatar Jun 22 '22 09:06 glenn-jocher