Performance Degradation in YOLOv8s Model Exported to ONNX via SparseML's Exporter
Describe the bug
When exporting the YOLOv8s model (pruned50-quant, model.pt from SparseZoo) via the ONNX exporter (sparseml.ultralytics.export_onnx), its performance noticeably decreases compared to the ONNX model available in SparseZoo.
Expected behavior
Performance of the two ONNX files should be the same, since it is the same model.
Environment
Include all relevant environment information:
- OS: Ubuntu 22.04
- Python version: 3.9.19
- SparseML version or commit hash: sparseml==1.7.0
- ML framework version(s): torch==2.1.2
- Other Python package versions: deepsparse==1.7.1, sparsezoo==1.7.0, ultralytics==8.0.124
- Other relevant environment information: CPU: i9-12900KS
To Reproduce
Exact steps to reproduce the behavior:
Download model.onnx for yolov8s-pruned50-quant from SparseZoo (https://sparsezoo.neuralmagic.com/models/yolov8-s-coco-pruned50_quantized). Benchmark it using deepsparse.benchmark:
> deepsparse.benchmark yolov8s-coco-pruned50_quantized.onnx
2024-05-10 13:56:31 deepsparse.benchmark.helpers INFO Thread pinning to cores enabled
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.7.1 COMMUNITY | (3904e8ec) (release) (optimized) (system=avx2_vnni, binary=avx2)
2024-05-10 13:56:31 deepsparse.benchmark.benchmark_model INFO deepsparse.engine.Engine:
onnx_file_path: yolov8s-coco-pruned50_quantized.onnx
batch_size: 1
num_cores: 8
num_streams: 1
scheduler: Scheduler.default
fraction_of_supported_ops: 1.0
cpu_avx_type: avx2
cpu_vnni: True
2024-05-10 13:56:31 deepsparse.utils.onnx INFO Generating input 'images', type = uint8, shape = [1, 3, 640, 640]
2024-05-10 13:56:31 deepsparse.benchmark.benchmark_model INFO Starting 'singlestream' performance measurements for 10 seconds
Original Model Path: yolov8s-coco-pruned50_quantized.onnx
Batch Size: 1
Scenario: sync
Throughput (items/sec): 87.1154
Latency Mean (ms/batch): 11.4735
Latency Median (ms/batch): 11.4148
Latency Std (ms/batch): 0.2300
Iterations: 872
Notice fraction_of_supported_ops: 1.0 and Throughput (items/sec): 87.1154.
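(For completeness, the download and benchmark can also be scripted. A minimal sketch, not part of the original report — the SparseZoo stub is guessed from the model page URL and the onnx_model.path attribute is assumed for sparsezoo 1.7; verify both:)

```python
import time

import numpy as np
from deepsparse import compile_model
from sparsezoo import Model

# Stub guessed from the model page URL -- verify it on SparseZoo.
stub = "zoo:cv/detection/yolov8-s/pytorch/ultralytics/coco/pruned50_quantized-none"
onnx_path = Model(stub).onnx_model.path  # downloads model.onnx locally

engine = compile_model(onnx_path, batch_size=1)
dummy = [np.zeros((1, 3, 640, 640), dtype=np.uint8)]  # matches the benchmark input
engine.run(dummy)  # warmup

iters = 200
start = time.perf_counter()
for _ in range(iters):
    engine.run(dummy)
print(f"~{iters / (time.perf_counter() - start):.1f} items/sec")
```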
Now download model.pt from the same page and export it to ONNX using the provided tool:
> sparseml.ultralytics.export_onnx --model yolov8s-coco-pruned50_quantized.pt
from n params module arguments
0 -1 1 928 ultralytics.nn.modules.conv.Conv [3, 32, 3, 2]
1 -1 1 18560 ultralytics.nn.modules.conv.Conv [32, 64, 3, 2]
2 -1 1 29056 ultralytics.nn.modules.block.C2f [64, 64, 1, True]
3 -1 1 73984 ultralytics.nn.modules.conv.Conv [64, 128, 3, 2]
4 -1 2 197632 ultralytics.nn.modules.block.C2f [128, 128, 2, True]
5 -1 1 295424 ultralytics.nn.modules.conv.Conv [128, 256, 3, 2]
6 -1 2 788480 ultralytics.nn.modules.block.C2f [256, 256, 2, True]
7 -1 1 1180672 ultralytics.nn.modules.conv.Conv [256, 512, 3, 2]
8 -1 1 1838080 ultralytics.nn.modules.block.C2f [512, 512, 1, True]
9 -1 1 656896 ultralytics.nn.modules.block.SPPF [512, 512, 5]
10 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
11 [-1, 6] 1 0 ultralytics.nn.modules.conv.Concat [1]
12 -1 1 591360 ultralytics.nn.modules.block.C2f [768, 256, 1]
13 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
14 [-1, 4] 1 0 ultralytics.nn.modules.conv.Concat [1]
15 -1 1 148224 ultralytics.nn.modules.block.C2f [384, 128, 1]
16 -1 1 147712 ultralytics.nn.modules.conv.Conv [128, 128, 3, 2]
17 [-1, 12] 1 0 ultralytics.nn.modules.conv.Concat [1]
18 -1 1 493056 ultralytics.nn.modules.block.C2f [384, 256, 1]
19 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
20 [-1, 9] 1 0 ultralytics.nn.modules.conv.Concat [1]
21 -1 1 1969152 ultralytics.nn.modules.block.C2f [768, 512, 1]
22 [15, 18, 21] 1 2147008 ultralytics.nn.modules.head.Detect [80, [128, 256, 512]]
Model summary: 225 layers, 11166560 parameters, 11166544 gradients
Applying structure from sparseml checkpoint at epoch -1
2024-05-10 13:58:11 sparseml.pytorch.utils.logger INFO Logging all SparseML modifier-level logs to sparse_logs/10-05-2024_13.58.11.log
Loaded previous weights from checkpoint
Source: 'sparseml' detected; Exporting model from SparseML checkpoint...
/home/user/anaconda3/envs/sparse_issue_env/lib/python3.9/site-packages/torch/onnx/utils.py:823: UserWarning: It is recommended that constant folding be turned off ('do_constant_folding=False') when exporting the model in training-amenable mode, i.e. with 'training=TrainingMode.TRAIN' or 'training=TrainingMode.PRESERVE' (when model is in training mode). Otherwise, some learnable model parameters may not translate correctly in the exported ONNX model because constant folding mutates model parameters. Please consider turning off constant folding or setting the training=TrainingMode.EVAL.
warnings.warn(
/home/user/anaconda3/envs/sparse_issue_env/lib/python3.9/site-packages/ultralytics/nn/modules/head.py:50: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
elif self.dynamic or self.shape != shape:
2024-05-10 13:58:15 sparseml.exporters.transforms.onnx_transform INFO [ConstantsToInitializers] Transformed 92 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [FoldIdentityInitializers] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [InitializersToUint8] Transformed 54 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [FlattenQParams] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [FoldConvDivBn] Transformed 57 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [DeleteRepeatedQdq] Transformed 2 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [QuantizeQATEmbedding] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [PropagateEmbeddingQuantization] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [PropagateDequantThroughSplit] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [MatMulAddToMatMulIntegerAddCastMul] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [MatMulToMatMulIntegerCastMul] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [FoldReLUQuants] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [ConvToConvIntegerAddCastMul] Transformed 55 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [GemmToQLinearMatMul] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [GemmToMatMulIntegerAddCastMul] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [QuantizeResiduals] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [RemoveDuplicateQConvWeights] Transformed 0 matches
2024-05-10 13:58:17 sparseml.exporters.transforms.onnx_transform INFO [RemoveDuplicateQuantizeOps] Transformed 0 matches
2024-05-10 13:58:17 sparseml.pytorch.sparsification.quantization.quantize_qat_export INFO Model initial QuantizeLinear node(s) deleted and inputs set to uint8
2024-05-10 13:58:17 sparseml.pytorch.utils.exporter INFO Created deployment folder at /home/user/Desktop/projects/sparse/issue/exported/deployment
2024-05-10 13:58:17 sparseml.pytorch.utils.exporter INFO Saved model.onnx in the deployment folder at /home/user/Desktop/projects/sparse/issue/exported/deployment/model.onnx
2024-05-10 13:58:17 sparseml.pytorch.utils.exporter INFO Created config.json file at /home/user/Desktop/projects/sparse/issue/exported/deployment
Recipe checkpoint detected, saving the recipe to the deployment directory /home/user/Desktop/projects/sparse/issue/exported/deployment
The conversion succeeds. Now benchmark the exported ONNX model:
> deepsparse.benchmark exported/model.onnx
2024-05-10 13:59:27 deepsparse.benchmark.helpers INFO Thread pinning to cores enabled
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.7.1 COMMUNITY | (3904e8ec) (release) (optimized) (system=avx2_vnni, binary=avx2)
2024-05-10 13:59:27 deepsparse.benchmark.benchmark_model INFO deepsparse.engine.Engine:
onnx_file_path: exported/model.onnx
batch_size: 1
num_cores: 8
num_streams: 1
scheduler: Scheduler.default
fraction_of_supported_ops: 0.0
cpu_avx_type: avx2
cpu_vnni: True
2024-05-10 13:59:27 deepsparse.utils.onnx INFO Generating input 'images', type = uint8, shape = [1, 3, 640, 640]
2024-05-10 13:59:27 deepsparse.benchmark.benchmark_model INFO Starting 'singlestream' performance measurements for 10 seconds
Original Model Path: exported/model.onnx
Batch Size: 1
Scenario: sync
Throughput (items/sec): 20.2886
Latency Mean (ms/batch): 49.2855
Latency Median (ms/batch): 49.0293
Latency Std (ms/batch): 2.1290
Iterations: 203
Notice fraction_of_supported_ops: 0.0 and Throughput (items/sec): 20.2886.
Throughput decreased from ~87 down to ~20 items/sec for the same model.
Model exported: https://drive.google.com/file/d/1ZDlRd6c1X05lrnxRThUo8FxuapS5Kgm7/view?usp=sharing
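(One quick way to see why the engine reports fraction_of_supported_ops: 0.0 — a sketch of mine, not part of the original report — is to diff the op-type histograms of the two files; the paths are the ones used above:)

```python
from collections import Counter

import onnx

for path in ("yolov8s-coco-pruned50_quantized.onnx", "exported/model.onnx"):
    graph = onnx.load(path).graph
    counts = Counter(node.op_type for node in graph.node)
    print(path)
    # In a well-folded quantized export the convs should appear as ConvInteger.
    for op in ("Conv", "ConvInteger", "QuantizeLinear", "DequantizeLinear", "Split", "Slice"):
        print(f"  {op}: {counts.get(op, 0)}")
```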
You can see that this style of Conv is not being folded to a ConvInteger correctly - @bfineran
@mgoin we'll need to take a look at the recipe and its application - ConvInteger requires two quantized inputs (weight and activation) to the Conv; here we see only a quantized weight input, with the output being quantized (although that may be the input quantization of another layer)
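(To make that pattern concrete, a small sketch — mine, not from the thread — that scans the exported graph for Conv nodes whose weight input is produced by a DequantizeLinear, i.e. a QDQ pair that was never folded into a ConvInteger; the file path is the one from the report:)

```python
import onnx

model = onnx.load("exported/model.onnx")
# Map each tensor name to the node that produces it.
produced_by = {out: node for node in model.graph.node for out in node.output}

for node in model.graph.node:
    if node.op_type != "Conv":
        continue
    # For ONNX Conv, input[0] is the activation and input[1] the weight.
    weight_src = produced_by.get(node.input[1])
    if weight_src is not None and weight_src.op_type == "DequantizeLinear":
        print(f"{node.name}: weight is dequantized but the Conv was not folded")
```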
@bfineran Thank you for the great work :)
Wanted to let you know that I am experiencing exactly the same performance degradation as @rsazizov, on YOLOv8n: from Throughput (items/sec): 110.0278 with the SparseZoo yolov8n ONNX down to Throughput (items/sec): 15.5770 after converting the SparseZoo yolov8n .pt model with the SparseML ONNX exporter. Is there any known bug or update on the issue?
Hi @imAhmadAsghar, we're aware of the issue and are looking into it internally - it doesn't seem to be a version compatibility issue, but you could potentially try rolling back your sparseml/pytorch versions. The issue seems to be that the model now exports differently at the beginning (a simple Split node is now a few Slice nodes).
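(A quick way to check that claim — a sketch, not from the thread: print the op types at the head of each graph and see whether the stem shows one Split or several Slice nodes; paths from the report:)

```python
import onnx

for path in ("yolov8s-coco-pruned50_quantized.onnx", "exported/model.onnx"):
    graph = onnx.load(path).graph
    # The first ~15 nodes cover the stem where the Split/Slice difference appears.
    head = [node.op_type for node in graph.node[:15]]
    print(path, "->", head)
```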
@bfineran Thank you for your response.
I actually did not get the last part of your response, "The issue seems to be that the model now exports differently at the beginning (a simple Split node is now a few Slice nodes)." Can you please explain what you mean by that in detail, if possible? I am not a performance/optimization engineer; I just want to use sparseml/deepsparse to speed up inference on CPU. However, the whole library is inconvenient and super foggy.
I have tested the following:
- I exported the base YOLOv8 model (without any recipes) to ONNX via SparseML.
- I exported the pruned YOLOv8 model (trained with the pruning recipe provided on SparseZoo) to ONNX via SparseML.
- I exported the pruned and quantized YOLOv8 model (trained with the recipe provided on SparseZoo) to ONNX via SparseML.
And here are the results:
Performance test between pruned and default model:
As you can see in the above plot, pruning does nothing.
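(For context: unstructured pruning only pays off in a sparsity-aware runtime such as DeepSparse; a generic dense path runs the same kernels either way, so identical speed is what you would expect if the export is broken as above. A quick sanity check — a sketch, with a hypothetical file name — is to measure how sparse the exported weights actually are:)

```python
import onnx
from onnx import numpy_helper

# "exported_pruned/model.onnx" is a hypothetical path to the pruned FP32 export.
model = onnx.load("exported_pruned/model.onnx")
for init in model.graph.initializer:
    weights = numpy_helper.to_array(init)
    if weights.ndim == 4:  # conv kernels
        sparsity = 1.0 - (weights != 0).mean()
        print(f"{init.name}: {sparsity:.0%} zero")
```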
Performance test between pruned vs pruned and quantized model:
I just don't get this plot; nothing makes sense at all. Quantization does not work, and the model gets slower by a large margin.
Right now I am super confused, and it does not make sense to use your library at all. I think I am missing a lot of information about the whole process. Can you please point me to a proper reference on where to start? The one provided on the homepage is not leading me anywhere, as you can see from the results.
I would really love to get it running and achieve the results you promised.
@imAhmadAsghar Hi, did you find a fix for this? What is going wrong with the exports?
@yoloyash Hi, no I could not, unfortunately.
@rsazizov @imAhmadAsghar Hi, I had exactly the same problem. I was training yolov8-n-coco-pruned49_quantized from the official SparseZoo and exported it with sparseml.ultralytics.export_onnx; when benchmarking, it shows fraction_of_supported_ops: 0.0, and the ONNX graph is not the same as the official yolov8-n-coco-pruned49_quantized ONNX downloaded from SparseZoo.
When analyzing it, there is an error about missing weights.
@bfineran can you help with this? I guess it's either the recipe or the export that causes the problem.
@mydhui you could try exporting a non-quantized FP32 model to see if the problematic Slice node is still there around this Conv. Additionally, you could skip this Conv during quantization to export a runnable model.
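(For the second suggestion, a heavily hedged sketch of the underlying torch mechanism: torch.ao.quantization skips any module whose qconfig is None. Whether SparseML's QuantizationModifier honors a per-module qconfig set this way is version-dependent, and the module path below is hypothetical — treat this as an illustration, not a confirmed fix:)

```python
import torch

# Loading an ultralytics checkpoint requires the ultralytics package for unpickling.
ckpt = torch.load("yolov8s-coco-pruned50_quantized.pt", map_location="cpu")
model = ckpt["model"].float()

# "model.22.cv2.0.0.conv" is a hypothetical module path -- locate the real one by
# matching named_modules() against the Conv flagged in the exported ONNX graph.
problem_conv = dict(model.named_modules())["model.22.cv2.0.0.conv"]
problem_conv.qconfig = None  # torch.ao.quantization skips modules whose qconfig is None
```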
Hi, has anyone managed to find library versions where quantization does not break a model trained on a custom dataset?