Performance Degradation in YOLOv8s Model Exported to ONNX via SparseML's Exporter
Describe the bug
When exporting the YOLOv8s model (pruned50-quant, model.pt from SparseZoo) via the ONNX exporter (sparseml.ultralytics.export_onnx), its performance noticeably decreases compared to the ONNX model available in SparseZoo.
Expected behavior
Performance of the two ONNX files should be the same, since it is the same model.
Environment
Include all relevant environment information:
- OS: Ubuntu 22.04
- Python version: 3.9.19
- SparseML version or commit hash: sparseml==1.7.0
- ML framework version(s): torch==2.1.2
- Other Python package versions: deepsparse==1.7.1, sparsezoo==1.7.0, ultralytics==8.0.124
- Other relevant environment information: CPU: i9-12900KS
To Reproduce
Exact steps to reproduce the behavior:
Download model.onnx for yolov8s-pruned50-quant from SparseZoo (https://sparsezoo.neuralmagic.com/models/yolov8-s-coco-pruned50_quantized). Benchmark it using deepsparse.benchmark:
> deepsparse.benchmark yolov8s-coco-pruned50_quantized.onnx
2024-05-10 13:56:31 deepsparse.benchmark.helpers INFO Thread pinning to cores enabled
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.7.1 COMMUNITY | (3904e8ec) (release) (optimized) (system=avx2_vnni, binary=avx2)
2024-05-10 13:56:31 deepsparse.benchmark.benchmark_model INFO deepsparse.engine.Engine:
onnx_file_path: yolov8s-coco-pruned50_quantized.onnx
batch_size: 1
num_cores: 8
num_streams: 1
scheduler: Scheduler.default
fraction_of_supported_ops: 1.0
cpu_avx_type: avx2
cpu_vnni: True
2024-05-10 13:56:31 deepsparse.utils.onnx INFO Generating input 'images', type = uint8, shape = [1, 3, 640, 640]
2024-05-10 13:56:31 deepsparse.benchmark.benchmark_model INFO Starting 'singlestream' performance measurements for 10 seconds
Original Model Path: yolov8s-coco-pruned50_quantized.onnx
Batch Size: 1
Scenario: sync
Throughput (items/sec): 87.1154
Latency Mean (ms/batch): 11.4735
Latency Median (ms/batch): 11.4148
Latency Std (ms/batch): 0.2300
Iterations: 872
Notice fraction_of_supported_ops: 1.0 and Throughput (items/sec): 87.1154.
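(For completeness, the download and benchmark can also be scripted. A minimal sketch, not part of the original report — the SparseZoo stub is guessed from the model page URL and the onnx_model.path attribute is assumed for sparsezoo 1.7; verify both:)

```python
import time

import numpy as np
from deepsparse import compile_model
from sparsezoo import Model

# Stub guessed from the model page URL -- verify it on SparseZoo.
stub = "zoo:cv/detection/yolov8-s/pytorch/ultralytics/coco/pruned50_quantized-none"
onnx_path = Model(stub).onnx_model.path  # downloads model.onnx locally

engine = compile_model(onnx_path, batch_size=1)
dummy = [np.zeros((1, 3, 640, 640), dtype=np.uint8)]  # matches the benchmark input
engine.run(dummy)  # warmup

iters = 200
start = time.perf_counter()
for _ in range(iters):
    engine.run(dummy)
print(f"~{iters / (time.perf_counter() - start):.1f} items/sec")
```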
Now download model.pt from the same page and export it to ONNX using the provided tool:
> sparseml.ultralytics.export_onnx --model yolov8s-coco-pruned50_quantized.pt
from n params module arguments
0 -1 1 928 ultralytics.nn.modules.conv.Conv [3, 32, 3, 2]
1 -1 1 18560 ultralytics.nn.modules.conv.Conv [32, 64, 3, 2]
2 -1 1 29056 ultralytics.nn.modules.block.C2f [64, 64, 1, True]
3 -1 1 73984 ultralytics.nn.modules.conv.Conv [64, 128, 3, 2]
4 -1 2 197632 ultralytics.nn.modules.block.C2f [128, 128, 2, True]
5 -1 1 295424 ultralytics.nn.modules.conv.Conv [128, 256, 3, 2]
6 -1 2 788480 ultralytics.nn.modules.block.C2f [256, 256, 2, True]
7 -1 1 1180672 ultralytics.nn.modules.conv.Conv [256, 512, 3, 2]
8 -1 1 1838080 ultralytics.nn.modules.block.C2f [512, 512, 1, True]
9 -1 1 656896 ultralytics.nn.modules.block.SPPF [512, 512, 5]
10 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
11 [-1, 6] 1 0 ultralytics.nn.modules.conv.Concat [1]
12 -1 1 591360 ultralytics.nn.modules.block.C2f [768, 256, 1]
13 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
14 [-1, 4] 1 0 ultralytics.nn.modules.conv.Concat [1]
15 -1 1 148224 ultralytics.nn.modules.block.C2f [384, 128, 1]
16 -1 1 147712 ultralytics.nn.modules.conv.Conv [128, 128, 3, 2]
17 [-1, 12] 1 0 ultralytics.nn.modules.conv.Concat [1]
18 -1 1 493056 ultralytics.nn.modules.block.C2f [384, 256, 1]
19 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
20 [-1, 9] 1 0 ultralytics.nn.modules.conv.Concat [1]
21 -1 1 1969152 ultralytics.nn.modules.block.C2f [768, 512, 1]
22 [15, 18, 21] 1 2147008 ultralytics.nn.modules.head.Detect [80, [128, 256, 512]]
Model summary: 225 layers, 11166560 parameters, 11166544 gradients
Applying structure from sparseml checkpoint at epoch -1
2024-05-10 13:58:11 sparseml.pytorch.utils.logger INFO Logging all SparseML modifier-level logs to sparse_logs/10-05-2024_13.58.11.log
Loaded previous weights from checkpoint
Source: 'sparseml' detected; Exporting model from SparseML checkpoint...
/home/user/anaconda3/envs/sparse_issue_env/lib/python3.9/site-packages/torch/onnx/utils.py:823: UserWarning: It is recommended that constant folding be turned off ('do_constant_folding=False') when exporting the model in training-amenable mode, i.e. with 'training=TrainingMode.TRAIN' or 'training=TrainingMode.PRESERVE' (when model is in training mode). Otherwise, some learnable model parameters may not translate correctly in the exported ONNX model because constant folding mutates model parameters. Please consider turning off constant folding or setting the training=TrainingMode.EVAL.
warnings.warn(
/home/user/anaconda3/envs/sparse_issue_env/lib/python3.9/site-packages/ultralytics/nn/modules/head.py:50: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
elif self.dynamic or self.shape != shape:
2024-05-10 13:58:15 sparseml.exporters.transforms.onnx_transform INFO [ConstantsToInitializers] Transformed 92 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [FoldIdentityInitializers] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [InitializersToUint8] Transformed 54 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [FlattenQParams] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [FoldConvDivBn] Transformed 57 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [DeleteRepeatedQdq] Transformed 2 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [QuantizeQATEmbedding] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [PropagateEmbeddingQuantization] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [PropagateDequantThroughSplit] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [MatMulAddToMatMulIntegerAddCastMul] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [MatMulToMatMulIntegerCastMul] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [FoldReLUQuants] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [ConvToConvIntegerAddCastMul] Transformed 55 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [GemmToQLinearMatMul] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [GemmToMatMulIntegerAddCastMul] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [QuantizeResiduals] Transformed 0 matches
2024-05-10 13:58:16 sparseml.exporters.transforms.onnx_transform INFO [RemoveDuplicateQConvWeights] Transformed 0 matches
2024-05-10 13:58:17 sparseml.exporters.transforms.onnx_transform INFO [RemoveDuplicateQuantizeOps] Transformed 0 matches
2024-05-10 13:58:17 sparseml.pytorch.sparsification.quantization.quantize_qat_export INFO Model initial QuantizeLinear node(s) deleted and inputs set to uint8
2024-05-10 13:58:17 sparseml.pytorch.utils.exporter INFO Created deployment folder at /home/user/Desktop/projects/sparse/issue/exported/deployment
2024-05-10 13:58:17 sparseml.pytorch.utils.exporter INFO Saved model.onnx in the deployment folder at /home/user/Desktop/projects/sparse/issue/exported/deployment/model.onnx
2024-05-10 13:58:17 sparseml.pytorch.utils.exporter INFO Created config.json file at /home/user/Desktop/projects/sparse/issue/exported/deployment
Recipe checkpoint detected, saving the recipe to the deployment directory /home/user/Desktop/projects/sparse/issue/exported/deployment
The conversion succeeds. Now benchmark the exported ONNX model:
> deepsparse.benchmark exported/model.onnx
2024-05-10 13:59:27 deepsparse.benchmark.helpers INFO Thread pinning to cores enabled
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.7.1 COMMUNITY | (3904e8ec) (release) (optimized) (system=avx2_vnni, binary=avx2)
2024-05-10 13:59:27 deepsparse.benchmark.benchmark_model INFO deepsparse.engine.Engine:
onnx_file_path: exported/model.onnx
batch_size: 1
num_cores: 8
num_streams: 1
scheduler: Scheduler.default
fraction_of_supported_ops: 0.0
cpu_avx_type: avx2
cpu_vnni: True
2024-05-10 13:59:27 deepsparse.utils.onnx INFO Generating input 'images', type = uint8, shape = [1, 3, 640, 640]
2024-05-10 13:59:27 deepsparse.benchmark.benchmark_model INFO Starting 'singlestream' performance measurements for 10 seconds
Original Model Path: exported/model.onnx
Batch Size: 1
Scenario: sync
Throughput (items/sec): 20.2886
Latency Mean (ms/batch): 49.2855
Latency Median (ms/batch): 49.0293
Latency Std (ms/batch): 2.1290
Iterations: 203
Notice fraction_of_supported_ops: 0.0 and Throughput (items/sec): 20.2886.
Throughput decreased from ~87 down to ~20 items/sec for the same model.
Model exported: https://drive.google.com/file/d/1ZDlRd6c1X05lrnxRThUo8FxuapS5Kgm7/view?usp=sharing
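(One quick way to see why the engine reports fraction_of_supported_ops: 0.0 — a sketch of mine, not part of the original report — is to diff the op-type histograms of the two files; the paths are the ones used above:)

```python
from collections import Counter

import onnx

for path in ("yolov8s-coco-pruned50_quantized.onnx", "exported/model.onnx"):
    graph = onnx.load(path).graph
    counts = Counter(node.op_type for node in graph.node)
    print(path)
    # In a well-folded quantized export the convs should appear as ConvInteger.
    for op in ("Conv", "ConvInteger", "QuantizeLinear", "DequantizeLinear", "Split", "Slice"):
        print(f"  {op}: {counts.get(op, 0)}")
```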
You can see that this style of Conv is not being folded to a ConvInteger correctly - @bfineran
@mgoin we'll need to take a look at the recipe and its application - ConvInteger requires two quantized inputs (weight and activation) to the Conv; here we see only a quantized weight input, with the output being quantized (although that may be the input quantization of another layer)
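(To make that pattern concrete, a small sketch — mine, not from the thread — that scans the exported graph for Conv nodes whose weight input is produced by a DequantizeLinear, i.e. a QDQ pair that was never folded into a ConvInteger; the file path is the one from the report:)

```python
import onnx

model = onnx.load("exported/model.onnx")
# Map each tensor name to the node that produces it.
produced_by = {out: node for node in model.graph.node for out in node.output}

for node in model.graph.node:
    if node.op_type != "Conv":
        continue
    # For ONNX Conv, input[0] is the activation and input[1] the weight.
    weight_src = produced_by.get(node.input[1])
    if weight_src is not None and weight_src.op_type == "DequantizeLinear":
        print(f"{node.name}: weight is dequantized but the Conv was not folded")
```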
@bfineran Thank you for the great work :)
Wanted to let you know that I am experiencing exactly the same performance degradation as @rsazizov, on YOLOv8n: from Throughput (items/sec): 110.0278 with the SparseZoo yolov8n ONNX down to Throughput (items/sec): 15.5770 after converting the SparseZoo yolov8n .pt model with the SparseML ONNX exporter. Is there any known bug or update on the issue?
Hi @imAhmadAsghar, we're aware of the issue and are looking into it internally - it doesn't seem to be a version compatibility issue, but you could potentially try rolling back your sparseml/pytorch versions. The issue seems to be that the model now exports differently at the beginning (a simple Split node is now a few Slice nodes).
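(A quick way to check that claim — a sketch, not from the thread: print the op types at the head of each graph and see whether the stem shows one Split or several Slice nodes; paths from the report:)

```python
import onnx

for path in ("yolov8s-coco-pruned50_quantized.onnx", "exported/model.onnx"):
    graph = onnx.load(path).graph
    # The first ~15 nodes cover the stem where the Split/Slice difference appears.
    head = [node.op_type for node in graph.node[:15]]
    print(path, "->", head)
```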
@bfineran Thank you for your response.
I actually did not get the last part of your response, "The issue seems to be that the model now exports differently at the beginning (a simple Split node is now a few Slice nodes)." Can you please explain what you mean by that in detail, if possible? I am not a performance/optimization engineer; I just want to use sparseml/deepsparse to speed up inference on CPU. However, the whole library is inconvenient and super foggy.
I have tested the following:
- I exported the base YOLOv8 model (without any recipes) to ONNX via SparseML.
- I exported the pruned YOLOv8 model (trained with the pruning recipe provided on SparseZoo) to ONNX via SparseML.
- I exported the pruned and quantized YOLOv8 model (trained with the recipe provided on SparseZoo) to ONNX via SparseML.
And here are the results:
Performance test between pruned and default model:
As you can see in the above plot, pruning does nothing.
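(For context: unstructured pruning only pays off in a sparsity-aware runtime such as DeepSparse; a generic dense path runs the same kernels either way, so identical speed is what you would expect if the export is broken as above. A quick sanity check — a sketch, with a hypothetical file name — is to measure how sparse the exported weights actually are:)

```python
import onnx
from onnx import numpy_helper

# "exported_pruned/model.onnx" is a hypothetical path to the pruned FP32 export.
model = onnx.load("exported_pruned/model.onnx")
for init in model.graph.initializer:
    weights = numpy_helper.to_array(init)
    if weights.ndim == 4:  # conv kernels
        sparsity = 1.0 - (weights != 0).mean()
        print(f"{init.name}: {sparsity:.0%} zero")
```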
Performance test between pruned vs pruned and quantized model:
I just don't get this plot; nothing makes sense at all. Quantization does not work, and the model gets slower by a large margin.
Right now I am super confused, and it does not make sense to use your library at all. I think I am missing a lot of information about the whole process. Can you please point me to a proper reference on where to start? The one provided on the homepage is not leading me anywhere, as you can see from the results.
I would really love to get it running and achieve the results you promised.
@imAhmadAsghar Hi, did you find a fix for this? What is going wrong with the exports?
@yoloyash Hi, no I could not, unfortunately.
@rsazizov @imAhmadAsghar Hi, I had exactly the same problem. I was training yolov8-n-coco-pruned49_quantized from the official SparseZoo and exported it with sparseml.ultralytics.export_onnx; when benchmarking, it shows fraction_of_supported_ops: 0.0, and the ONNX graph is not the same as the official yolov8-n-coco-pruned49_quantized ONNX downloaded from SparseZoo.
When analyzing it, there is an error about missing weights.
@bfineran can you help with this? I guess it's either the recipe or the export that causes the problem.
@mydhui you could try exporting a non-quantized FP32 model to see if the problematic Slice node is still there around this Conv. Additionally, you could skip this Conv during quantization to export a runnable model.
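(For the second suggestion, a heavily hedged sketch of the underlying torch mechanism: torch.ao.quantization skips any module whose qconfig is None. Whether SparseML's QuantizationModifier honors a per-module qconfig set this way is version-dependent, and the module path below is hypothetical — treat this as an illustration, not a confirmed fix:)

```python
import torch

# Loading an ultralytics checkpoint requires the ultralytics package for unpickling.
ckpt = torch.load("yolov8s-coco-pruned50_quantized.pt", map_location="cpu")
model = ckpt["model"].float()

# "model.22.cv2.0.0.conv" is a hypothetical module path -- locate the real one by
# matching named_modules() against the Conv flagged in the exported ONNX graph.
problem_conv = dict(model.named_modules())["model.22.cv2.0.0.conv"]
problem_conv.qconfig = None  # torch.ao.quantization skips modules whose qconfig is None
```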
Hi, has anyone managed to find library versions where quantization does not break a model trained on a custom dataset?