Query/Issue with Custom YOLOv5 Model and ONNX Export
Search before asking
- [X] I have searched the YOLOv5 issues and found no similar bug report.
YOLOv5 Component
Detection, Export
Bug
I am working with a custom-trained YOLOv5 model that was trained on a dataset with 4 classes. After exporting the model to ONNX format, I am facing discrepancies in the output tensor shape and class configurations, which are creating confusion and potential issues in downstream tasks. Below, I outline the details of my observations, potential root causes, and attempts to resolve the issue.
Environment
Custom model based on yolov5s.pt, Ubuntu 22.04, running on my own system.
Minimal Reproducible Example
Standard detection code from https://github.com/arindal1/yolov5-onnx-object-recognition/blob/main/yolov5.py
Additional
Observations:
Custom Model Details:
The .pt model was trained on a dataset with 4 classes (bird, drone, helicopter, jetplane).
When inspecting the .pt model, the number of classes is confirmed as 4 both in the names field and in the nc parameter from the data.yaml.
The .pt model performs as expected, detecting all 4 classes correctly during inference.
ONNX Export Details:
After exporting the model to ONNX, the output tensor shape is reported as [1, 8, 8400].
The 8 indicates the number of output channels in the detection head, which suggests it is configured for only 3 classes (5 + 3 = 8 instead of 5 + 4 = 9).
This is inconsistent with the .pt model, which was trained on 4 classes.
When checking the ONNX model metadata, the class names (bird, drone, helicopter, jetplane) are correctly stored, indicating 4 classes in the metadata.
Comparison with Default COCO Model:
For reference, the output tensor shape of a YOLOv5 model trained on the COCO dataset (80 classes) is [1, 25200, 85].
Here, 85 = 5 + 80 (5 for bounding box attributes + 80 for classes).
This format aligns with the expected configuration for YOLO models.
Key Issues:
Mismatch in Output Tensor Shape:
The ONNX model's output tensor shape suggests it is configured for only 3 classes ([1, 8, 8400]), despite the .pt model being trained on 4 classes.
This raises concerns about whether the ONNX model will correctly detect all 4 classes.
Potential Causes of the Issue:
The detection head in the .pt model might have been misconfigured during training or export.
For 4 classes, the detection head's out_channels should be 5 + 4 = 9, but it appears to be set to 8.
The ONNX export process might not be correctly handling the model's class configuration.
Implications for Object Detection:
If the ONNX model is truly configured for only 3 classes, it may fail to detect one of the classes or produce incorrect predictions.
Steps Taken to Debug (a code sketch of the detection-head inspection appears after this list):
Inspected Detection Head of .pt Model:
Verified the out_channels of the detection head (last layer).
The .pt model's detection head is confirmed to have out_channels = 8, indicating a configuration for 3 classes.
This discrepancy persists despite the model being trained on 4 classes.
Verified ONNX Model Metadata:
Extracted metadata from the ONNX model, which correctly lists 4 class names (bird, drone, helicopter, jetplane).
Tried Re-exporting the Model:
Re-exported the .pt model to ONNX using the official YOLOv5 export script.
The issue with the output tensor shape ([1, 8, 8400]) remains.
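For reference, a minimal sketch of the detection-head inspection described above (the checkpoint path is a placeholder, and it assumes the script runs from a YOLOv5 repository clone so the pickled model classes can be resolved):

```python
import torch

# Load the trained checkpoint and grab the Detect() layer of a standard YOLOv5 model
ckpt = torch.load("runs/train/exp/weights/best.pt", map_location="cpu")  # placeholder path
detect = ckpt["model"].float().model[-1]

print("class names:", ckpt["model"].names)
print("nc:", detect.nc, "| no (outputs per anchor):", detect.no)
print("conv out_channels per detection layer:", [m.out_channels for m in detect.m])
# For nc = 4 with 3 anchors per layer, each conv should have 3 * (4 + 5) = 27 output channels.
```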
Request for Assistance:
Clarification on Detection Head Configuration:
Could this issue arise from a misconfiguration of the detection head during training? If so, how can I fix it without retraining the model?
Is there a way to manually adjust the detection head's out_channels in the .pt model and re-export it to ONNX?
ONNX Export Process:
Are there known issues with the YOLOv5 ONNX export script that could cause this mismatch?
How can I ensure the ONNX model's detection head is correctly configured for 4 classes?
General Guidance:
What steps can I take to verify that the ONNX model will correctly detect all 4 classes?
Are there tools or scripts you recommend for validating the ONNX model's outputs?
Additional Context:
Ultralytics version: 2.4.1
PyTorch version: 2.4.1
ONNX Runtime version: 1.16.3
Thank you for your assistance in resolving this issue!
Are you willing to submit a PR?
- [ ] Yes I'd like to help by submitting a PR!
👋 Hello @AbhirupSinha1811, thank you for your detailed report and for using YOLOv5 🚀! Your observations and debugging steps are very thorough, which is highly appreciated.
If this is indeed a 🐛 Bug Report, we kindly request a minimum reproducible example (MRE) to better assist in debugging this issue. An MRE would ideally contain simplified, complete code snippets and/or instructions to reproduce the ONNX export and the tensor shape discrepancy.
From the context provided, here are a few steps you can double-check:
- Detection Head Configuration: Ensure the YOLOv5 detection head reflects the correct `out_channels` value (which should match `5 + number_of_classes` for the dataset) both before and after training.
- ONNX Metadata: Validate that the ONNX model metadata and the number of classes defined match the expected configurations.
- Re-export Process: Try re-exporting the model using the official export script with verbose logging enabled to identify any discrepancies during the export process.
Requirements
Ensure you are using Python>=3.8 with all dependencies installed correctly. Install requirements using:
pip install -r requirements.txt
Verified Environments
The ONNX export process is generally supported on environments such as notebooks, cloud platforms, or Docker. Make sure your training and export environments meet the dependencies, including PyTorch, CUDA, and ONNX runtime versions.
Additionally, it's worth confirming if the issue persists when running the export script on different setups or versions.
This is an automated response, but don't worry! An Ultralytics engineer will review your issue promptly to provide further assistance. In the meantime, feel free to share any additional findings or code snippets that could help us debug further.
@AbhirupSinha1811 thank you for providing a detailed explanation of the issue. Based on your observations, it seems the problem stems from a misconfigured detection head in the .pt model. Here are some points to address your concerns:
- Detection Head Configuration:
  - The mismatch in `out_channels` (8 instead of 9) indicates the model was trained with an incorrect detection head configuration for 4 classes. Unfortunately, this cannot be fixed without retraining the model, as the detection head's architecture is defined during training.
- ONNX Export Process:
  - The YOLOv5 export script correctly uses the configuration of the `.pt` model for ONNX conversion. Since the `.pt` model itself is misconfigured, the ONNX model inherits the same issue. There are no known bugs in the export script that would alter the class configuration during conversion.
- Manual Adjustment (Without Retraining):
  - While directly modifying the detection head's `out_channels` in the `.pt` model is theoretically possible, it is not recommended. Adjusting this manually would require significant changes to the model's architecture and weights, which is error-prone and may lead to unreliable results.
- Validation of ONNX Outputs:
  - To verify the ONNX model's behavior, you can test it using the `detect.py` script in ONNX mode: `python detect.py --weights model.onnx --img-size 640 --dnn`
  - If issues persist, visualizing the model using Netron can help confirm the final layer's configuration. A programmatic shape check with ONNX Runtime is sketched below.
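A minimal ONNX Runtime sketch for the shape check mentioned above, assuming a standard 640×640 float32 export (the file name is a placeholder):

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
inp = sess.get_inputs()[0]
print("input:", inp.name, inp.shape)

# Run a dummy image and inspect the raw prediction tensor(s)
dummy = np.zeros((1, 3, 640, 640), dtype=np.float32)
outputs = sess.run(None, {inp.name: dummy})
for o in outputs:
    print("output shape:", o.shape)

# For a standard YOLOv5 export, each prediction row has length 5 + nc
# (x, y, w, h, objectness, class scores), so 4 classes should yield 9.
```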
To resolve this issue definitively, it is recommended to retrain the model with the correct class configuration (4 classes). If you suspect a training script issue, ensure you are using the latest YOLOv5 version and verify the data.yaml and training parameters before starting.
Feel free to share further observations or questions. The YOLO community and Ultralytics team are here to help!
Hello, after checking the detection head of the YOLO .pt model, this is what I get:
- Detection Head Output Channels: 68
- Number of Classes: 4
- Class Names: ['bird', 'drone', 'helicopter', 'jetplane']
- Detection Head and Output Channels
Why does the detection head of my custom YOLOv5s model have 68 output channels when it was trained on 4 classes? Shouldn't it be 27 (3 × (5 + 4) for 4 classes and 3 anchors)? Could this mismatch have happened during training? How can I check and fix it?
- Inference Behavior: Then why does the model detect all 4 classes correctly during inference with the .pt model?
How does it handle the extra channels? Is this behavior consistent across all formats like ONNX or TensorRT?
- ONNX Export and Output Shape: When exporting to ONNX, the output tensor shape is [1, 8, 8400] instead of [1, 27, grid_cells] for 4 classes. Why is this happening, and how can I fix it? Could the extra detection head channels (68) be causing this issue?
- Debugging and Fixing: How can I verify the number of classes (nc) and output channels used during training? Is there a way to fix the detection head's output channels post-training without retraining?
- Recommendations: What's the best way to ensure the detection head matches the number of classes during training and export? Are there tools or scripts to avoid issues like this during ONNX export?
Key Observations to Share:
- Detection Head Output Channels: 68
- Number of Classes (nc) in YAML: 4
- Class Names: ['bird', 'drone', 'helicopter', 'jetplane']
- ONNX Output Tensor Shape: [1, 8, 8400]
Behavior: Model detects all 4 classes correctly during inference with .pt but shows unexpected behavior during ONNX export.
@AbhirupSinha1811 thank you for the detailed observations. Here's a concise response addressing your queries:
- Detection Head Output Channels (68 instead of 27):
  The detection head's `out_channels` is determined by the architecture during training. A value of 68 suggests the model may have been configured with additional outputs, such as extra layers or custom modifications. To verify this, inspect the model's architecture and training script for any changes to the detection head.
- Correct Inference with .pt:
  Despite the mismatch in `out_channels`, the `.pt` model likely filters outputs internally to match the 4 classes during inference. This behavior depends on how the post-processing step (e.g., NMS) is configured. It does not guarantee consistent behavior across formats like ONNX or TensorRT.
- ONNX Export Issue ([1, 8, 8400] output):
  The ONNX export inherits the detection head configuration from the `.pt` model. The discrepancy in output shape likely results from the detection head misconfiguration during training. The [1, 8, 8400] output suggests the model is treating it as 3 classes (5 + 3 = 8). Fixing this requires retraining with the correct configuration.
- Debugging and Fixing:
  - To verify the number of classes (`nc`) used during training, check the `data.yaml` file and the `model.yaml` or architecture definition.
  - Post-training fixes are not recommended, as modifying detection head outputs requires retraining to ensure weight alignment.
- Recommendations:
  - Ensure the `data.yaml` and `model.yaml` files are correctly configured for the intended number of classes before training (a verification sketch follows this list).
  - Use the latest YOLOv5 export script (`export.py`) to minimize export-related issues.
  - Debug exported models using tools like Netron to visualize outputs and metadata.
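A minimal sketch for the first recommendation above, cross-checking `data.yaml` against the trained checkpoint (the paths are placeholders; it assumes PyYAML is installed and the script runs from a YOLOv5 repository clone so the pickled model classes resolve):

```python
import torch
import yaml

# Dataset definition used for training
with open("data.yaml") as f:
    data_cfg = yaml.safe_load(f)
print("data.yaml  -> nc:", data_cfg["nc"], "| names:", data_cfg["names"])

# Configuration actually baked into the trained checkpoint
ckpt = torch.load("runs/detect/train/weights/last.pt", map_location="cpu")
model = ckpt["model"]
print("checkpoint -> nc:", model.yaml.get("nc"), "| names:", model.names)
print("detection head no:", model.model[-1].no)  # should equal nc + 5 for a standard YOLOv5 head
```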
For further details on ONNX export, refer to the YOLOv5 Export Tutorial. Feel free to follow up with additional questions!
Hello, @pderrenger
I have reviewed the training script and data.yaml file thoroughly, and there have been no modifications. The script is standard and directly references data.yaml with nc=4 and class names: ["bird", "drone", "helicopter", "jetplane"]. No customizations or deviations have been made.
Training sample code:

```python
# ================================================================
import torch
from ultralytics import YOLO  # import assumed; not shown in the original snippet

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

# Training starts here
model = YOLO(data="data.yaml", epochs=100)                    # Initiates training with 100 epochs
model = YOLO('runs/detect/train/weights/last.pt').to(device)  # Loads the last checkpoint
result = model.train(resume=True)                             # Resumes training
# ================================================================
```
YOLOv5 Version:- The YOLOv5 version used for training was downloaded from the official Ultralytics site and initiated on March 24, 2024.
Observed Issue: Despite adhering to these configurations, the detection head outputs (no=68) do not match the expected configuration for nc=4. This has resulted in discrepancies during ONNX export ([1, 8, 8400] output) and inference.
Training Script Behavior:
The training script appears standard and passes data="data.yaml" with nc=4. Is there any additional step required to ensure that the detection head is correctly initialized with the number of classes (4) during training?
When resuming training with resume=True, does the detection head automatically align with the nc value in data.yaml, or could it retain the configuration from the checkpoint (last.pt)?
Detection Head Configuration: What could cause the detection head to produce no=68 outputs when nc=4 is defined in data.yaml? Is this likely due to an issue during checkpoint initialization or training?
Does the model automatically reconfigure the detection head when nc changes, or does it require manual intervention (e.g., reinitializing layers)?
Data.yaml Verification:
The data.yaml file has nc=4 and lists four classes. Are there any other factors (e.g., anchor settings or dataset labels) that could lead to a mismatch in detection head outputs?
Does the order or format of the class names in data.yaml impact the detection head configuration during training?
Impact of Resume Training:
When resuming training with last.pt, could the detection head's configuration (e.g., no and anchors) differ from the new dataset's nc? If so, what steps are needed to realign the detection head?
Model Export and Compatibility: Could a mismatch between nc and no cause downstream issues, such as incorrect ONNX outputs or inference errors in TensorRT? If yes, how can these issues be resolved during export or training?
What is the best way to inspect the detection head during training or inference to verify its nc and no configuration? Are there specific checkpoints or logging steps recommended to avoid such mismatches?
Hello, @AbhirupSinha1811, and thank you for the detailed explanation and observations. Based on your description, here are some points to address your concerns:
- Detection Head Mismatch (`no=68` with `nc=4`):
  In a standard YOLOv5 head, `no` (the number of outputs per anchor) equals `nc + 5`, and each detection conv layer has `no * na` output channels, where `na` is the number of anchors per layer; for `nc=4` this gives `no=9` and 27 channels per layer (a small arithmetic sketch follows this list). A value of `no=68` implies some inconsistency in the configuration, possibly due to:
  - A mismatch in the `data.yaml` file or its interpretation during training.
  - A prior checkpoint (`last.pt`) being loaded with a different architecture or parameters. YOLOv5 does not automatically reinitialize the detection head when resuming training (`resume=True`); it retains the configuration from the checkpoint.
- Resume Training Behavior:
  Resuming training with `resume=True` will not realign the detection head to the `data.yaml` file's `nc` value if the checkpoint was trained with a different configuration. To avoid this, ensure that the initial checkpoint (`last.pt`) matches the current dataset's `nc` and other parameters.
- ONNX Export Mismatch:
  The ONNX export inherits the trained model's architecture. If the `.pt` checkpoint has incorrect `no` values, the ONNX export ([1, 8, 8400]) will also reflect this. This can cause downstream issues with inference in TensorRT or other formats.
- Steps to Address the Issue:
  - Verify Training Parameters: Ensure that `nc=4` in `data.yaml` aligns with the dataset and that no conflicting parameters are introduced.
  - Inspect the Checkpoint: Use the following code to inspect the detection head's configuration in the `.pt` model before resuming training:

    ```python
    model = torch.load('runs/detect/train/weights/last.pt')
    print(model['model'].names)  # Class names
    print(model['model'].yaml)   # Verify nc and other parameters
    ```
  - Reinitialize the Detection Head: If the `no` mismatch persists, reinitialize the model with the correct `nc` and retrain:

    ```python
    model = YOLO(data='data.yaml', pretrained=False)  # Initialize with correct nc
    model.train(epochs=100)
    ```
  - Inspect ONNX Outputs: Use Netron to visualize the exported ONNX model and confirm its architecture.
- Key Consideration for Resume Training:
  If the checkpoint (`last.pt`) was trained on a different dataset or configuration, it will retain the previous `nc` and detection head configuration. Always verify that the checkpoint aligns with the current training setup before resuming.
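A tiny arithmetic sketch of the relationship described in the first point, assuming a standard three-anchor YOLOv5 detection head (the numbers are illustrative):

```python
# Expected detection-head dimensions for a standard YOLOv5 model with 4 classes
nc = 4                  # bird, drone, helicopter, jetplane
na = 3                  # anchors per detection layer (YOLOv5 default)

no = nc + 5             # outputs per anchor: x, y, w, h, objectness + class scores -> 9
out_channels = no * na  # output channels of each Detect() conv layer -> 27

print(f"no = {no}, per-layer out_channels = {out_channels}")
# A reported no of 68 (or an ONNX channel dimension of 8) does not fit this pattern.
```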
Let us know if you need further clarification! For more export-related guidance, refer to the YOLOv5 Export Tutorial.
Hello, I've checked our custom .pt model in Netron, and this is what I get:
Detection Head and Outputs
Why is the number of outputs (no) from the detection head 68? What factors could be responsible for this kind of value, and if we retrain, what should we keep in mind beforehand?
The anchors tensor has a shape of float16[2,7497]. Is this correct for my custom-trained model, or does it indicate an issue?
How does the detection head configuration relate to the number of classes (nc=4)?
Anchors: The anchors format is different from the standard YOLOv5 anchors (float16[3,3,2]). Could this cause issues during inference?
How can I confirm if the anchors used during training were correct for my dataset?
Training and Configuration: Could the issues be caused by not explicitly using a model.yaml during training? Does YOLOv5 automatically adjust anchors for custom datasets, and how can I check this?
Hello, thank you for your observations. Here's a concise breakdown addressing your concerns:
- The `no=68` from the detection head suggests a mismatch between the expected number of outputs and your dataset configuration (`nc=4`). This could result from loading a checkpoint (`last.pt`) trained on a different setup without reinitializing the model. Retraining with a properly configured `model.yaml` (matching `nc=4`) is necessary to resolve this.
- The anchors tensor (`float16[2,7497]`) is incorrect for YOLOv5, where anchors are typically shaped like `float16[3,3,2]`. This discrepancy could indicate issues in model initialization or training. Ensure that the correct `model.yaml` is used, and let YOLOv5 automatically calculate anchors during training (a sketch for inspecting the stored anchors follows this list).
- YOLOv5 adjusts anchors automatically for custom datasets unless explicitly overridden. To verify, inspect the `anchors` in your `model.yaml` or training logs (AutoAnchor should report if anchors are updated).
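A minimal sketch for inspecting the anchors stored in the checkpoint, as mentioned in the second point (the path is a placeholder; run it from a YOLOv5 repository clone so the pickled model classes resolve):

```python
import torch

ckpt = torch.load("runs/detect/train/weights/last.pt", map_location="cpu")
detect = ckpt["model"].model[-1]  # Detect() layer of a standard YOLOv5 model

print("anchors tensor shape:", tuple(detect.anchors.shape))  # expected (3, 3, 2) for yolov5s
print("detection layers (nl):", detect.nl, "| anchors per layer (na):", detect.na)
```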
To avoid such issues, verify the model.yaml and data.yaml configurations before training and ensure logs report the expected setup. If needed, refer to the YOLOv5 architecture documentation for further details. Let us know if you need additional clarification!
Hello @pderrenger, could you please help me out by explaining how I can find out or check where the calculation takes place that produces the value 68, and the 7497 in float16[2,7497]? Is there a code-based way, or through Netron, to identify exactly which layer feeds the detection layer, so I can check what kind of values are being passed to the output detection layer and why we get these wrong values?
Hello @AbhirupSinha1811,
To trace and understand the calculations resulting in no=68 and float16[2,7497], you can inspect the layers preceding the detection head using the following approaches:
- Using Netron:
  Open the `.pt` model in Netron and navigate to the layers directly before the detection head. Look for discrepancies in the output tensor shapes or parameters that might propagate incorrect values.
- Using Code:
  Load the model and print the details of the layers before the detection head:

  ```python
  import torch

  model = torch.load('runs/detect/train/weights/last.pt')['model']
  for i, layer in enumerate(model.model[-1].m):  # Iterate through detection layers
      print(f"Layer {i}: {layer}")
  ```
  You can also inspect the anchors and shapes:

  ```python
  print(f"Anchors: {model.yaml['anchors']}")
  print(f"Detection head outputs: {model.model[-1].no}")
  ```
This will help identify where the configuration might deviate from expectations. Let me know if you need further clarification!
Hello @pderrenger, got it. When I retrain the model, could you please guide me on which parameters I have to keep in mind so that incorrect values like the anchors tensor (float16[2,7497]), which is wrong for YOLOv5, don't appear again?
Hello @AbhirupSinha1811, to avoid issues like incorrect anchor tensor shapes (float16[2,7497]) during retraining, ensure the following:
- Use the correct `data.yaml` file with `nc` matching the number of classes in your dataset.
- Allow YOLOv5 to calculate anchors automatically (AutoAnchor runs by default; avoid passing `--noautoanchor`), which adapts them to your custom dataset.
- Verify the `model.yaml` architecture matches your dataset needs, particularly the number of detection layers and anchors.
- Avoid resuming training (`resume=True`) if the initial checkpoint was trained on mismatched settings. Start fresh with `pretrained=False` or a compatible checkpoint (an example fresh-training call is sketched below).
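As referenced in the last bullet, a sketch of a fresh training run under these assumptions: it uses the `run()` helper exposed by YOLOv5's `train.py` (argument names mirror the CLI flags), is executed from the repository root, and the paths and epoch count are placeholders:

```python
import train  # yolov5/train.py, when run from the repository root

# Fresh run: the weights are only a starting point, so the detection head is rebuilt
# for nc=4 from data.yaml, and AutoAnchor runs by default to adapt the anchors.
train.run(
    weights="yolov5s.pt",  # official pretrained weights instead of the old last.pt
    data="data.yaml",      # nc: 4, names: [bird, drone, helicopter, jetplane]
    epochs=100,
    imgsz=640,
)
```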
For more on anchor generation, review the YOLOv5 Architecture Documentation. Let me know if further details are needed!
Hello,
I am currently working on retraining a YOLOv5 model using the last.pt checkpoint and I would like to continue training with additional epochs. I am considering using the --resume argument in the train.py script for this purpose.
Could you please confirm if using the --resume argument is the correct approach for continuing the training from the last.pt checkpoint with additional epochs?
Hello, yes, using the --resume argument is the correct approach to continue training from the last.pt checkpoint. This will load the weights, optimizer state, and training parameters, seamlessly continuing from where the previous training left off. Ensure your last.pt checkpoint aligns with your current dataset and configuration. For more details, refer to the Ultralytics YOLOv5 Training Documentation.
Hello, in my case the training was not interrupted; I just want to retrain the model using last.pt with additional epochs. In that case, will --resume be required?
Hello, the --resume argument is not required in this case. You can load the last.pt checkpoint and start a new training session with additional epochs by specifying the same data and epochs parameters. Using --resume is intended for continuing interrupted training runs. For your scenario, simply start training with the last.pt as your initial weights.
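A short sketch contrasting the two options, under the same assumptions as the earlier example (YOLOv5's `train.py` `run()` helper, placeholder paths and epoch counts):

```python
import train  # yolov5/train.py, run from the repository root

# New training session that simply initializes from the previous weights (no --resume):
train.run(
    weights="runs/train/exp/weights/last.pt",  # previous checkpoint as starting weights
    data="data.yaml",
    epochs=50,   # a fresh schedule of additional epochs
    imgsz=640,
)

# By contrast, resuming an interrupted run would be:
#   python train.py --resume runs/train/exp/weights/last.pt
# which restores the original run's settings, optimizer state, and epoch budget.
```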
👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.
For additional resources and information, please see the links below:
- Docs: https://docs.ultralytics.com
- HUB: https://hub.ultralytics.com
- Community: https://community.ultralytics.com
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!
Thank you for your contributions to YOLO 🚀 and Vision AI ⭐