`ValueError` when training `yoloe-11n-seg` from scratch
Search before asking
- [x] I have searched the Ultralytics YOLO issues and found no similar bug report.
Ultralytics YOLO Component
Train
Bug
I'm trying to train the yoloe-11n-seg model from scratch. Following the instructions on the YOLOE page, I downloaded all datasets and created the script below, with only minor changes to account for limited GPU resources (see the MRE below). However, when I run the script I hit the following error:
Traceback (most recent call last):
  File "train_yoloe.py", line 22, in <module>
    model.train(
  File "/home/user/airchair/lib/python3.8/site-packages/ultralytics/engine/model.py", line 791, in train
    self.trainer.train()
  File "/home/user/airchair/lib/python3.8/site-packages/ultralytics/engine/trainer.py", line 211, in train
    self._do_train(world_size)
  File "/home/user/airchair/lib/python3.8/site-packages/ultralytics/engine/trainer.py", line 385, in _do_train
    loss, self.loss_items = self.model(batch)
  File "/home/user/airchair/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/airchair/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/airchair/lib/python3.8/site-packages/ultralytics/nn/tasks.py", line 119, in forward
    return self.loss(x, *args, **kwargs)
  File "/home/user/airchair/lib/python3.8/site-packages/ultralytics/nn/tasks.py", line 1043, in loss
    return self.criterion(preds, batch)
  File "/home/user/airchair/lib/python3.8/site-packages/ultralytics/utils/loss.py", line 332, in __call__
    masks = F.interpolate(masks[None], (mask_h, mask_w), mode="nearest")[0]
  File "/home/user/airchair/lib/python3.8/site-packages/torch/nn/functional.py", line 3983, in interpolate
    raise ValueError(
ValueError: Input and output must have the same number of spatial dimensions, but got input with spatial dimensions of [] and output size of (160, 160). Please provide input tensor in (N, C, d1, d2, ...,dK) format and output size in (o1, o2, ...,oK) format.
Any guesses on what's going wrong?
Environment
Ultralytics 8.3.107 🚀 Python-3.8.10 torch-2.4.1+cu121 CUDA:0 (NVIDIA GeForce RTX 3090, 24135MiB)
Setup complete ✅ (20 CPUs, 94.0 GB RAM, 1645.0/1832.7 GB disk)
OS Linux-6.8.0-52-generic-x86_64-with-glibc2.29
Environment Linux
Python 3.8.10
Install pip
Path /home/user/airchair/lib/python3.8/site-packages/ultralytics
RAM 93.98 GB
Disk 1645.0/1832.7 GB
CPU Intel Core(TM) i9-10900X 3.70GHz
CPU count 20
GPU NVIDIA GeForce RTX 3090, 24135MiB
GPU count 2
CUDA 12.1
numpy ✅ 1.24.4<=2.1.1,>=1.23.0
matplotlib ✅ 3.7.5>=3.3.0
opencv-python ✅ 4.11.0.86>=4.6.0
pillow ✅ 8.4.0>=7.1.2
pyyaml ✅ 5.3.1>=5.3.1
requests ✅ 2.32.3>=2.23.0
scipy ✅ 1.10.1>=1.4.1
torch ✅ 2.4.1>=1.8.0
torch ✅ 2.4.1!=2.4.0,>=1.8.0; sys_platform == "win32"
torchvision ✅ 0.19.1>=0.9.0
tqdm ✅ 4.67.1>=4.64.0
psutil ✅ 5.9.8
py-cpuinfo ✅ 9.0.0
pandas ✅ 2.0.3>=1.1.4
seaborn ✅ 0.13.2>=0.11.0
ultralytics-thop ✅ 2.0.14>=2.0.0
Minimal Reproducible Example
from ultralytics import YOLOE
from ultralytics.models.yolo.yoloe import YOLOESegTrainerFromScratch
data = dict(
    train=dict(
        yolo_data=["Objects365.yaml"],
        grounding_data=[
            dict(
                img_path="../datasets/flickr/full_images/",
                json_file="../datasets/flickr/annotations/final_flickr_separateGT_train_segm.json",
            ),
            dict(
                img_path="../datasets/mixed_grounding/gqa/images",
                json_file="../datasets/mixed_grounding/annotations/final_mixed_train_no_coco_segm.json",
            ),
        ],
    ),
    val=dict(yolo_data=["lvis.yaml"]),
)

model = YOLOE("yoloe-11n-seg.yaml")
model.train(
    data=data,
    batch=4,  # Reduce batch size to account for smaller GPU memory
    epochs=960,  # Increase epochs to compensate
    close_mosaic=2,
    optimizer="AdamW",
    lr0=2e-3,
    warmup_bias_lr=0.0,
    weight_decay=0.025,
    momentum=0.9,
    workers=4,
    trainer=YOLOESegTrainerFromScratch,
    device="0",  # Use a single GPU instead of 8
)
Additional
No response
Are you willing to submit a PR?
- [ ] Yes I'd like to help by submitting a PR!
👋 Hello @xperroni, thank you for your detailed bug report and for providing a thorough minimal reproducible example (MRE) 🚀! This is an automated response to help get you started; an Ultralytics engineer will also review and assist you soon.
For new users, we recommend checking the Docs where you'll find helpful Python and CLI usage examples. Many common questions are answered there, which might help you troubleshoot while you wait.
If you have further details or updates to your MRE, please share them to help our team reproduce and resolve the issue efficiently 🛠️.
Join the Ultralytics community for discussion and support.
Upgrade
Please ensure you are using the latest ultralytics package and all required dependencies in a Python>=3.8 environment with PyTorch>=1.8, as your issue may already be resolved in a newer release:
pip install -U ultralytics
See all requirements in pyproject.toml.
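After upgrading, you can confirm the active version and regenerate the environment report shown above. A quick sketch using the public `ultralytics` API:

```python
import ultralytics

print(ultralytics.__version__)  # confirm the upgraded version is the one in use
ultralytics.checks()  # prints the same environment summary shown above
```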
Environments
YOLO models can be run in any of these up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python, and PyTorch preinstalled):
- Notebooks with free GPU:
- Google Cloud Deep Learning VM. See GCP Quickstart Guide
- Amazon Deep Learning AMI. See AWS Quickstart Guide
- Docker Image. See Docker Quickstart Guide
Status
If this badge is green, all Ultralytics CI tests are passing. CI tests verify correct operation of all YOLO Modes and Tasks on macOS, Windows, and Ubuntu every 24 hours and on every commit.
Thanks again for your report! An Ultralytics engineer will follow up soon.
Can you try the latest version?
Upgraded to version 8.3.130, as shown in the environment report below. I'm still getting the same error.
Ultralytics 8.3.130 🚀 Python-3.8.10 torch-2.4.1+cu121 CUDA:0 (NVIDIA GeForce RTX 3090, 24135MiB)
Setup complete ✅ (20 CPUs, 94.0 GB RAM, 1645.0/1832.7 GB disk)
OS Linux-6.8.0-52-generic-x86_64-with-glibc2.29
Environment Linux
Python 3.8.10
Install pip
Path /home/user/airchair/lib/python3.8/site-packages/ultralytics
RAM 93.98 GB
Disk 1645.0/1832.7 GB
CPU Intel Core(TM) i9-10900X 3.70GHz
CPU count 20
GPU NVIDIA GeForce RTX 3090, 24135MiB
GPU count 2
CUDA 12.1
numpy ✅ 1.24.4>=1.23.0
matplotlib ✅ 3.7.5>=3.3.0
opencv-python ✅ 4.11.0.86>=4.6.0
pillow ✅ 8.4.0>=7.1.2
pyyaml ✅ 5.3.1>=5.3.1
requests ✅ 2.32.3>=2.23.0
scipy ✅ 1.10.1>=1.4.1
torch ✅ 2.4.1>=1.8.0
torch ✅ 2.4.1!=2.4.0,>=1.8.0; sys_platform == "win32"
torchvision ✅ 0.19.1>=0.9.0
tqdm ✅ 4.67.1>=4.64.0
psutil ✅ 5.9.8
py-cpuinfo ✅ 9.0.0
pandas ✅ 2.0.3>=1.1.4
seaborn ✅ 0.13.2>=0.11.0
ultralytics-thop ✅ 2.0.14>=2.0.0
@Laughing-q Can you check?
I did notice, however, that I get different errors depending on the value of the batch argument. For example, if I set batch back to 128, this is what I get:
Traceback (most recent call last):
  File "train_yoloe.py", line 22, in <module>
    model.train(
  File "/home/user/airchair/lib/python3.8/site-packages/ultralytics/engine/model.py", line 793, in train
    self.trainer.train()
  File "/home/user/airchair/lib/python3.8/site-packages/ultralytics/engine/trainer.py", line 212, in train
    self._do_train(world_size)
  File "/home/user/airchair/lib/python3.8/site-packages/ultralytics/engine/trainer.py", line 385, in _do_train
    batch = self.preprocess_batch(batch)
  File "/home/user/airchair/lib/python3.8/site-packages/ultralytics/models/yolo/yoloe/train.py", line 200, in preprocess_batch
    txt_feats = torch.stack([self.text_embeddings[text] for text in texts]).to(self.device)
  File "/home/user/airchair/lib/python3.8/site-packages/ultralytics/models/yolo/yoloe/train.py", line 200, in <listcomp>
    txt_feats = torch.stack([self.text_embeddings[text] for text in texts]).to(self.device)
KeyError: 'what'
Thanks for providing the detailed error information and testing with different batch sizes. This appears to be an issue with the segmentation mask processing during training.
The first error indicates a problem with mask dimensions during interpolation - the input mask has no spatial dimensions but the function is trying to resize it to 160x160. This typically happens when there's a mismatch between the expected segmentation mask format and what's being provided in the dataset.
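For what it's worth, that failure can be reproduced standalone; the shape below is illustrative, not the actual training tensor:

```python
import torch
import torch.nn.functional as F

masks = torch.zeros(4)  # degenerate mask tensor with no H x W dimensions
# masks[None] has shape (1, 4): batch and channel dims only, zero spatial dims,
# while the requested output size (160, 160) implies two spatial dims -> same ValueError
F.interpolate(masks[None], (160, 160), mode="nearest")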
The second error with larger batch sizes (KeyError: 'what') suggests an issue with text embeddings - the model is trying to load an embedding for the word "what" which isn't in the embedding dictionary.
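In toy form (names hypothetical), the lookup that fails looks like this: the trainer caches embeddings for known category names, and any text not in that cache raises the KeyError:

```python
import torch

# Hypothetical cache keyed by known category names
text_embeddings = {"person": torch.randn(512), "car": torch.randn(512)}

texts = ["person", "what"]  # "what" was never embedded
txt_feats = torch.stack([text_embeddings[t] for t in texts])  # KeyError: 'what'
```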
This could be related to dataset preparation or compatibility issues with the current implementation. Since you're training from scratch, I'd suggest:
- Check that your datasets (Objects365, Flickr, and mixed_grounding) are properly formatted for segmentation tasks
- Try with a pre-built model first (using yoloe-11n-seg.pt instead of yoloe-11n-seg.yaml), as sketched after this list
- Consider reducing the complexity by starting with just one dataset to isolate the issue
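A minimal sketch of the pre-built model suggestion, reusing the `data` dict and arguments from your MRE:

```python
from ultralytics import YOLOE
from ultralytics.models.yolo.yoloe import YOLOESegTrainerFromScratch

# Load released weights instead of building from a bare YAML config
model = YOLOE("yoloe-11n-seg.pt")
model.train(data=data, batch=4, workers=4, trainer=YOLOESegTrainerFromScratch, device="0")
```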
These issues often relate to dataset preprocessing specifics that are expected by the YOLOE implementation. I'll ask our development team to look more closely at the YOLOE segmentation training code path.
@xperroni Hey, the error suggests to me that you were training a segmentation model with detection labels. Did you use the generated mask annotations for the Objects365 dataset?
https://docs.ultralytics.com/models/yoloe/#train-official-models
and use the script from the official team to convert the segmentation JSON labels to YOLO format?
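For reference, a hedged sketch of such a conversion using the built-in `ultralytics.data.converter.convert_coco` helper; the official script may differ, and the paths below are hypothetical:

```python
from ultralytics.data.converter import convert_coco

# Convert COCO-style segmentation JSON annotations to YOLO txt labels
convert_coco(
    labels_dir="../datasets/object365/annotations/",  # hypothetical location of the *_segm.json files
    save_dir="../datasets/object365/labels_segm/",  # hypothetical output directory
    use_segments=True,  # keep polygon masks rather than boxes only
    cls91to80=False,  # Objects365 classes are not COCO's 91-class set
)
```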
Alternatively, I wonder if you could directly use the objects365_train_segm.json as grounding_data by doing:
data = dict(
    train=dict(
        grounding_data=[
            dict(
                img_path="../datasets/flickr/full_images/",
                json_file="../datasets/flickr/annotations/final_flickr_separateGT_train_segm.json",
            ),
            dict(
                img_path="../datasets/mixed_grounding/gqa/images",
                json_file="../datasets/mixed_grounding/annotations/final_mixed_train_no_coco_segm.json",
            ),
            dict(
                img_path="../datasets/object365/images",
                json_file="../datasets/object365/annotations/objects365_train_segm.json",
            ),
        ],
    ),
    val=dict(yolo_data=["lvis.yaml"]),
)
I haven't tried Objects365 this way as I don't have the dataset locally for now. Please feel free to try it and let me know if there are any issues.
> Alternatively, I wonder if you could directly use the objects365_train_segm.json as grounding_data
Yes, that did the trick. I had previously run the training script without manually downloading Objects365 first, so it was auto-downloaded, but of course with detection annotations instead of segmentation ones.
I suppose a documentation update is in order...?
I'm glad to hear the solution worked, @xperroni! The issue was indeed related to annotation formats - the auto-downloaded Objects365 dataset uses detection annotations, while YOLOE segmentation models require proper segmentation mask annotations.
You make a valid point about documentation. This distinction between detection and segmentation annotations should be clearer in our docs, especially since the auto-downloaded version doesn't contain the required segmentation masks. I'll create an internal note to improve this documentation, making it more explicit that segmentation training requires different annotation formats than detection.
Thanks for reporting this issue and verifying the solution. This feedback helps us improve both the code and documentation for all users.
👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.
For additional resources and information, please see the links below:
- Docs: https://docs.ultralytics.com
- HUB: https://hub.ultralytics.com
- Community: https://community.ultralytics.com
Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!
Thank you for your contributions to YOLO 🚀 and Vision AI ⭐