YOLO-World
Question about pretrained weights
Hi, YOLO-World Team! Big shoutout to the team for bringing us such excellent work 🚀 and bringing an open-vocabulary detector to the real-time world! Thanks! 😄 I'm a core maintainer and ML engineer of Ultralytics YOLOv8, and recently I've been trying to migrate the YOLO-World weights into our YOLOv8 repo.
I've gotten really close. However, today I found that the weights from the Hugging Face YOLO-World repo are somewhat different from the ones in the current GitHub YOLO-World repo.
From the mAP tables it seems the ones on the GitHub page are better, but just to confirm, I'd like to ask which one is the major (better) set of weights, and what the difference between them is. Thanks!
OK, after checking out the config files: from the names `yolo_world_l_dual` and `yolo_world_l_t2i`, I guess the former uses both Text-guided CSPLayers and Image-Pooling Attention, while the latter only uses Text-guided CSPLayers. So I guess the weights on the GitHub page are the better ones (just as the metric table shows higher results). :)
Another question: I noticed there's an empty placeholder `" "` manually added here: https://github.com/AILab-CVC/YOLO-World/blob/7d43247a0e67af0858f63fdc0ec0a4a9fe0a79b4/demo.py#L49, which I think is probably a special trick to serve as the background prompt for the open-vocabulary detector?
Then I tested some images with and without this special empty placeholder `" "` and got inconsistent behavior: for some prompts the predictions look much better with the default empty placeholder (background), while other prompts give worse predictions with it.
Predicting with `prompt=kid`:
- with the empty placeholder `[" "]`: (image)
- without the empty placeholder `[" "]`: (image) — the one without `[" "]` has a much lower confidence score.

Predicting with `prompt=person`:
- with the empty placeholder `[" "]`: (image)
- without the empty placeholder `[" "]`: (image) — this time the one *with* `[" "]` has a much lower confidence score, which is the opposite result compared to `prompt=kid`.
Also, after some more tests, it seems that prompts (categories) contained in the COCO dataset get much higher scores without this empty placeholder `[" "]`, while other prompts, e.g. `men`, `women`, `kid`, `building`, and so on, get higher scores with it.
Does anyone know why it behaves this way? Thanks!
EDIT: the way I test without `[" "]` is by manually commenting out this part:

```python
texts = [[t.strip()] for t in text.split(',')]  # + [[' ']]
```
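Just to make the two settings concrete, here's a minimal sketch of the prompt-list construction (the helper name `build_texts` is mine, not from the repo; it only mirrors the one-liner from `demo.py` quoted above):

```python
def build_texts(text: str, pad_background: bool = True):
    """Split a comma-separated prompt string into per-class lists,
    optionally appending the empty ' ' placeholder that (I assume)
    acts as a background/padding class."""
    texts = [[t.strip()] for t in text.split(',')]
    if pad_background:
        texts += [[' ']]  # the empty placeholder in question
    return texts

print(build_texts('person,kid'))                        # [['person'], ['kid'], [' ']]
print(build_texts('person,kid', pad_background=False))  # [['person'], ['kid']]
```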
Hi @Laughing-q, thank you for your interest in YOLO-World. Exactly, we have two versions of YOLO-World. We provide the initial version (v1.0) of YOLO-World on GitHub (with T-CSPLayer and I-PoolingAttention) and a simpler version (v2.0) of YOLO-World (with only T-CSPLayer) on HuggingFace. We plan to release YOLO-World v2.0 with different model scales (S/M/L) and pre-trained weights later (not too long; maybe by the end of February).
@wondervictor Awesome! Thanks for the information!
This issue is most likely due to the use of category padding with `" "` during pre-training. The COCO dataset does not require padding, so the score without `" "` is higher.
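If that's the case, the padding during pre-training might look roughly like this (a hypothetical sketch only; `pad_categories` and `max_num` are illustrative names, not the actual YOLO-World training code):

```python
def pad_categories(classes, max_num=80, pad_token=' '):
    """Pad a per-image category list up to a fixed number of classes
    with the ' ' token, so the text encoder always sees max_num
    entries and the model learns to treat ' ' as 'no object'."""
    assert len(classes) <= max_num
    return list(classes) + [pad_token] * (max_num - len(classes))

print(pad_categories(['cat', 'dog'], max_num=4))  # ['cat', 'dog', ' ', ' ']
```

Under this assumption, a dataset whose vocabulary already fills `max_num` (like COCO's 80 classes) would never see the padding token, which could explain why COCO categories score higher without `[" "]` at inference.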
@Baboom-l Yeah, I was thinking the same thing, but I haven't really looked through the training code yet.
Hi @Laughing-q and @Baboom-l, we have updated YOLO-World with better accuracy and efficiency. This GitHub version is consistent with the HuggingFace version. Additionally, we provide different scales of YOLO-World-v2, from `s` to `x`, with pre-trained weights. You can have a try now!
@wondervictor Cool!