YOLO-World
Question about pretrained weights
Hi, YOLO-World Team! Big shoutout to the team for bringing us such excellent work 🚀 and bringing an open-vocabulary detector to the real-time world! Thanks! 😄 I'm a core maintainer and ML engineer of Ultralytics YOLOv8, and recently I've been trying to migrate the YOLO-World weights into our YOLOv8 repo.
I've gotten really close. However, today I found that the weights from the Hugging Face YOLO-World repo are somewhat different from the ones in the current GitHub YOLO-World repo.
From the mAP tables it seems the ones on the GitHub page are better, but just to confirm, I'd like to ask which one is the major (better) set of weights, and what the difference between them is. Thanks!
OK, after checking out the config files: from the names `yolo_world_l_dual` and `yolo_world_l_t2i`, I guess the former uses both Text-guided CSPLayers and Image-Pooling Attention, while the latter only uses Text-guided CSPLayers. So I guess the weights on the GitHub page are the better ones (just as the metric table shows higher results). :)
Another question: I noticed there's an empty placeholder `" "` manually added here: https://github.com/AILab-CVC/YOLO-World/blob/7d43247a0e67af0858f63fdc0ec0a4a9fe0a79b4/demo.py#L49, which I think is probably a special trick to serve as the background prompt for the open-vocabulary detector?
Then I tested some images with and without this special empty placeholder `" "` and got inconsistent behavior: for some prompts the predictions look much better with the default empty placeholder (background), while other prompts give worse predictions with it.
Predicting with `prompt=kid`:
- with the empty placeholder `[" "]`: (image)
- without the empty placeholder `[" "]`: (image) — the one without `[" "]` has a much lower confidence score.

Predicting with `prompt=person`:
- with the empty placeholder `[" "]`: (image)
- without the empty placeholder `[" "]`: (image) — this time the one *with* `[" "]` has a much lower confidence score, which is the opposite result compared to `prompt=kid`.
Also, after some more tests, it seems that prompts (categories) contained in the COCO dataset get much higher scores without this empty placeholder `[" "]`, while other prompts, e.g. `men`, `women`, `kid`, `building`, and so on, get higher scores with it.
Does anyone know why it behaves this way? Thanks!
EDIT: the way I test without `[" "]` is by manually commenting out this part:

```python
texts = [[t.strip()] for t in text.split(',')]  # + [[' ']]
```
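Just to make the two settings concrete, here's a minimal sketch of the prompt-list construction (the helper name `build_texts` is mine, not from the repo; it only mirrors the one-liner from `demo.py` quoted above):

```python
def build_texts(text: str, pad_background: bool = True):
    """Split a comma-separated prompt string into per-class lists,
    optionally appending the empty ' ' placeholder that (I assume)
    acts as a background/padding class."""
    texts = [[t.strip()] for t in text.split(',')]
    if pad_background:
        texts += [[' ']]  # the empty placeholder in question
    return texts

print(build_texts('person,kid'))                        # [['person'], ['kid'], [' ']]
print(build_texts('person,kid', pad_background=False))  # [['person'], ['kid']]
```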
Hi @Laughing-q, thank you for your interest in YOLO-World. Exactly, we have two versions of YOLO-World. We provide the initial version (v1.0) of YOLO-World on GitHub (with T-CSPLayer and I-PoolingAttention) and a simpler version (v2.0) of YOLO-World (with only T-CSPLayer) on HuggingFace. We plan to release YOLO-World v2.0 with different model scales (S/M/L) and pre-trained weights later (not too long; maybe by the end of February).
@wondervictor Awesome! Thanks for the information!
This issue is most likely due to the use of category padding with `" "` during pre-training. The COCO dataset does not require padding, so the score without `" "` is higher.
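If that's the case, the padding during pre-training might look roughly like this (a hypothetical sketch only; `pad_categories` and `max_num` are illustrative names, not the actual YOLO-World training code):

```python
def pad_categories(classes, max_num=80, pad_token=' '):
    """Pad a per-image category list up to a fixed number of classes
    with the ' ' token, so the text encoder always sees max_num
    entries and the model learns to treat ' ' as 'no object'."""
    assert len(classes) <= max_num
    return list(classes) + [pad_token] * (max_num - len(classes))

print(pad_categories(['cat', 'dog'], max_num=4))  # ['cat', 'dog', ' ', ' ']
```

Under this assumption, a dataset whose vocabulary already fills `max_num` (like COCO's 80 classes) would never see the padding token, which could explain why COCO categories score higher without `[" "]` at inference.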
@Baboom-l Yeah, I was thinking the same thing, but I haven't really looked through the training code yet.
Hi @Laughing-q and @Baboom-l, we have updated YOLO-World with better accuracy and efficiency. This GitHub version is consistent with the HuggingFace version. Additionally, we provide different scales of YOLO-World-v2, from `s` to `x`, with pre-trained weights. You can have a try now!
@wondervictor Cool!