
Zero-shot performance of YOLOWorldPromptDetector

Open taofuyu opened this issue 1 year ago • 41 comments

I ran into the same issue as before, #71, #78. I modified the config in configs/prompt_tuning_coco/ and generated a custom embedding file to fine-tune on my dataset, which has 4 categories. At inference time, I generate a new embedding file with 7 categories (the 4 old classes seen in training plus 3 new classes) and replace the old embedding file in the config. These 3 new classes CANNOT be detected, even with the score threshold set to 0.01. It seems the model has lost its open-vocabulary/zero-shot ability.
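For reference, the embedding file in question is just a (num_classes, dim) float array saved as .npy; the sketch below shows only the file format, with random vectors standing in for the real CLIP text embeddings (the filename and the 7x512 shape are hypothetical):

```python
import numpy as np

# Hypothetical shapes: 7 classes, 512-dim text embeddings. In practice each
# row comes from the CLIP text encoder; random vectors stand in here purely
# to illustrate the .npy file format being swapped into the config.
embeddings = np.random.randn(7, 512).astype(np.float32)
# CLIP-style L2 normalization of each class embedding.
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
np.save('custom_7class_embeddings.npy', embeddings)

loaded = np.load('custom_7class_embeddings.npy')
print(loaded.shape)  # (7, 512)
```

Whatever file is referenced in the config must agree with num_classes, otherwise the head and the class list fall out of sync.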

taofuyu avatar Mar 19 '24 08:03 taofuyu

Hi @taofuyu, you need to freeze all parameters (backbone, head, and neck) except the embeddings. However, I need to double-check whether all layers are frozen.
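A minimal sketch of that freezing step in PyTorch; `ToyDetector` and its `embeddings` attribute are illustrative stand-ins, not the real YOLO-World module names:

```python
import torch.nn as nn

class ToyDetector(nn.Module):
    """Stand-in for the detector: backbone/neck/head plus prompt embeddings."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)
        self.neck = nn.Linear(8, 8)
        self.head = nn.Linear(8, 4)
        self.embeddings = nn.Embedding(4, 8)  # the trainable prompts

def freeze_for_prompt_tuning(model: nn.Module) -> None:
    # Freeze every parameter first...
    for p in model.parameters():
        p.requires_grad = False
    # ...then unfreeze only the prompt embeddings.
    for p in model.embeddings.parameters():
        p.requires_grad = True

model = ToyDetector()
freeze_for_prompt_tuning(model)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['embeddings.weight']
```

Checking `named_parameters()` like this is also a quick way to verify that all layers really are frozen.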

wondervictor avatar Mar 19 '24 09:03 wondervictor

OK, I will give it a try and update with the result.

taofuyu avatar Mar 19 '24 09:03 taofuyu

You can evaluate the 4-category detection and 3-category detection separately and then perform the joint evaluation.

wondervictor avatar Mar 19 '24 09:03 wondervictor

But if the parameters of the backbone, head, and neck are all frozen, and the only updated parameters (the embeddings) are not saved to disk (during inference, the pre-computed embedding file is still used), then it seems nothing in the model has changed?

taofuyu avatar Mar 19 '24 09:03 taofuyu

This seems to validate my idea. After running 10 epochs, the model can only detect 'car', which appears in the pre-training datasets; the other new categories cannot be detected (they can be detected when the model is not frozen).

taofuyu avatar Mar 19 '24 09:03 taofuyu

@taofuyu Do you know the difference between all_fine_tuning and prompt tuning? I'm not clear on the all_fine_tuning config file.

Hudaodao99 avatar Mar 19 '24 09:03 Hudaodao99

@taofuyu Do you know the difference between all_fine_tuning and prompt tuning? I'm not clear on the all_fine_tuning config file.

You can compare the two config files with VS Code or a diff tool. The main difference is the value of freeze_all (True or False).

taofuyu avatar Mar 19 '24 09:03 taofuyu

@Hudaodao99 It's my fault; I should have started a separate branch to avoid confusion. Prompt tuning only optimizes the embeddings, while full fine-tuning optimizes all parameters and removes the need for a text encoder.

wondervictor avatar Mar 19 '24 10:03 wondervictor

But if the parameters of the backbone, head, and neck are all frozen, and the only updated parameters (the embeddings) are not saved to disk (during inference, the pre-computed embedding file is still used), then it seems nothing in the model has changed?

@wondervictor

taofuyu avatar Mar 19 '24 12:03 taofuyu

@Hudaodao99 It's my fault; I should have started a separate branch to avoid confusion. Prompt tuning only optimizes the embeddings, while full fine-tuning optimizes all parameters and removes the need for a text encoder.

Thanks for your answer!

Hudaodao99 avatar Mar 20 '24 01:03 Hudaodao99

But if the parameters of the backbone, head, and neck are all frozen, and the only updated parameters (the embeddings) are not saved to disk (during inference, the pre-computed embedding file is still used), then it seems nothing in the model has changed?

@taofuyu I'll check it.

wondervictor avatar Mar 20 '24 06:03 wondervictor

@taofuyu I met the same problem. But during prompt tuning on my custom dataset (10 classes), I find that if I pass fewer than 10 text prompts, I get an error like this (I passed just 2 text prompts, neither of which is in my dataset, but the predicted class indices go beyond 2):

class= [1 2 4 4 3 6]
confidence= [0.97107273 0.90503085 0.8864812 0.86314565 0.32898653 0.20567985]
Traceback (most recent call last):
  File "/data/yolo_world_finetune/YOLO-World-0319v2/image_demo.py", line 198, in
    inference_detector(runner,
  File "/data/yolo_world_finetune/YOLO-World-0319v2/image_demo.py", line 108, in inference_detector
    labels = [
  File "/data/yolo_world_finetune/YOLO-World-0319v2/image_demo.py", line 109, in
    f"{texts[class_id][0]} {confidence:0.2f}" for class_id, confidence in
IndexError: list index out of range

Have you run into the same issue?

Hudaodao99 avatar Mar 20 '24 07:03 Hudaodao99

@taofuyu I met the same problem. But during prompt tuning on my custom dataset (10 classes), I find that if I pass fewer than 10 text prompts, I get an error like this (I passed just 2 text prompts, neither of which is in my dataset, but the predicted class indices go beyond 2):

class= [1 2 4 4 3 6]
confidence= [0.97107273 0.90503085 0.8864812 0.86314565 0.32898653 0.20567985]
Traceback (most recent call last):
  File "/data/yolo_world_finetune/YOLO-World-0319v2/image_demo.py", line 198, in
    inference_detector(runner,
  File "/data/yolo_world_finetune/YOLO-World-0319v2/image_demo.py", line 108, in inference_detector
    labels = [
  File "/data/yolo_world_finetune/YOLO-World-0319v2/image_demo.py", line 109, in
    f"{texts[class_id][0]} {confidence:0.2f}" for class_id, confidence in
IndexError: list index out of range

Have you run into the same issue?

The detections still follow the embeddings/num_classes set in the config, while the texts are whatever you typed on the command line; when the counts differ, the dimensions don't match. The correct approach: at test time, generate new embeddings for exactly the classes you need, change num_classes accordingly, and keep them consistent with the texts on the command line.
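A minimal Python sketch of that mismatch (the prompt strings and class ids below are made up; the real ones come from your command line and config):

```python
texts = [["dog"], ["cat"]]   # 2 prompts typed on the command line
class_ids = [1, 2, 4]        # predicted ids still index the 7-class embedding file

def label_for(cid):
    # This mirrors the crashing line in image_demo.py: texts[class_id][0]
    # raises IndexError whenever cid >= len(texts).
    return texts[cid][0]

covered = [label_for(c) for c in class_ids if c < len(texts)]
print(covered)  # ['cat'] -- only ids with a matching prompt survive
```

The guard above only hides the symptom; the real fix is what's described here: regenerate the embeddings and num_classes for exactly the classes you test with, so every predicted id has a prompt.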

taofuyu avatar Mar 20 '24 08:03 taofuyu

The detections still follow the embeddings/num_classes set in the config, while the texts are whatever you typed on the command line; when the counts differ, the dimensions don't match. The correct approach: at test time, generate new embeddings for exactly the classes you need, change num_classes accordingly, and keep them consistent with the texts on the command line.

Thanks!

Hudaodao99 avatar Mar 21 '24 01:03 Hudaodao99

I'm trying to find a way out of this issue, so I've been learning more about OVD algorithms. MM-Grounding-DINO mentions that closed-set fine-tuning loses OVD generality. Maybe that is why my model cannot detect these 3 new classes. I'm not sure; take it as a reference. @wondervictor

taofuyu avatar Mar 21 '24 06:03 taofuyu

Furthermore, it mentions that mixing COCO data with some of the pre-training data improves performance on COCO as much as possible without compromising generalization. My experiments confirm this: I mixed Flickr30k/GQA with my custom data to train YOLOWorldDetector, and the model can detect my categories while retaining OVD ability. But if so, it means YOLOWorldPromptDetector can only be fine-tuned as a closed-set detector, since grounding data cannot be used when training YOLOWorldPromptDetector.

taofuyu avatar Mar 21 '24 07:03 taofuyu

We did not expect this; the original intention of prompt tuning is to retain the zero-shot capability and generalization while achieving stronger performance on custom datasets.

wondervictor avatar Mar 21 '24 07:03 wondervictor

Hi @taofuyu, it seems the configs in configs/prompt_tuning_coco mistakenly use base_lr=2e-3; that was my mistake. For fine-tuning all modules, base_lr should be set to 2e-4. As for training prompts only, I'm going to check again.
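For reference, the corrected setting would look something like this in an mmengine-style config; only base_lr is the actual fix here, and the other optimizer fields are illustrative:

```python
# Corrected learning rate for fine-tuning all modules
# (configs/prompt_tuning_coco previously had 2e-3 by mistake).
base_lr = 2e-4
optim_wrapper = dict(
    # Optimizer type and weight_decay are illustrative; follow the
    # values in the repo's existing fine-tune configs.
    optimizer=dict(type='AdamW', lr=base_lr, weight_decay=0.05),
)
```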

wondervictor avatar Mar 21 '24 08:03 wondervictor

Hi @taofuyu, it seems the configs in configs/prompt_tuning_coco mistakenly use base_lr=2e-3; that was my mistake. For fine-tuning all modules, base_lr should be set to 2e-4. As for training prompts only, I'm going to check again.

Thanks, I had already changed the lr to 2e-4 for my fine-tuning.

taofuyu avatar Mar 21 '24 08:03 taofuyu

@Hudaodao99 It's my fault, I should start a branch to avoid misleading. The prompt tuning only optimizes the embeddings while the all finetuning optimizes all parameters without the need of a text encoder.

@wondervictor Hi! I'm not quite sure what the difference is between the purposes of all-tuning and prompt-tuning. Can all-tuning achieve open-vocabulary detection and custom detection together, like prompt-tuning? Also, through prompt-tuning, can we generate and export our own custom npy file?

Hudaodao99 avatar Mar 22 '24 10:03 Hudaodao99

@taofuyu Hi, how did fine-tuning go after you changed the learning rate to 2e-4? Does it solve the problem of losing open-set detection ability after fine-tuning?

mio410 avatar Mar 26 '24 06:03 mio410

I have the same problem. After fine-tuning on my own dataset locally (20 classes, each with a different text prompt), I want to retain the zero-shot ability of the original pre-trained CLIP weights, but the results suggest otherwise. Common prompts such as 'person', 'people', and 'human' can all be detected, but for my own dataset, other texts cannot be detected.

xiyangyang99 avatar Mar 27 '24 00:03 xiyangyang99

@mio410 No. @xiyangyang99 Same question. @wondervictor Hello, any updates on this issue?

taofuyu avatar Apr 03 '24 03:04 taofuyu

Hi @taofuyu, @xiyangyang99, @Hudaodao99, and @mio410, sorry for the delay. I'll check it and provide solutions asap. Please stay tuned and please let me know if you have any updates.

wondervictor avatar Apr 03 '24 08:04 wondervictor

Can separate inference solve the problem? It occurs to me that interference between prompts may be the cause. @taofuyu

Yindong-Zhang avatar Apr 08 '24 07:04 Yindong-Zhang

Can separate inference solve the problem? It occurs to me that interference between prompts may be the cause. @taofuyu

Sorry, could you please explain this in more detail?

taofuyu avatar Apr 08 '24 08:04 taofuyu

One text prompt may interfere with the inference of another; you can refer to the text-guided CSPLayer in the paper. I would also like to use the prompt tuning technique and hope to solve this issue, as mentioned in https://github.com/AILab-CVC/YOLO-World/issues/154#issuecomment-2006452067. If separate inference and evaluation works correctly, it may sidestep the problem.

Yindong-Zhang avatar Apr 08 '24 13:04 Yindong-Zhang

@taofuyu Any update? (In case you didn't notice the answer above.)

Yindong-Zhang avatar Apr 15 '24 13:04 Yindong-Zhang

@Yindong-Zhang, ongoing

wondervictor avatar Apr 15 '24 14:04 wondervictor

I think just tuning on custom data mixed with GoldG is fine. The model can detect custom categories and retain OVD ability at the same time.

taofuyu avatar Apr 16 '24 01:04 taofuyu