GenPromp
The parameter `test.combine_ratio` seems to have no effect when running inference
Hello,
The problem is that the parameter w, i.e. "{'test': {'combine_ratio': 0.6}}" from the readme.md, doesn't seem to have any effect when running inference.
I tried the values 0, 1, 0.1, and 0.9 and compared them with the default result of w=0.6 (also tested on 3 trained instances); all of these results (11 in total) are the same, like:
Cls@1:0.938 Cls@5:0.988 Loc@1:0.481 Loc@5:0.506 Loc_gt:0.514
M-ins:0.000 Part:0.008 More:0.440 Right:0.481 Wrong:0.008 Cls:0.062
(I'm trying to replicate how fr or fd works alone.)
I confirm that I successfully adjusted the parameter (source code location and the INFO output below):
"INFO: Test Class [0-9]: [dataset: cub] [eval mode: top1] [cam thr: 0.23] [combine ratio: 0.9]".
So I don't understand why the results are still the same.
After reading the paper, I understood w as:
f_c = (1-w) * f_d + w * f_r. As w approaches 0, f_c approaches f_d, and the heatmap's box becomes small and incomplete; as w approaches 1, f_c approaches f_r, and the box becomes much bigger because it is degraded by background noise.
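In code terms, I read this as a simple linear interpolation of the two embeddings (a minimal sketch of my understanding, not the project's actual implementation; the names are mine):

import torch

def combine_embeddings(f_d: torch.Tensor, f_r: torch.Tensor, w: float) -> torch.Tensor:
    # w = 0 keeps only the discriminative embedding f_d,
    # w = 1 keeps only the representative embedding f_r.
    return (1.0 - w) * f_d + w * f_r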
I successfully trained this project on a single 3090 24G and got results close to the paper,
by reducing train.batch_size and setting a larger gradient_accumulation_steps.
(This parameter is set to w=0.5 when running train_unet.)
Your results on CUB seem to be very low. A possible reason is that the model has overfitted in training stage 2. What is the performance of the model after training stage 1?
My earlier statement was imprecise. I trained 3 models and modified w on 2 of them for testing.
The numbers shown above are the training result after I changed the prompt template
(trained on the 3070, as a quick test of w).
And I reproduced your results using the default prompt template on the 3090 as follows:
Val Epoch: [0][2897/2897]
Cls@1:0.887 Cls@5:0.979 Loc@1:0.871 Loc@5:0.962 Loc_gt:0.982
M-ins:0.000 Part:0.014 More:0.001 Right:0.871 Wrong:0.000 Cls:0.113
- About "possible reason--overfitted in training stage 2".
Are you saying that in the paper, only using trained fr (set w=1) and SD's default unet for test on CUB ?
(or just set w=0.6, run test, Used to verify that it is worse than the default unet)
Like Table 2 in Section 6.3 (baseline) of the article, for ImageNet-1K:
Embedding Top-1 Loc Top-5 Loc GT-kno. Loc
fd 61.2 69.0 70.4
right?
I'll do both of the above right away and analyze/troubleshoot.
Once I do point 2, I'll know whether it's overfitting.
- About the environment
I experimented with a local 3070 8G and a rented server 3090 24G; the two experimental results shown so far are from the 3070 8G. I'm still experimenting with parameters. At least under the following conditions, the 3090 is less effective than the 3070:
Rented server 3090 24G {
Phase 1 train_token
train:
batch_size: 2 # originally 4
gradient_accumulation_steps: 2 # originally 1; the total amount is the same
GPU usage becomes: 90% 67C P2 273W / 350W | 19542MiB / 24576MiB | 100% Default
Phase 2 train_unet:
batch_size: 1 # originally 4
gradient_accumulation_steps: 64 # originally 16; the total amount is the same
GPU usage becomes: 86% 64C P2 289W / 350W | 22702MiB / 24576MiB | 100% Default
}
Local 3070 8G {
data:
keep_class: [0, 9] # default None, i.e. all 200 classes
Phase 1 train_token
train:
batch_size: 1 # originally 4
gradient_accumulation_steps: 2 # originally 1; this is equivalent to half of the original simulated batch
Phase 2 train_unet
train:
batch_size: 1 # originally 4
gradient_accumulation_steps: 4 # originally 16; the total amount is 16 times smaller. A value of 4 takes about 8h, 3 takes about 4.5h.
}
# By `total amount` I mean the original simulated batch = batch_size * gradient_accumulation_steps.
# batch_size is limited by video memory.
# gradient_accumulation_steps is chosen to keep the `total amount` close to the original while still running without `loss == NaN`.
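To be concrete, the accumulation pattern I mean is the standard PyTorch one below; a toy sketch under my own setup, not the project's actual training loop:

import torch

# Toy setup, just to make the pattern concrete.
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
dataloader = [(torch.randn(1, 8), torch.randn(1, 1)) for _ in range(128)]  # batch_size = 1

accumulation_steps = 64  # effective batch = batch_size * accumulation_steps

optimizer.zero_grad()
for step, (x, y) in enumerate(dataloader):
    loss = torch.nn.functional.mse_loss(model(x), y) / accumulation_steps  # scale so gradients average
    loss.backward()                                                        # gradients accumulate in-place
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()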
About the 'worse' result using my prompt on the 3090 that I mentioned:
my prompt, all 200 classes, on the 3090, test:
Val Epoch: [0][2897/2897]
Cls@1:0.887 Cls@5:0.979 Loc@1:0.308 Loc@5:0.334 Loc_gt:0.341
M-ins:0.000 Part:0.184 More:0.060 Right:0.308 Wrong:0.336 Cls:0.113
my prompt, classes [0, 9], on the 3070, test:
Val Epoch: [0][243/243]
Cls@1:0.938 Cls@5:0.988 Loc@1:0.481 Loc@5:0.506 Loc_gt:0.514
M-ins:0.000 Part:0.008 More:0.440 Right:0.481 Wrong:0.008 Cls:0.062
# The number of categories and the training parameters are different. Both were retrained independently (stages 1 & 2),
# just for a rough comparison.
Thank you very much for the reply!
Reply 1. Based on the result, it seems that you have successfully reproduced the paper result, i.e., 98.2 gt-known loc on CUB, ovo.
Reply 2. In fact, you can also get good performance on CUB using SD's default unet (98.0 gt-known loc), and a slight performance boost if you finetune the unet (98.3 gt-known loc). On CUB, we found that fd without training is good enough, probably because CUB does not have enough data to train a robust fr, and performance on CUB is nearing saturation.
Reply 3. Based on our experiments, the results of the paper can be reproduced using 3090, and a similar GPU should also work.
As you said, I noticed that the accuracy really didn't change much before and after training the unet on the CUB dataset (checked every 40 steps), as shown in the figure.
I think it may be that CUB, as a fine-grained dataset, on the one hand does not take good advantage of CLIP's pre-training, and on the other hand the fr generated by this method does not help (I tried setting w to 1 and to 0, and the results are the same). (Maybe it's because the fd embeddings are all birds, so CLIP can't distinguish them.)
In order to take advantage of GenPromp's innovative approach, I tried switching to a dataset where fd might be more discriminative:
normal categorical datasets, such as the small-scale caltech101 (which I think is more familiar to CLIP) and the animal dataset OxfordPets.
But I ran into a problem that I couldn't solve: how is the json generated?
GenPromp\ckpts\classification\cub_efficientnetb7.json
GenPromp\datasets\base.py uses test.load_class_path from the yaml config:
self.pred_logits = None
if self.test_mode:
    # Load the per-image classification results produced offline by a classifier.
    with open(self.load_class_path, 'r') as f:
        name2result = json.load(f)
    # One per-class score vector per image, used as the class predictions at test time.
    self.pred_logits = [torch.Tensor(name2result[name]['pred_scores']) for name in self.names]
    self.image_paths = [os.path.join(image_dir, name + '.JPEG') for name in self.names]
self.num_images = len(self.labels)
The two mentions of 'benchmark' in the paper seem to refer to this json file.
The json file's contents look like:
{"001.Black_footed_Albatross/Black_Footed_Albatross_0046_18.jpg": {"pred_label": 0, "pred_score": 0.7712172865867615, "pred_scores": [0.7712172865867615, 0.1352606564760208, 0.07805578410625458,
...
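My reading of each entry is that pred_scores is the full per-class probability vector, from which the top-1/top-5 predictions can be taken, e.g. (a small sketch of my own, not the project's code):

import json
import torch

with open("ckpts/classification/cub_efficientnetb7.json") as f:
    name2result = json.load(f)
# Pull top-1 / top-5 class predictions from one image's per-class probabilities.
scores = torch.tensor(next(iter(name2result.values()))["pred_scores"])
topk_scores, topk_labels = scores.topk(k=min(5, scores.numel()))
top1_label = int(topk_labels[0])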
Hopefully you can tell me where and how this file is generated.
- We find that fr is effective on the more difficult ImageNet, while it is less effective on CUB.
- This json file contains the classification results of a classification network finetuned in mmpretrain. This project only provides the classification results for ImageNet and CUB. If you need to do WSOL on other datasets, you can use gtk mode, which uses the GT class as the prompt and requires no classification results.
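A minimal sketch of how a compatible json could be produced from any finetuned classifier, matching the structure of the cub_efficientnetb7.json excerpt above; this is an assumption, not the actual mmpretrain pipeline, and the function and variable names are hypothetical:

import json
import torch

@torch.no_grad()
def export_classification_json(model, samples, out_path, device="cuda"):
    # `samples` is assumed to yield (image_tensor_CHW, relative_image_name) pairs
    # for the test split; `model` is any finetuned classifier returning logits.
    model.eval().to(device)
    name2result = {}
    for image, name in samples:
        logits = model(image.unsqueeze(0).to(device))           # (1, num_classes)
        scores = torch.softmax(logits, dim=1).squeeze(0).cpu()  # per-class probabilities
        name2result[name] = {
            "pred_label": int(scores.argmax()),
            "pred_score": float(scores.max()),
            "pred_scores": scores.tolist(),
        }
    with open(out_path, "w") as f:
        json.dump(name2result, f)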