
The parameter `test.combine_ratio` seems to have no effect when running inference

Open KevinLi-167 opened this issue 1 year ago • 5 comments

Hello,

The problem is that the parameter w, i.e. `"{'test': {'combine_ratio': 0.6}}"` from the readme.md, doesn't seem to have any effect when running inference. I tried the values 0, 1, 0.1, and 0.9 and compared them with the default result at w=0.6 (also tested on 3 trained instances), and all of these results are identical, like:

Cls@1:0.938     Cls@5:0.988     Loc@1:0.481     Loc@5:0.506     Loc_gt:0.514
M-ins:0.000     Part:0.008      More:0.440      Right:0.481     Wrong:0.008     Cls:0.062

(I'm trying to see how fr or fd behaves on its own.)

I confirmed that the parameter is actually being applied (source code location and output INFO below): "INFO: Test Class [0-9]: [dataset: cub] [eval mode: top1] [cam thr: 0.23] [combine ratio: 0.9]". So I don't understand why the results are still identical.

After reading the paper, I understand w as:

fc = (1 - w) * fd + w * fr

When w → 0, fc → fd, so the heatmap's box becomes small and incomplete; when w → 1, fc → fr, so the box becomes much larger because it is degraded by background noise.
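In code, my understanding is roughly the following (just a minimal sketch; `combine_embeddings`, `fd_emb`, `fr_emb` and the tensor shape are placeholders, not the repo's actual API):

```python
import torch

def combine_embeddings(fd_emb: torch.Tensor, fr_emb: torch.Tensor, w: float) -> torch.Tensor:
    """Linearly combine the discriminative (fd) and representative (fr) embeddings.

    w = 0 -> purely fd, w = 1 -> purely fr.
    """
    return (1.0 - w) * fd_emb + w * fr_emb

# Different w values should give different combined embeddings,
# so the downstream heatmaps should differ as well.
fd_emb = torch.randn(1, 77, 768)  # placeholder shape for a CLIP text embedding
fr_emb = torch.randn(1, 77, 768)
print(torch.allclose(combine_embeddings(fd_emb, fr_emb, 0.1),
                     combine_embeddings(fd_emb, fr_emb, 0.9)))  # expect: False
```

So I would expect different w values to change the localization results.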

I successfully trained this project on a single 3090 24G and got results close to the paper, by reducing train.batch_size and setting a larger gradient_accumulation_steps. (The parameter w is set to 0.5 during train_unet.)

KevinLi-167 avatar Apr 16 '24 17:04 KevinLi-167

Your results on CUB seem very low. A possible reason is that the model has overfitted in training stage 2. What is the performance of the model after training stage 1?

callsys avatar Apr 16 '24 23:04 callsys

My earlier statement was imprecise. I trained 3 models and tried modifying w on 2 of them for testing.

The numbers shown above are from a run after I changed the prompt template (on the 3070, as a simple test of w).

I also reproduced your results with the default prompt template on the 3090, as follows:


Val Epoch: [0][2897/2897]	
Cls@1:0.887	Cls@5:0.979	Loc@1:0.871	Loc@5:0.962	Loc_gt:0.982
M-ins:0.000	Part:0.014	More:0.001	Right:0.871	Wrong:0.000	Cls:0.113

  1. About "possible reason--overfitted in training stage 2".

Are you saying that in the paper, only the trained fr (set w=1) and SD's default unet are used for the test on CUB? (Or just set w=0.6 and run the test, to verify that it is worse than the default unet?) Like the baseline in Section 6.3 of the paper, Table 2 on ImageNet-1K:

| Embedding | Top-1 Loc | Top-5 Loc | GT-kno. Loc |
| --------- | --------- | --------- | ----------- |
| fd        | 61.2      | 69.0      | 70.4        |

right?

I'll do this right away and analyze/troubleshoot (both of the above). Once I finish point 2, I'll know whether it's overfitting.

  2. About the environment

I experimented with a local 3070 8G and a rented server 3090 24G; the two experimental results shown so far are from the 3070 8G. I'm still experimenting with parameters. At least under the following conditions, the 3090 is less effective than the 3070:


Rented server 3090 24G {
Phase 1 train_token
    train:
        batch_size: 2 # origin 4
        gradient_accumulation_steps: 2 # origin 1. The total amount is the same.
    # GPU: 90% 67C P2 273W / 350W | 19542MiB / 24576MiB | 100% Default

Phase 2 train_unet
    train:
        batch_size: 1 # origin 4
        gradient_accumulation_steps: 64 # origin 16. The total amount is the same.
    # GPU: 86% 64C P2 289W / 350W | 22702MiB / 24576MiB | 100% Default
}

Local 3070 8G {
    data:
        keep_class: [0, 9] # None means all 200 classes

Phase 1 train_token
    train:
        batch_size: 1 # origin 4
        gradient_accumulation_steps: 2 # origin 1. This is equivalent to half of the original simulated batch.

Phase 2 train_unet
    train:
        batch_size: 1 # origin 4
        gradient_accumulation_steps: 4 # origin 16. The total amount is 16x smaller. 4 takes 8h, 3 takes 4.5h.
}

# By `total amount` I mean the original simulated batch = batch_size * gradient_accumulation_steps.
# batch_size is limited by GPU memory.
# gradient_accumulation_steps is chosen to keep the `total amount` close to the original while still running without `loss == NaN`.
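The pattern I use is just standard gradient accumulation, something like this (a generic PyTorch sketch with toy data, not the repo's actual training loop):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy model/data only to illustrate the accumulation pattern.
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = DataLoader(TensorDataset(torch.randn(64, 16), torch.randn(64, 1)), batch_size=1)

gradient_accumulation_steps = 4  # effective batch = batch_size * gradient_accumulation_steps

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = nn.functional.mse_loss(model(x), y)
    (loss / gradient_accumulation_steps).backward()  # scale so accumulated grads match one larger batch
    if (step + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```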


Regarding the 'worse' result I mentioned, using my prompt on the 3090:

my prompt, All200 classes on 3090, test:

Val Epoch: [0][2897/2897]	
Cls@1:0.887	Cls@5:0.979	Loc@1:0.308	Loc@5:0.334	Loc_gt:0.341
M-ins:0.000	Part:0.184	More:0.060	Right:0.308	Wrong:0.336	Cls:0.113

my prompt, [0,9] classes on 3070, test:

Val Epoch: [0][243/243]	
Cls@1:0.938	Cls@5:0.988	Loc@1:0.481	Loc@5:0.506	Loc_gt:0.514
M-ins:0.000	Part:0.008	More:0.440	Right:0.481	Wrong:0.008	Cls:0.062

# The number of categories and training parameters differ. Both are retrained independently (stages 1 & 2),
# just for a simple comparison.

Thank you very much for the reply!

KevinLi-167 avatar Apr 17 '24 10:04 KevinLi-167

Reply 1. Based on the result, it seems that you have successfully reproduced the paper result, i.e., 98.2 gt-known loc on CUB, ovo.

Reply 2. In fact, you can also get good performance on CUB using SD's default unet (98.0 gt-known loc), and a slight performance boost if you finetune the unet (98.3 gt-known loc). On CUB, we found that fd without training is good enough, probably because CUB does not have enough data to train a robust fr, and performance on CUB is nearing saturation.

Reply 3. Based on our experiments, the results of the paper can be reproduced using a 3090, and a similar GPU should also work.

callsys avatar Apr 17 '24 12:04 callsys

As you said, I noticed that the accuracy really didn't change much before and after training the unet on the CUB dataset (evaluated every 40 steps), as shown in the figure.

(screenshot: reproduced results)

I think it may be that CUB, as a fine-grained dataset, on the one hand does not take good advantage of CLIP's pre-training, and on the other hand the fr produced by this method does not help (I tried setting w to 1 and to 0, and the results are the same). (Maybe it's because the fds are all birds, so CLIP can't distinguish them.)

To take advantage of GenPromp's innovative approach, I tried switching to a dataset on which fd can distinguish the classes more easily.

For example, ordinary classification datasets, such as the small-scale Caltech101 (whose classes I think are more familiar to CLIP) and the animal dataset OxfordPets.

But I ran into a problem I couldn't solve: how the json file GenPromp\ckpts\classification\cub_efficientnetb7.json is generated.

GenPromp\datasets\base.py uses test.load_class_path from the yaml:

self.pred_logits = None
if self.test_mode:
    # Load the external classifier's prediction scores for each test image.
    with open(self.load_class_path, 'r') as f:
        name2result = json.load(f)
        self.pred_logits = [torch.Tensor(name2result[name]['pred_scores']) for name in self.names]
self.image_paths = [os.path.join(image_dir, name + '.JPEG') for name in self.names]
self.num_images = len(self.labels)

The two mentions of a benchmark in the paper seem to refer to this json file.

The json file's contents look like:

{"001.Black_footed_Albatross/Black_Footed_Albatross_0046_18.jpg": {"pred_label": 0, "pred_score": 0.7712172865867615, "pred_scores": [0.7712172865867615, 0.1352606564760208, 0.07805578410625458, 
...

I hope you can tell me where and how this file is generated.

KevinLi-167 avatar Apr 18 '24 08:04 KevinLi-167

  1. We find that fr is effective on the more difficult ImageNet, while less effective on CUB.
  2. This json file is the classification result of a (finetuned) classification network from mmpretrain. This project only contains the classification results for ImageNet and CUB. If you need to do WSOL on other datasets, you can use gtk mode, which uses the gt class as the prompt and requires no classification results. A rough sketch of generating such a json is below.
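Only a sketch, not the exact script used for this repo (the config/checkpoint paths are placeholders and the mmpretrain `ImageClassificationInferencer` usage is an assumption), but generating such a json roughly looks like this:

```python
import json
from mmpretrain import ImageClassificationInferencer  # assumed mmpretrain >= 1.x API

# Hypothetical config/checkpoint of a classifier fine-tuned on the target dataset.
inferencer = ImageClassificationInferencer(
    model='path/to/efficientnet_b7_finetuned_config.py',
    pretrained='path/to/efficientnet_b7_finetuned.pth',
)

image_names = ['001.Black_footed_Albatross/Black_Footed_Albatross_0046_18.jpg']  # etc.
name2result = {}
for name in image_names:
    res = inferencer(f'data/CUB_200_2011/images/{name}')[0]
    name2result[name] = {
        'pred_label': int(res['pred_label']),
        'pred_score': float(res['pred_score']),
        'pred_scores': [float(s) for s in res['pred_scores']],
    }

with open('ckpts/classification/my_dataset_classifier.json', 'w') as f:
    json.dump(name2result, f)
```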

callsys avatar Apr 18 '24 08:04 callsys