Hi, I'm running into a weight structure mismatch when I run test.py:
if args.EVALUATION.ckpt_used is not None:
    filepath = os.path.join(root_model, f'{args.EVALUATION.ckpt_used}.pth')
    assert os.path.isfile(filepath), filepath
    print("=> loading model weight '{}'".format(filepath), flush=True)
    checkpoint = torch.load(filepath)
    model.load_state_dict(checkpoint['state_dict'])
    print("=> loaded model weight '{}'".format(filepath), flush=True)
https://github.com/Seunggu0305/VLCounter/blob/2dc15ddd218744c2c3c63b667fa0bc4a24ce8c3c/tools/models/VLCounter.py#L36
Can you comment out the above line and try running the test again?
Sorry, I set the flag variable to False as you said, but it still doesn't work. I hope to get your help.
You should leave the flag variable set to True and comment out the mentioned line. Try replacing VLCounter.py L36~L39 with the code below.
# if flag:
self.gn = nn.GroupNorm(8, out_channels)
self.gelu = nn.GELU()
self.up = nn.UpsamplingBilinear2d(scale_factor=2)
Hello dear author, I noticed that the weight for the contrastive loss is set to 1e-6, which suggests that contrastive learning doesn't play a major role. May I ask why you set the weight so small?
Thank you very much for your previous patience, and I look forward to your response.
Was the problem solved by the method mentioned above?
The reason for setting the lambda value small is simply to balance the scales of the two loss terms. Since the value of the L2 loss is very small, the lambda on the contrastive loss should also be small; with the scales matched, contrastive learning can still be considered to play a significant role.
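To illustrate the scale argument (this is not the actual training code; the loss names and magnitudes below are hypothetical), the small lambda only brings the contrastive term down to the scale of the tiny L2 term:

import torch
import torch.nn.functional as F

# Hypothetical magnitudes, only to illustrate why a lambda of 1e-6 balances the scales.
pred_density = torch.rand(4, 1, 64, 64) * 1e-2   # density maps have tiny values
gt_density = torch.rand(4, 1, 64, 64) * 1e-2
loss_l2 = F.mse_loss(pred_density, gt_density)   # on the order of 1e-5

loss_contrast = torch.tensor(3.0)                # contrastive losses are typically O(1)
lam = 1e-6                                       # rescales the contrastive term

total_loss = loss_l2 + lam * loss_contrast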
Hello author, first of all, thank you for your patient reply. My previous questions have been answered, and I would like to express my sincere thanks!
I still have a question for you. While reproducing your code, I found that the pre-trained weights loaded by both the visual and text encoders in VLCounter.py are ViT-B-16.pt. May I ask the reason for that? Also, after checking the Hugging Face website, I found that the CLIP weights are distributed as "pytorch_model.bin", so how did you obtain "ViT-B-16.pt"? Looking forward to your reply!
You can download the *.pt weight files of CLIP from the original repo:
- https://github.com/openai/CLIP/blob/a1d071733d7111c9c014f024669f959182114e33/clip/clip.py#L30-L40
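For reference, the official clip package downloads exactly that .pt file on first use (a small sketch assuming the package is installed; the download_root path below is just an example):

import clip

# Downloads ViT-B-16.pt (the JIT checkpoint, not Hugging Face's pytorch_model.bin)
# into ./pretrain on the first call; afterwards the file can be used directly.
model, preprocess = clip.load("ViT-B/16", download_root="./pretrain")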
The replacement suggested above (commenting out the if flag: line) does not help, because the last decoder layer has only one output channel, which is not divisible by the group number 8. The exception occurs at https://github.com/Seunggu0305/VLCounter/blob/df198668d977c0afe9ca09c8c767f2f125aabf5c/tools/models/VLCounter.py#L85 with "ValueError: num_channels must be divisible by num_groups".
Maybe the group number should be 1 at the last layer, like this:
if flag:
    self.gn = nn.GroupNorm(8, out_channels)
else:
    self.gn = nn.GroupNorm(1, out_channels)
self.gelu = nn.GELU()
self.up = nn.UpsamplingBilinear2d(scale_factor=2)
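For anyone hitting the same ValueError, a standalone way to check the constraint (the tensor shape below is only illustrative):

import torch
import torch.nn as nn

x = torch.randn(2, 1, 8, 8)   # the last decoder layer outputs a single channel
# nn.GroupNorm(8, 1)          # raises ValueError: num_channels must be divisible by num_groups
y = nn.GroupNorm(1, 1)(x)     # one group is always valid, whatever the channel count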
The reported results can be reproduced with the num_groups=1 modification: 'MAE': 16.951104744592634, 'RMSE': 106.03263784390961
It's great to hear from you, and I'll try to follow up on the changes as you described.