CLIP4Clip
Large performance drop if trained with fp32.
Hi authors, thanks for your great work! In the file module_clip.py @ L557:
convert_weights(model)
model.load_state_dict(state_dict)
return model.eval()
If I remove convert_weights, the model only achieves an accuracy of ~40%; if convert_weights is kept, I can reach ~43%.
Do you know why this happens, and is there a way to train without convert_weights while still reaching ~43%? Thanks a lot!
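For context, module_clip's convert_weights appears to be inherited from OpenAI's CLIP code: it casts most parameters to fp16 in place, which is why removing it changes the numerics. Roughly (my paraphrase of the upstream function, not the exact CLIP4Clip source):

```python
import torch.nn as nn

def convert_weights(model: nn.Module):
    """Cast applicable parameters to fp16 in place (paraphrased from OpenAI CLIP)."""
    def _convert_weights_to_fp16(l):
        # conv / linear layers: cast weight and bias
        if isinstance(l, (nn.Conv1d, nn.Conv2d, nn.Linear)):
            l.weight.data = l.weight.data.half()
            if l.bias is not None:
                l.bias.data = l.bias.data.half()
        # attention projection weights / biases
        if isinstance(l, nn.MultiheadAttention):
            for attr in ["in_proj_weight", "q_proj_weight", "k_proj_weight",
                         "v_proj_weight", "in_proj_bias", "bias_k", "bias_v"]:
                tensor = getattr(l, attr, None)
                if tensor is not None:
                    tensor.data = tensor.data.half()
        # projection matrices stored as plain parameters
        for name in ["text_projection", "proj"]:
            if hasattr(l, name):
                attr = getattr(l, name)
                if attr is not None:
                    attr.data = attr.data.half()
    model.apply(_convert_weights_to_fp16)
```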
The reason I want to remove convert_weights is that it causes problems when I do post-pretraining on millions of videos with CLIP: with convert_weights, the loss becomes NaN at some point during training. If I train with FP32 or AMP there is no such issue, but FP32/AMP training ends up ~3% lower in accuracy than FP16 (convert_weights).
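In case it helps anyone hitting the same NaN issue, here is a minimal AMP training-step sketch using torch.cuda.amp (model, optimizer, dataloader, and compute_loss are placeholders, not names from this repo). AMP keeps master weights in fp32 and applies dynamic loss scaling, which is why it avoids the pure-fp16 NaNs:

```python
import torch

scaler = torch.cuda.amp.GradScaler()            # dynamic loss scaling

for batch in dataloader:                        # placeholder dataloader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():             # fp16 where safe, fp32 elsewhere
        loss = compute_loss(model, batch)       # placeholder loss function
    scaler.scale(loss).backward()               # scale loss so fp16 grads don't underflow
    scaler.step(optimizer)                      # unscales grads; skips step on inf/NaN
    scaler.update()                             # adjust the scale factor
```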
[Attached training logs: meanP; seqTransf (Transf.txt)]
Sorry to bother you. I ran the code directly, but the loss became NaN because of some corrupted videos (the provided code handles this by setting the video tensor to zeros). After I modified the video-processing code, I can only get 42.3% for meanP and 43.9% for seqTransf, and I don't know why.
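For anyone else reading, a minimal sketch of the zero-fill fallback described above (decode_and_sample and the tensor layout are assumptions, not the repo's actual dataloader code):

```python
import numpy as np

def load_video_frames(path, max_frames, height=224, width=224):
    """Load sampled frames; return an all-zero clip if the video is unreadable."""
    try:
        frames = decode_and_sample(path, max_frames)  # hypothetical decoder helper
    except Exception:
        frames = None
    if frames is None or len(frames) == 0:
        # corrupted/missing video: zeros keep the batch shape valid
        # instead of propagating NaNs into the loss
        frames = np.zeros((max_frames, 3, height, width), dtype=np.float32)
    return frames
```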
How did you get ~43%? Did you modify the data-processing code?