CLIP4Clip
Large performance drop if trained with fp32.
Hi authors, thanks for your great work! In the file module_clip.py @ L557:
convert_weights(model)
model.load_state_dict(state_dict)
return model.eval()
If I remove convert_weights, the model only achieves an accuracy of ~40%; if convert_weights is kept, I can reach ~43%.
Do you know why this happens, and is there a way to train without convert_weights while still reaching ~43%? Thanks a lot!
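For context, module_clip's convert_weights appears to be inherited from OpenAI's CLIP code: it casts most parameters to fp16 in place, which is why removing it changes the numerics. Roughly (my paraphrase of the upstream function, not the exact CLIP4Clip source):

```python
import torch.nn as nn

def convert_weights(model: nn.Module):
    """Cast applicable parameters to fp16 in place (paraphrased from OpenAI CLIP)."""
    def _convert_weights_to_fp16(l):
        # conv / linear layers: cast weight and bias
        if isinstance(l, (nn.Conv1d, nn.Conv2d, nn.Linear)):
            l.weight.data = l.weight.data.half()
            if l.bias is not None:
                l.bias.data = l.bias.data.half()
        # attention projection weights / biases
        if isinstance(l, nn.MultiheadAttention):
            for attr in ["in_proj_weight", "q_proj_weight", "k_proj_weight",
                         "v_proj_weight", "in_proj_bias", "bias_k", "bias_v"]:
                tensor = getattr(l, attr, None)
                if tensor is not None:
                    tensor.data = tensor.data.half()
        # projection matrices stored as plain parameters
        for name in ["text_projection", "proj"]:
            if hasattr(l, name):
                attr = getattr(l, name)
                if attr is not None:
                    attr.data = attr.data.half()
    model.apply(_convert_weights_to_fp16)
```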
The reason I want to remove convert_weights is that it causes problems when I do post-pretraining on millions of videos with CLIP: with convert_weights, the loss becomes NaN at some point during training. If I train with FP32 or AMP there is no such issue, but FP32/AMP training ends up ~3% lower in accuracy than FP16 (convert_weights).
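In case it helps anyone hitting the same NaN issue, here is a minimal AMP training-step sketch using torch.cuda.amp (model, optimizer, dataloader, and compute_loss are placeholders, not names from this repo). AMP keeps master weights in fp32 and applies dynamic loss scaling, which is why it avoids the pure-fp16 NaNs:

```python
import torch

scaler = torch.cuda.amp.GradScaler()            # dynamic loss scaling

for batch in dataloader:                        # placeholder dataloader
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():             # fp16 where safe, fp32 elsewhere
        loss = compute_loss(model, batch)       # placeholder loss function
    scaler.scale(loss).backward()               # scale loss so fp16 grads don't underflow
    scaler.step(optimizer)                      # unscales grads; skips step on inf/NaN
    scaler.update()                             # adjust the scale factor
```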
[Attached training logs: meanP; seqTransf (Transf.txt)]
Sorry to bother you. I ran the code directly, but the loss became NaN because of some corrupted videos (the provided code handles this by setting the video tensor to zeros). After I modified the video-processing code, I can only get 42.3% for meanP and 43.9% for seqTransf, and I don't know why.
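For anyone else reading, a minimal sketch of the zero-fill fallback described above (decode_and_sample and the tensor layout are assumptions, not the repo's actual dataloader code):

```python
import numpy as np

def load_video_frames(path, max_frames, height=224, width=224):
    """Load sampled frames; return an all-zero clip if the video is unreadable."""
    try:
        frames = decode_and_sample(path, max_frames)  # hypothetical decoder helper
    except Exception:
        frames = None
    if frames is None or len(frames) == 0:
        # corrupted/missing video: zeros keep the batch shape valid
        # instead of propagating NaNs into the loss
        frames = np.zeros((max_frames, 3, height, width), dtype=np.float32)
    return frames
```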
How did you get ~43%? Did you modify the data-processing code?