Can't reproduce the performance for "personality captions" task
Bug description
Dear ParlAI team, first I want to thank you for providing the personality captions dataset and related code. I really appreciate the effort of making everything publicly available and providing detailed documentation.
Here is what I encountered. I want to reproduce the evaluation results provided on this page. The accuracy I got running the provided command is 0.0122, which is far from the reported value (0.5113).
Can someone please help me check if there is anything I did wrong or was missing? Thank you very much!
Reproduction steps
- I cloned the ParlAI repo to my local machine.
- I downloaded train.json, val.json, and test.json using the following command:
python parlai/scripts/display_data.py -t personality_captions
- Because there are 55 missing images from the yfcc dataset, I removed the data corresponding to the missing images from train.json, val.json, and test.json. This resulted in 9,997 examples in test.json instead of 10,000.
- I ran the following command provided on this page, adding the yfcc path argument. The command first downloaded the transresnet model, which should be able to reproduce the reported result.
parlai eval_model \
-bs 128 -t personality_captions \
-mf models:personality_captions/transresnet/model \
--yfcc-path my_path \
--num-test-labels 5 -dt test
Expected behavior
Expected accuracy should match the one reported on this page:
{'exs': 10000, 'accuracy': 0.5113, 'f1': 0.5951, 'hits@1': 0.511, 'hits@5': 0.816,
'hits@10': 0.903, 'hits@100': 0.998, 'bleu': 0.4999, 'hits@1/100': 1.0,
'loss': -0.002, 'med_rank': 1.0}
Logs
Here is the result I got:
14:15:50 | Finished evaluating tasks ['personality_captions'] using datatype test
     accuracy  bleu-4   exs     f1  hits@1  hits@10  hits@100  hits@5  hits@1/100      loss  med_rank  precision  recall
all     .0122  .01179  9997  .1585   .0122   .09863     .5661  .05422           1  -.007902        83      .1898   .1601
You may need to set --image-mode resnet152 when evaluating.
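For example, something along these lines (your command from above, with the image mode flag added) should do it:
# same evaluation command, with the image features specified
parlai eval_model \
-bs 128 -t personality_captions \
-mf models:personality_captions/transresnet/model \
--image-mode resnet152 \
--yfcc-path my_path \
--num-test-labels 5 -dt test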
@klshuster Thank you very much for your answer! Yes I added --image-mode resnet152 and now the results look very close to the reported ones.
I do have one more question if you could kindly help on it.
I ran the train_model.py command and the accuracy I got on the test set is .2874 (it was .5165 when I evaluated the downloaded model before fine-tuning). I noticed that the weights of the transformers didn't change and only the other layers got updated, which is expected. The data is still yfcc, so this fine-tuning didn't really feed the model any new data. Still, I don't quite understand why the accuracy dropped this much. I am wondering if you have any idea why this happened and how I should fix it. Thank you very much for your help! I really appreciate it.
The command I used:
python parlai/scripts/train_model.py --task personality_captions \
--model-file models:personality_captions/transresnet/model \
--dict-file zoo:personality_captions/transresnet/model.dict \
--image-mode resnet152 --batchsize 500 \
--embedding-type fasttext_cc \
--yfcc-path my_path \
--validation-every-n-epochs 1 --validation-patience 1 \
--validation-metric accuracy --validation-metric-mode max
The results I got after 3 epochs of training:
test:
     accuracy  bleu-4   exs     f1  hits@1  hits@10  hits@100  hits@5  hits@1/100      loss  med_rank  precision  recall
all     .2874   .2754  9997  .3284   .2874    .6920         1   .5635           1  -.002001         4      .3306   .3356
The released model uses a pre-trained text encoder (see Section 4.3 of the paper), while the model you are training is trained from scratch.
@klshuster Thank you for your answer!
That's what I thought originally, but I found that self.text_encoder_frozen = True during fine-tuning, and the params of the Transformer part of the fine-tuned model are the same as in the released model. So it is actually using the pre-trained transformer rather than training from scratch.
Oh, I think you should actually remove and re-download the model.
When fine-tuning, the --model-file is where you save the model checkpoint; you would want to use --init-model to indicate that you want to initialize with the model weights. It's possible that you've actually overwritten the pre-trained model with your saved model.
@klshuster Thank you for your advice. Actually, I always delete the saved model and let the command re-download a new released model before training.
According to https://github.com/facebookresearch/ParlAI/blob/9f7fc0be0f2f618f37ed2b493ce8e7892c729875/projects/personality_captions/transresnet/transresnet.py#L134-L147, both --init-model and --model-file are used to load a previous model. So I guess the difference is that the checkpoint will overwrite the model in --model-file instead of the one in --init-model?
However, I am not sure whether that explains why I can't reproduce the reported performance with the model I fine-tuned. Regardless of whether the fine-tuned model overwrites the pre-trained one or not, since it's loaded from a good model, it should still be able to reach decent performance, right?
You are correct about the difference between the two - we generally use --model-file for inference, and --init-model for training (so as not to overwrite the original checkpoint).
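For example, a fine-tuning command along these lines would leave the zoo checkpoint untouched (just a sketch of your command above with --init-model added; the new --model-file path is a placeholder, and you may also need to pass the agent explicitly via --model since the new model file doesn't exist yet):
# initialize from the released weights, save the fine-tuned checkpoint elsewhere
python parlai/scripts/train_model.py --task personality_captions \
--init-model models:personality_captions/transresnet/model \
--dict-file zoo:personality_captions/transresnet/model.dict \
--model-file /tmp/transresnet_finetuned \
--image-mode resnet152 --batchsize 500 \
--embedding-type fasttext_cc \
--yfcc-path my_path \
--validation-every-n-epochs 1 --validation-patience 1 \
--validation-metric accuracy --validation-metric-mode max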
In any event, why would you want to fine-tune the already pre-trained checkpoint, given that the model already achieves good performance?
@klshuster that's a good question :) At first I just wanted to make sure that I understood the model architecture by rerunning it. Now I am purely curious about the reason for the suboptimal performance. I totally agree that "fine-tuning an already pre-trained checkpoint" is pointless, but I am still wondering which factor is causing the performance to drop drastically instead of improving.
You are not specifying --num-test-labels 5 in your train script, and thus are evaluating on a harder test set.
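You could either add that flag to the train command, or re-evaluate the saved fine-tuned checkpoint with something like this (the model path below is just a placeholder for wherever your fine-tuned model was saved):
# re-evaluate the fine-tuned checkpoint on the 5-candidate test set
parlai eval_model \
-bs 128 -t personality_captions \
-mf /tmp/transresnet_finetuned \
--image-mode resnet152 \
--yfcc-path my_path \
--num-test-labels 5 -dt test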
@klshuster Yes, you are completely right! Sorry for missing this obvious argument. Now I can reproduce the results as follows:
16:52:45 | test:
     accuracy  bleu-4   exs     f1  hits@1  hits@10  hits@100  hits@5  hits@1/100      loss  med_rank  precision  recall
all     .5163   .5047  9997  .5989   .5164    .9080     .9982   .8225           1  -.002001         1      .6030   .6123
Thank you very much for your patience and help! I really appreciate it.