
Can't reproduce the performance for "personality captions" task

Open tj-zhu opened this issue 3 years ago • 7 comments

Bug description

Dear ParlAI team, first I want to thank you for providing the personality captions dataset and related code. I really appreciate the effort of making everything publicly available and putting together detailed documentation.

Here is what I encountered: I want to reproduce the evaluation results provided on this page, but the accuracy I got running the provided command is 0.0122, which is far from the reported value (0.5113).

Can someone please help me check if there is anything I did wrong or was missing? Thank you very much!

Reproduction steps

  1. I cloned the ParlAI repo to my local machine.
  2. I downloaded train.json, val.json and test.json with the following command: python parlai/scripts/display_data.py -t personality_captions
  3. Because 55 images are missing from the yfcc dataset, I removed the entries corresponding to the missing images from train.json, val.json and test.json. This left 9997 examples in test.json instead of 10000. (A rough sketch of this filtering step follows the command below.)
  4. I ran the following command provided on this page, adding the yfcc path argument. The command first downloaded the transresnet model, which should be able to reproduce the reported result.
parlai eval_model \
      -bs 128 -t personality_captions \
      -mf models:personality_captions/transresnet/model \
      --yfcc-path my_path \
      --num-test-labels 5 -dt test
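
To be concrete about step 3, here is a rough sketch of the filtering I mean. It is only an illustration: the image_hash field name, the flat <hash>.jpg layout under the --yfcc-path directory, and each split file being a JSON list of examples are assumptions about the data format, so it may need adjusting to a local copy.

python - <<'EOF'
import json
import os

YFCC_PATH = "my_path"  # same directory passed as --yfcc-path (placeholder)

for split in ("train", "val", "test"):
    with open(f"{split}.json") as f:
        data = json.load(f)
    # keep only examples whose image file actually exists on disk
    kept = [
        ex for ex in data
        if os.path.isfile(os.path.join(YFCC_PATH, ex["image_hash"] + ".jpg"))
    ]
    print(split, len(data), "->", len(kept))
    with open(f"{split}.json", "w") as f:
        json.dump(kept, f)
EOF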

Expected behavior

The expected accuracy should match the one reported on this page.

{'exs': 10000, 'accuracy': 0.5113, 'f1': 0.5951, 'hits@1': 0.511, 'hits@5': 0.816,
  'hits@10': 0.903, 'hits@100': 0.998, 'bleu': 0.4999, 'hits@1/100': 1.0,
  'loss': -0.002, 'med_rank': 1.0}

Logs

Here is the result I got:

14:15:50 | Finished evaluating tasks ['personality_captions'] using datatype test
          accuracy  bleu-4  exs   f1     hits@1  hits@1/100  hits@10  hits@100  hits@5  loss      med_rank  precision  recall
   all    .0122     .01179  9997  .1585  .0122   1           .09863   .5661     .05422  -.007902  83        .1898      .1601

tj-zhu avatar Sep 14 '22 05:09 tj-zhu

You may need to set --image-mode resnet152 when evaluating

klshuster avatar Sep 14 '22 15:09 klshuster

@klshuster Thank you very much for your answer! Yes, I added --image-mode resnet152 and now the results look very close to the reported ones.
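
For reference, here is the eval command from above with the flag added:

parlai eval_model \
      -bs 128 -t personality_captions \
      -mf models:personality_captions/transresnet/model \
      --yfcc-path my_path \
      --num-test-labels 5 -dt test \
      --image-mode resnet152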

I do have one more question if you could kindly help on it.

I ran the train_model.py command and the accuracy I got on the test set is .2874 (it was .5165 when I evaluated the downloaded model before fine-tuning). I noticed the weights of the transformers didn't change and only the other layers got updated, which is expected. The data is still yfcc, so this fine-tuning didn't really feed the model any new data. Still, I don't quite understand why the accuracy dropped this much. I am wondering if you have any idea why this happened and how I should fix it. Thank you very much for your help! I really appreciate it.

The command I used:

python parlai/scripts/train_model.py --task personality_captions \
    --model-file models:personality_captions/transresnet/model \
    --dict-file zoo:personality_captions/transresnet/model.dict \
    --image-mode resnet152 --batchsize 500 \
    --embedding-type fasttext_cc \
    --yfcc-path my_path \
    --validation-every-n-epochs 1 --validation-patience 1 --validation-metric accuracy --validation-metric-mode max

The results I got after 3 epochs of training:

 test:
          accuracy  bleu-4  exs   f1     hits@1  hits@1/100  hits@10  hits@100  hits@5  loss      med_rank  precision  recall
   all    .2874     .2754   9997  .3284  .2874   1           .6920    1         .5635   -.002001  4         .3306      .3356

tj-zhu avatar Sep 15 '22 11:09 tj-zhu

The released model uses a pre-trained text encoder (see section 4.3 of the paper), while the model you are training is from scratch.

klshuster avatar Sep 16 '22 16:09 klshuster

@klshuster Thank you for your answer! That's what I thought originally, but I found that self.text_encoder_frozen = True during fine-tuning, and the parameters of the Transformer part of the fine-tuned model are the same as in the released model. So it is actually using the pre-trained transformer rather than training from scratch.

tj-zhu avatar Sep 17 '22 07:09 tj-zhu

Oh, I think you should actually remove and re-download the model.

When fine-tuning, the --model-file is where you save the model checkpoint; you would want to use --init-model to indicate that you want to initialize with the model weights. It's possible that you've actually overwritten the pre-trained model with your saved model.

klshuster avatar Sep 21 '22 20:09 klshuster

@klshuster Thank you for your advice. Actually, I always delete the saved model and let the command re-download a fresh copy of the released model before training.

According to https://github.com/facebookresearch/ParlAI/blob/9f7fc0be0f2f618f37ed2b493ce8e7892c729875/projects/personality_captions/transresnet/transresnet.py#L134-L147, both --init-model and --model-file are used to load a previous model. So I guess the difference is that the checkpoint will overwrite the model in --model-file instead of the one in --init-model?

However, I am not sure that this explains why I can't reproduce the reported performance with the model I fine-tuned. Regardless of whether the fine-tuned model overwrites the pre-trained one or not, since it is initialized from a good model, it should still reach decent performance, right?

tj-zhu avatar Sep 22 '22 03:09 tj-zhu

You are correct about the difference between the two: we generally use --model-file for inference, and --init-model for training (so as not to overwrite the original checkpoint).

In any event, why would you want to fine-tune the already pre-trained checkpoint, since the model already achieves good performance?

klshuster avatar Sep 22 '22 16:09 klshuster

@klshuster that's a good question :) At first I just wanted to make sure that I understood the model architecture by rerunning it. Now I am purely curious about the reason for the suboptimal performance. I totally agree that fine-tuning an already pre-trained checkpoint is pointless, but I am still wondering which factor is causing the performance to drop drastically instead of improving.

tj-zhu avatar Sep 27 '22 02:09 tj-zhu

You are not specifying --num-test-labels 5 in your train script, and thus are evaluating on a harder test set

klshuster avatar Sep 27 '22 13:09 klshuster

@klshuster Yes, you are completely right! Sorry for missing this obvious argument. Now I can reproduce the results as follows:

16:52:45 | test:
          accuracy  bleu-4  exs   f1     hits@1  hits@1/100  hits@10  hits@100  hits@5  loss      med_rank  precision  recall
   all    .5163     .5047   9997  .5989  .5164   1           .9080    .9982     .8225   -.002001  1         .6030      .6123
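
For completeness, the adjusted command looks roughly like this (a sketch that combines the earlier train command with --num-test-labels 5 and the --init-model advice above; the save path below is just a placeholder):

python parlai/scripts/train_model.py --task personality_captions \
    --init-model models:personality_captions/transresnet/model \
    --model-file /tmp/transresnet_finetune/model \
    --dict-file zoo:personality_captions/transresnet/model.dict \
    --image-mode resnet152 --batchsize 500 \
    --embedding-type fasttext_cc \
    --yfcc-path my_path \
    --num-test-labels 5 \
    --validation-every-n-epochs 1 --validation-patience 1 --validation-metric accuracy --validation-metric-mode max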

Thank you very much for your patience and help! I really appreciate it.

tj-zhu avatar Oct 03 '22 07:10 tj-zhu