
Finetuning on InfographicVQA

Open Caixin89 opened this issue 1 year ago • 12 comments

I was unable to achieve the result shown in the UDOP paper.

I used the udop-unimodel-large-224 checkpoint.

My ANLS score is 0.407903. This is nowhere near the 0.461 shown in the table below, taken from the paper.

[image: Table 8 from the UDOP paper]

Since I noticed that the batch size, warmup steps, and weight decay given in https://github.com/microsoft/i-Code/blob/main/i-Code-Doc/scripts/finetune_duebenchmark.sh differ from those reported in the paper, I also tried changing the finetuning script to use the paper's settings.

[image: the paper's finetuning hyperparameter settings]

Lastly, I also tried adding the task prompt prefix, since the existing code does not do so. I followed the approach from https://github.com/microsoft/i-Code/issues/71#issuecomment-1623201208
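For reference, the change amounts to something like this (a minimal sketch; the exact prefix string and the `add_task_prefix` name are my own illustration, not code from the repo or the linked comment):

```python
def add_task_prefix(question: str,
                    task: str = "question answering on InfographicVQA") -> str:
    """Prepend a task prompt to the question before tokenization.

    The prefix string here is illustrative only; the actual wording
    used in issue #71 may differ.
    """
    return f"{task}. {question}"
```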

Results of the 3 different finetuning configurations:

| Task prefix | Hyperparameter settings | ANLS score |
| --- | --- | --- |
| No | Unchanged finetuning script | 0.407903 |
| No | Paper's settings | 0.40174 |
| Yes | Unchanged finetuning script | 0.408355 |

Other changes I made:

  • Changed to PyTorch's AdamW, based on https://github.com/microsoft/i-Code/issues/63#issuecomment-1608019905

    Within baselines-master in the due-benchmark repo:

    • Applied the fix from https://github.com/due-benchmark/baselines/issues/7#issue-1638167863
    • In baselines-master/benchmarker/data/utils.py, I changed the dtype of label_name from U100 to U1024 to prevent questions from being truncated during display
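For context, the relevant difference is that torch.optim.AdamW applies weight decay directly to the parameters, decoupled from the gradient-based Adam step. A minimal scalar sketch of one such update (my own illustration, not the repo's or PyTorch's actual code):

```python
import math

def adamw_step(param, grad, state, lr=1e-4, betas=(0.9, 0.999),
               eps=1e-8, weight_decay=0.01):
    """One scalar AdamW update (Loshchilov & Hutter style).

    Weight decay shrinks the parameter directly, decoupled from the
    Adam moment-based step -- the behaviour of torch.optim.AdamW.
    `state` holds the step count and the two moment estimates.
    """
    state["t"] += 1
    b1, b2 = betas
    state["m"] = b1 * state["m"] + (1 - b1) * grad          # first moment
    state["v"] = b2 * state["v"] + (1 - b2) * grad * grad   # second moment
    m_hat = state["m"] / (1 - b1 ** state["t"])             # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    param -= lr * weight_decay * param                      # decoupled decay
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)          # Adam step
    return param
```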

Please assist

Caixin89 avatar Feb 16 '24 02:02 Caixin89

May I know if the results shown in Table 8 above are validation-set or test-set scores?

Caixin89 avatar Feb 16 '24 02:02 Caixin89

Table 8 shows validation results. May I know how many epochs you trained the model for, and which checkpoint you used?

zinengtang avatar Feb 17 '24 08:02 zinengtang

4 epochs for the 2 runs that used the unchanged finetuning script; 5 epochs when I changed the finetuning script to the paper's settings.

The number of epochs is decided automatically based on early_stopping_patience=20.
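Roughly, the patience logic works like this (a minimal sketch in the spirit of HuggingFace's EarlyStoppingCallback; the class name is illustrative, not the actual trainer code):

```python
class EarlyStopping:
    """Stop once the validation metric has failed to improve for
    `patience` consecutive evaluations (here: lower loss is better)."""

    def __init__(self, patience: int = 20, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def step(self, val_loss: float) -> bool:
        """Record one evaluation; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```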

Caixin89 avatar Feb 19 '24 02:02 Caixin89

I am assuming you are using the last checkpoint the run generated rather than an intermediate checkpoint? If so, try using more epochs. If it still doesn't work, I will provide a finetuned checkpoint to see whether the issue is in the evaluation script.

zinengtang avatar Feb 19 '24 09:02 zinengtang

Sure, I can try that. In the meantime, could you share the number of epochs you used for finetuning?

Caixin89 avatar Feb 20 '24 01:02 Caixin89

[image: plot of validation loss against training steps]

The above is a plot of the validation loss against training steps. The validation loss increases consistently across training steps.

Is this expected?

Caixin89 avatar Feb 20 '24 04:02 Caixin89


May I ask how you implemented the ANLS metric for the task?

Pietro1999IT avatar Feb 27 '24 09:02 Pietro1999IT


> May I ask how you implemented the ANLS metric for the task?

It should be in this repo: https://github.com/due-benchmark/evaluator/tree/master

yuanzheng625 avatar Feb 27 '24 18:02 yuanzheng625

Yes, I used ANLS from https://github.com/due-benchmark/evaluator/tree/master.
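For reference, ANLS scores each prediction by its best normalized Levenshtein similarity over the gold answers, zeroing any similarity whose normalized distance reaches the 0.5 threshold. A minimal sketch of the metric (my own re-implementation for illustration, not the evaluator repo's code):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(prediction: str, gold_answers, tau: float = 0.5) -> float:
    """ANLS for one question: best similarity over the gold answers,
    where similarities with normalized distance >= tau score zero."""
    best = 0.0
    for gold in gold_answers:
        p, g = prediction.strip().lower(), gold.strip().lower()
        if not p and not g:
            s = 1.0
        else:
            nl = levenshtein(p, g) / max(len(p), len(g))
            s = 1.0 - nl if nl < tau else 0.0
        best = max(best, s)
    return best
```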

Caixin89 avatar Mar 04 '24 02:03 Caixin89


I have tried with 10 epochs and my ANLS is still ~0.41. Am I supposed to finetune for even more epochs?

Could you provide me with your finetuned checkpoint?

Caixin89 avatar Mar 04 '24 02:03 Caixin89

Also, I would like to double-check that the 46.1 ANLS score is indeed based on finetuning the udop-unimodel-large-224 checkpoint, without additional supervised pre-training.

Correct?

Caixin89 avatar Mar 04 '24 03:03 Caixin89


@zinengtang Any updates?

Caixin89 avatar Apr 08 '24 04:04 Caixin89