
LayoutLMv3 DocVQA: How was training kept stable for 100k, 200k steps (over 300 epochs!)

redthing1 opened this issue 1 year ago • 5 comments

Hello,

Finetuning on DocVQA

I am doing some research on VQA, and I have been finetuning LayoutLMv3 on DocVQA.

Here is the description from the paper on how LayoutLMv3 was finetuned on DocVQA:

[screenshot: the DocVQA finetuning description from the LayoutLMv3 paper]

How I am finetuning on DocVQA

Preparation and scripts

Following this, I wrote a training script: https://github.com/redthing1/layoutlm_experiments/blob/main/llm_tests/llm_tests/train_docvqa.py

This is the code I am using to preprocess the data: https://github.com/redthing1/layoutlm_experiments/blob/main/llm_tests/llm_tests/prep_docvqa_data.py. I am collecting my OCR data with the Microsoft Read API, just like the authors, and then applying a nearby-combination algorithm to merge the bounding boxes of segments, as sketched below.
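For reference, the nearby-combination step is essentially a greedy merge of word boxes that sit on the same line and are horizontally close. A minimal sketch of the idea (not the exact code from the script above; the `x_gap` and `y_overlap` thresholds are illustrative and would be tuned against the OCR output):

```python
def merge_nearby_boxes(boxes, x_gap=10, y_overlap=0.5):
    """Greedily merge word boxes (x0, y0, x1, y1) that sit on the same line
    and are horizontally close, producing segment-level boxes."""
    merged = []
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        for i, seg in enumerate(merged):
            # vertical overlap ratio, relative to the shorter of the two boxes
            overlap = min(box[3], seg[3]) - max(box[1], seg[1])
            min_h = min(box[3] - box[1], seg[3] - seg[1])
            same_line = min_h > 0 and overlap / min_h >= y_overlap
            # horizontal gap (on either side) must be within x_gap pixels
            close = box[0] - seg[2] <= x_gap and seg[0] - box[2] <= x_gap
            if same_line and close:
                merged[i] = (min(seg[0], box[0]), min(seg[1], box[1]),
                             max(seg[2], box[2]), max(seg[3], box[3]))
                break
        else:
            merged.append(tuple(box))
    return merged

# e.g. two adjacent words on one line collapse into a single segment box
print(merge_nearby_boxes([(10, 5, 40, 20), (45, 6, 80, 21)]))  # [(10, 5, 80, 21)]
```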

And I run my training with the same parameters as specified in the paper.

I ran this training process at least 10 times for every set of parameters.

The official partition of the DocVQA dataset consists of 10,194/1,286/1,287 images and 39,463/5,349/5,188 questions.

Note that, based on this, 100,000 steps means 39,463 questions / 128 batch size ≈ 308.3 steps per epoch, so in total 100,000 / 308.3 ≈ 324.35 epochs.
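The same arithmetic in plain Python, just to make the epoch count explicit (the batch size and step count are the values from the paper; the question count is the official train split):

```python
train_questions = 39_463   # official DocVQA train split
batch_size = 128           # effective batch size from the paper
total_steps = 100_000      # finetuning steps from the paper

steps_per_epoch = train_questions / batch_size   # ~308.3
epochs = total_steps / steps_per_epoch           # ~324.35
print(f"{steps_per_epoch:.1f} steps/epoch -> {epochs:.1f} epochs")
```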

Training results

When run on the base model, this is what my results look like, over and over (showing just my best run for clarity):

[plot: training and eval loss curves for the base-model run]

The best eval loss comes at around 7k steps, and the model overfits after that (on other runs I trained up to about 30,000 steps, watching the eval loss keep increasing, before giving up).

I computed the ANLS metric and got 0.746, which is lower than the paper's number but at least within reasonable range of it.
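For reference, ANLS here is the usual DocVQA metric: per question, take the best normalized Levenshtein similarity against any gold answer, zero it out below a 0.5 threshold, then average. A minimal, self-contained sketch (the exact normalization may differ slightly from the evaluation code in my repo):

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def anls(predictions, gold_answers, tau=0.5):
    """Average the per-question scores; each score is the best similarity
    against any gold answer, clamped to 0 below the threshold tau."""
    scores = []
    for pred, golds in zip(predictions, gold_answers):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            best = max(best, 1.0 - nl)
        scores.append(best if best >= tau else 0.0)
    return sum(scores) / len(scores)

print(anls(["den haag"], [["Den Haag", "The Hague"]]))  # 1.0
```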

Later, I ran a similar experiment on the large model, again using the hyperparameters from the paper:

[plot: training and eval loss curves for the large-model run]

This run is incomplete, but we can see that it appears to begin overfitting at just around 11k steps.

My question

My key question is this: how did you manage to keep training stable and still improving for 200k steps?

redthing1 avatar Jul 24 '22 00:07 redthing1

Thank you for the question and the detailed description. There is no necessary relationship between the evaluation metrics and the evaluation loss (see some explanations). For example, when fine-tuning on FUNSD, the evaluation loss goes up after 150 steps, but all metrics keep improving. You can further adjust the hyperparameters according to what you observe in your experiments.
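As a concrete illustration of the idea (not the exact fine-tuning setup): if you fine-tune with the Hugging Face Trainer, you can select checkpoints by the task metric rather than by the evaluation loss. The "anls" key below is whatever your own compute_metrics function returns; it is not provided by the library.

```python
from transformers import TrainingArguments

# Pick checkpoints by ANLS rather than by eval loss: "eval_anls" must match
# a key returned by your own compute_metrics function.
args = TrainingArguments(
    output_dir="layoutlmv3-docvqa",
    evaluation_strategy="steps",
    eval_steps=1000,
    save_strategy="steps",
    save_steps=1000,
    load_best_model_at_end=True,
    metric_for_best_model="eval_anls",
    greater_is_better=True,
)
```

With load_best_model_at_end=True, the checkpoint with the best eval_anls is restored at the end of training, regardless of what the loss curve does afterwards.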

HYPJUDY avatar Aug 03 '22 03:08 HYPJUDY

I also wonder what the performance would be if we just used the provided OCR?

allanj avatar Aug 03 '22 03:08 allanj

I think the performance would be worse using the less accurate OCR (the one provided). I do not have a quantitative comparison because I did not experiment with this.

HYPJUDY avatar Aug 03 '22 07:08 HYPJUDY

Thanks for the reply. Yeah, I'm also trying to benchmark the OCR that way; stay tuned here: https://github.com/allanj/LayoutLMv3-DocVQA

allanj avatar Aug 03 '22 08:08 allanj


> Thanks for the reply. Yeah, I'm also trying to benchmark the OCR that way; stay tuned here: https://github.com/allanj/LayoutLMv3-DocVQA

I tried it, and I can confirm that Tesseract OCR is very bad (~0.10 ANLS lower). The provided OCR is OK, but MSRead is still about 0.03 ANLS higher, so I recommend using MSRead. By the way, I have already spent the money to get the transcriptions; if you want them, let me know by emailing the address on my profile.

redthing1 avatar Aug 03 '22 16:08 redthing1

@redthing1 Thank you for reproducing the results on the DocVQA dataset. I would also like to do this work. Could you provide the scripts and transcriptions? Thank you very much!

minhoooo1 avatar Apr 17 '23 07:04 minhoooo1

> @redthing1 Thank you for reproducing the results on the DocVQA dataset. I would also like to do this work. Could you provide the scripts and transcriptions? Thank you very much!

I am happy to upload them if you can give me somewhere to put about 10 GB. You can contact me at [email protected].

redthing1 avatar Apr 17 '23 19:04 redthing1

I keep getting emails from people asking for my MSRead-processed OCR dataset, so for everyone's sake I will upload it for you all in the next few days.

redthing1 avatar May 19 '23 07:05 redthing1

You will have to use my training scripts here, because the data is preprocessed into Arrow format: https://github.com/redthing1/layoutlm_experiments#2-train-on-docvqa-data

But if you want, you can convert it back. I've run everything through MSRead, so it should be very similar to the dataset the authors used.

I will be uploading dataset archives soon.

redthing1 avatar May 19 '23 07:05 redthing1

Dataset archives: https://github.com/redthing1/layoutlm_experiments/releases/download/v0.1.0/docvqa_proc_20220808.tar.zst

This includes the MSRead-processed data, in both the seq2seq and the extractive formulation. You can load it with my script (which uses HF datasets) and convert it to whatever format you need. I don't have time to write a converter myself, but it should be easy, and my release includes all the MSRead outputs. It cost between $200 and $500.
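Roughly, loading it with HF datasets looks like this, assuming the splits were written with `save_to_disk` (the path is a placeholder; see the training-script README linked above for the exact usage):

```python
from datasets import load_from_disk

# Placeholder path: point this at the directory extracted from the archive.
ds = load_from_disk("docvqa_proc_20220808")

print(ds)              # shows the available splits and their features
example = ds["train"][0]
print(example.keys())  # inspect the preprocessed fields before converting
```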

redthing1 avatar May 19 '23 18:05 redthing1

Sorry that I haven't been able to reply to you today; I've been busy with work at my company. I had meant to reply to your first email today, but I was interrupted by work.

minhoooo1 avatar May 19 '23 18:05 minhoooo1

> Dataset archives: https://github.com/redthing1/layoutlm_experiments/releases/download/v0.1.0/docvqa_proc_20220808.tar.zst This includes the MSRead-processed data, in both the seq2seq and the extractive formulation. You can load it with my script (which uses HF datasets) and convert it to whatever format you need. I don't have time to write a converter myself, but it should be easy, and my release includes all the MSRead outputs. It cost between $200 and $500.

Thanks a lot!

volcano1995 avatar May 23 '23 01:05 volcano1995

> Dataset archives: https://github.com/redthing1/layoutlm_experiments/releases/download/v0.1.0/docvqa_proc_20220808.tar.zst This includes the MSRead-processed data, in both the seq2seq and the extractive formulation. You can load it with my script (which uses HF datasets) and convert it to whatever format you need. I don't have time to write a converter myself, but it should be easy, and my release includes all the MSRead outputs. It cost between $200 and $500.

Hi redthing, I am using the MSRead-processed data from the link above. I see there are train and val splits but no test split; could you add the test data? Thanks a lot!

volcano1995 avatar May 26 '23 07:05 volcano1995

I don't have the test data 😂 it is not released for obvious reasons.

redthing1 avatar May 26 '23 08:05 redthing1

> I don't have the test data 😂 it is not released for obvious reasons.

Sorry, I meant the MSRead OCR results for the test documents, not the QA annotations.

volcano1995 avatar May 26 '23 08:05 volcano1995