Reproducing Performance on DocVQA using LayoutLMv3/LayoutLMv2

allanj opened this issue 1 year ago • 3 comments

I tried my best to reproduce the results reported in the paper, which are about 78% ANLS on the test set. But all I get is 74% on the test set (73% on the validation set), which is still well below the reported number.

Could you share more details about how to obtain the reported number?

My repo: https://github.com/allanj/LayoutLMv3-DocVQA
Model I'm using: LayoutLMv3-base
OCR I use: Microsoft READ API (latest model version)

  1. I tried different matching mechanisms to find the answer spans.
  2. I tried a sliding-window approach (which did not really help; a rough sketch follows this list).
  3. I even followed the paper and used a batch size of 128 with 100k optimization steps (equivalent to roughly 300 epochs, which I don't think is necessary).
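For reference, the sliding-window chunking I tried looks roughly like this (a simplified sketch; `chunk_len` and `stride` are values I picked, not settings from the paper):

```python
def make_windows(tokens, boxes, chunk_len=510, stride=128):
    """Split a long token sequence into overlapping windows so documents
    beyond the 512-token limit are still fully covered. Each window starts
    `stride` tokens after the previous one, so consecutive windows overlap."""
    windows = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_len, len(tokens))
        windows.append({
            "tokens": tokens[start:end],
            "boxes": boxes[start:end],
            "offset": start,  # to map window-local predictions back to the document
        })
        if end == len(tokens):
            break
        start += stride
    return windows
```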

The best performance I can get with LayoutLMv3-base is about 73.3% on the validation set.

I also referred to the following issues, since I can't find a public codebase that reproduces the DocVQA results:

  1. https://github.com/NielsRogge/Transformers-Tutorials/issues/49
  2. https://github.com/microsoft/unilm/issues/616
  3. https://github.com/microsoft/unilm/issues/501
  4. https://github.com/microsoft/unilm/issues/282

I would appreciate it if the authors could give more suggestions/details about the experiments.

allanj avatar Aug 26 '22 09:08 allanj

There are several steps to experiment on DocVQA with the extractive method:

  1. Pre-processing.
     1.1 Get the text information of the documents using an OCR engine (e.g., Microsoft Read API).
     1.2 Find the start and end token-level positions of each answer in the text (e.g., with edit distance matching; see the sketch after this list).
  2. Model prediction.
     2.1 Train (with parameter tuning) and predict the start and end positions with models (e.g., LayoutLMv2/3).
  3. Post-processing.
     3.1 Reconstruct the answers from the text based on the predicted positions.
     3.2 Fix apparent errors (e.g., remove some whitespace and punctuation marks).
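A minimal sketch of step 1.2, assuming word-level OCR output (the window size and similarity threshold here are illustrative choices, not the exact settings we used):

```python
import difflib

def normalize(s):
    """Lowercase and strip surrounding whitespace/punctuation (step 3.2 applies
    a similar cleanup to the reconstructed answers)."""
    return s.lower().strip(" \t\n.,:;!?\"'()")

def find_answer_span(words, answer, max_window=10, min_ratio=0.8):
    """Return (start, end) word indices whose concatenation best matches
    `answer` by fuzzy string similarity, or None if nothing is close enough."""
    target = normalize(answer)
    best, best_ratio = None, min_ratio
    for i in range(len(words)):
        for j in range(i, min(i + max_window, len(words))):
            candidate = normalize(" ".join(words[i:j + 1]))
            ratio = difflib.SequenceMatcher(None, candidate, target).ratio()
            if ratio > best_ratio:
                best, best_ratio = (i, j), ratio
    return best
```

For example, `find_answer_span(["Total", "amount:", "$12.50"], "$12.50")` returns `(2, 2)`; the fuzzy ratio tolerates small OCR errors that an exact string match would miss.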

Each step has room for improvement, and it can be helpful to analyze and raise the upper bound step by step. For example, what is the ANLS score calculated using the answers found by your start and end positions? With perfect text from human annotations, the score should be close to 100; with good OCR results, it can be greater than 95.
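For reference, the ANLS metric itself can be computed as below (a standard formulation of Average Normalized Levenshtein Similarity with the 0.5 threshold from the DocVQA evaluation protocol; an illustrative sketch, not the official evaluation script):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def anls(predictions, gold_answers, tau=0.5):
    """Average Normalized Levenshtein Similarity over all questions.
    `gold_answers` holds a list of valid answers per question."""
    scores = []
    for pred, golds in zip(predictions, gold_answers):
        best = 0.0
        for gold in golds:
            p, g = pred.strip().lower(), gold.strip().lower()
            nl = levenshtein(p, g) / max(len(p), len(g), 1)
            best = max(best, 1 - nl if nl < tau else 0.0)
        scores.append(best)
    return sum(scores) / len(scores)
```

A prediction that exactly matches one gold answer scores 1, while anything whose normalized edit distance exceeds 0.5 contributes 0, so small OCR noise is tolerated but wrong answers are not.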

HYPJUDY avatar Sep 03 '22 08:09 HYPJUDY

Thanks. Is it possible to provide details about how you did this for the dataset? I think this is important for reproducing the performance and would better help the open-source community.

allanj avatar Sep 05 '22 03:09 allanj

@allanj I am trying to reproduce the result with the LayoutLMv2 model using your code but am getting the error below: RuntimeError: CUDA error: device-side assert triggered

The error occurs in the train_dataloader loop.
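For what it's worth, this is how I am trying to narrow it down; a device-side assert is often caused by an out-of-range index, e.g. token ids beyond the vocabulary or bounding-box coordinates outside the 0-1000 range LayoutLMv2 expects (the batch keys below are from my setup):

```python
import os
# Must be set before any CUDA work so the failing kernel raises synchronously
# with a usable stack trace instead of an asynchronous error.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Inspect a batch on CPU first: out-of-range indices are a common cause
# of device-side asserts.
batch = next(iter(train_dataloader))
print("max input id:", batch["input_ids"].max().item())
print("bbox range:", batch["bbox"].min().item(), batch["bbox"].max().item())
print("max start/end position:", batch["start_positions"].max().item(),
      batch["end_positions"].max().item())
```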

roburst2 avatar Sep 23 '22 10:09 roburst2