# Number of epochs for fine-tuning LayoutLMv2 on DocVQA
I'm trying to reproduce the results reported in the LayoutLMv2 paper (Table 6, row 7). Following this example, I fine-tuned the base model on the DocVQA train set for 20 epochs. The resulting model underperforms what's reported in the paper (roughly 40% of answers default to [CLS]). While I continue to debug my code, I'd like to know how many epochs were used to fine-tune the model in the original work.
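For reference, here is a minimal sketch of the kind of training step I'm running (the checkpoint name is the public Hugging Face one; the matching of the ground-truth answer against the OCR tokens to get start/end positions is omitted and replaced by placeholder arguments):

```python
import torch
from torch.optim import AdamW
from transformers import LayoutLMv2Processor, LayoutLMv2ForQuestionAnswering

# Public base checkpoint; by default the processor runs Tesseract OCR on the page image.
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForQuestionAnswering.from_pretrained("microsoft/layoutlmv2-base-uncased")
model.train()

optimizer = AdamW(model.parameters(), lr=5e-5)

def training_step(image, question, start_position, end_position):
    """One step over a single example (sketch).

    `image` is a PIL.Image of the document page and `question` a string.
    `start_position`/`end_position` are the token indices of the answer span,
    obtained by matching the ground-truth answer against the OCR tokens
    (that matching is the part omitted in this sketch).
    """
    encoding = processor(image, question, truncation=True, return_tensors="pt")
    outputs = model(
        **encoding,
        start_positions=torch.tensor([start_position]),
        end_positions=torch.tensor([end_position]),
    )
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```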
@ArmiNouri Which OCR tool did you use for DocVQA?
I used Tesseract, following the Colab notebook (which I realize is not the same OCR you used). If I switch to MS Read, should I expect to get the same results? Alternatively, if there is a version of LayoutLMv2 that has already been fine-tuned on DocVQA, could you make it available?
I have the same issue; I tried both Tesseract and DocTR. I would appreciate it if you could publish the pre-trained models, as you did for LayoutLM.
In case this helps someone, I adapted the Colab mentioned above into a training script for the full dataset. It can train on either Tesseract OCR or the dataset's own OCR (the processor-side difference is sketched below). It was written specifically to evaluate LayoutLMv2 on DocVQA with Tesseract and demonstrate the poor performance. It's definitely not optimized, but it should be fairly usable.
https://github.com/herobd/layoutlmv2
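On the processor side, the only real difference between the two OCR modes is whether the feature extractor runs Tesseract itself or you pass in words and boxes from another source (the dataset OCR, MS Read, DocTR, ...). Roughly, a sketch (the image path, question, and word/box values below are placeholders):

```python
from PIL import Image
from transformers import (
    LayoutLMv2FeatureExtractor,
    LayoutLMv2Processor,
    LayoutLMv2TokenizerFast,
)

tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")

# Placeholder inputs: a document page image, a question, and OCR output
# (words plus their bounding boxes normalized to a 0-1000 scale).
image = Image.open("page.png").convert("RGB")
question = "What is the invoice number?"
words = ["Invoice", "number:", "12345"]
boxes = [[70, 40, 180, 60], [190, 40, 300, 60], [310, 40, 380, 60]]

# Mode 1: let the processor run Tesseract on the image itself.
tesseract_processor = LayoutLMv2Processor(
    LayoutLMv2FeatureExtractor(apply_ocr=True), tokenizer
)
encoding = tesseract_processor(image, question, truncation=True, return_tensors="pt")

# Mode 2: reuse externally provided OCR (dataset OCR, MS Read, DocTR, ...):
# disable the built-in OCR and pass the words and boxes explicitly.
external_ocr_processor = LayoutLMv2Processor(
    LayoutLMv2FeatureExtractor(apply_ocr=False), tokenizer
)
encoding = external_ocr_processor(
    image, question, words, boxes=boxes, truncation=True, return_tensors="pt"
)
```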