Assess usage of LayoutLM for extracting structural elements of PDFs

Open bogdankostic opened this issue 2 years ago • 7 comments

Is your feature request related to a problem? Please describe. LayoutLM is a transformer-based model that can take PDFs as input and perform different tasks on them. We should assess whether we can use LayoutLM to convert PDF files to Documents. For this, we should check whether a suitable fine-tuned model already exists. If not, it might be necessary to fine-tune a new one for our needs.

One dataset that might be interesting for fine-tuning is DocLayNet, a dataset consisting of a variety of different PDFs labeled with regard to their layout.

bogdankostic avatar Aug 17 '22 12:08 bogdankostic

Hello @bogdankostic and sorry for the intrusion. During the assessment, you might as well have a look at Donut. It looks interesting, even if I don't know how mature it is...

anakin87 avatar Aug 18 '22 09:08 anakin87

LayoutLM models work well for invoices but not for general documents. I worked on a similar use case and used DiT, but I found that PaddleOCR's layoutparser model works better and faster for structure recognition. I used bounding boxes to compare and map text to layout boxes. Happy to help with this feature!!
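
A minimal sketch of what such a bounding-box mapping could look like, assuming both the OCR text and the detected layout regions come as `(x0, y0, x1, y1)` boxes in the same page coordinates. The function names, the `min_overlap` threshold, and the input format are hypothetical and not taken from PaddleOCR or any other library; each text box is assigned to the layout region that covers the largest share of its area.

```python
from typing import Dict, List, Optional, Tuple

# Assumed box format: (x0, y0, x1, y1) in shared page coordinates.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as zero."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def intersection_area(a: Box, b: Box) -> float:
    """Area of the overlap between two boxes (zero if they do not overlap)."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return box_area(overlap)


def map_text_to_layout(
    text_boxes: List[Tuple[str, Box]],
    layout_boxes: List[Tuple[str, Box]],
    min_overlap: float = 0.5,  # hypothetical threshold: share of the text box covered
) -> Tuple[Dict[int, List[str]], List[str]]:
    """Assign each text snippet to the layout region that covers it best.

    Returns a dict from layout index to the texts assigned to it, plus a list
    of texts that no region covers by at least `min_overlap`.
    """
    mapping: Dict[int, List[str]] = {i: [] for i in range(len(layout_boxes))}
    unassigned: List[str] = []
    for text, tbox in text_boxes:
        t_area = box_area(tbox)
        best_idx: Optional[int] = None
        best_ratio = 0.0
        for i, (_, lbox) in enumerate(layout_boxes):
            ratio = intersection_area(tbox, lbox) / t_area if t_area > 0 else 0.0
            if ratio > best_ratio:
                best_idx, best_ratio = i, ratio
        if best_idx is not None and best_ratio >= min_overlap:
            mapping[best_idx].append(text)
        else:
            unassigned.append(text)
    return mapping, unassigned
```

Normalizing by the text box area rather than using plain IoU is a deliberate choice in this sketch: a short line of text inside a large paragraph region has a tiny IoU but is still fully contained in that region.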

0-hero avatar Aug 30 '22 05:08 0-hero

Having submitted #1404 in 2021, I was excited to see some movement on this topic!

Note that this subfield has moved quickly. If you're still evaluating transformer models for this task, I think UDOP looks to be the most promising recent model, and it will hopefully be on HuggingFace soon: https://github.com/huggingface/transformers/issues/20650. Unfortunately, the Microsoft team that trained the model says on their repo that "Due to fake document generation ethical consideration, we plan to release this functionality as an Azure API", so I guess model weights will have to come from elsewhere...

hammer avatar Jan 10 '23 10:01 hammer

Hi @hammer, thanks for your interest in UDOP. We've released the encoder + text decoder model weights at https://huggingface.co/ZinengTang/Udop. By "Due to fake document...", we mean that we need to release the vision decoding (i.e. the document image generation functionality) in a more responsible way, with ethical considerations in mind.

ziyi-yang avatar Feb 19 '23 19:02 ziyi-yang

@bogdankostic @bglearning Could you share an update on Document VQA here? I know you briefly worked on it and did some research recently. 🙂

julian-risch avatar Mar 14 '23 13:03 julian-risch

> Hello @bogdankostic and sorry for the intrusion. During the assessment, you might as well have a look at Donut. It looks interesting, even if I don't know how mature it is...

Meta AI released Nougat; its current codebase is built on top of Donut. It looks promising, though it is mostly optimized for 'scientific' documents...

PAHXO avatar Sep 05 '23 17:09 PAHXO