
Multimodal approach to LP's Deep Layout Parsing capability

Open nasheedyasin opened this issue 3 years ago • 3 comments

Motivation: When parsing the layout of forms and other structured documents, I have noticed that relying only on the image features of a region of interest can lead to quite a few false positives. With a multimodal approach that also takes into account the text inside these regions to form a richer representation, we could considerably improve performance over the existing pure object detection methodology.

Of course, this is relevant only for structured documents like forms and invoices. But I'm guessing a vast majority of your users, much like myself, would be interested in such a feature.
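To make the idea concrete, here is a minimal sketch of the simplest form of this: late fusion of a detector's visual confidence with a score derived from the text inside the region. This is not Layout Parser's API, and it is a deliberate simplification of the "richer representation" described above (it fuses scores, not features). `Region`, `TEXT_CUES`, `text_score`, `fused_score`, and `filter_regions` are all hypothetical names, and the keyword lookup only stands in for a learned text encoder.

```python
from dataclasses import dataclass


@dataclass
class Region:
    label: str           # class predicted by the visual detector
    visual_score: float  # detector confidence in [0, 1]
    text: str            # OCR text found inside the region


# Hypothetical keyword cues per class; a real system would use a learned
# text encoder instead of string matching.
TEXT_CUES = {
    "date_field": ("date", "dob", "dd/mm"),
    "signature": ("signature", "sign here"),
}


def text_score(region: Region) -> float:
    """Return 1.0 if any cue for the predicted label occurs in the text, else 0.0."""
    cues = TEXT_CUES.get(region.label, ())
    lowered = region.text.lower()
    return 1.0 if any(cue in lowered for cue in cues) else 0.0


def fused_score(region: Region, text_weight: float = 0.5) -> float:
    """Weighted average of the visual and text scores (simple late fusion)."""
    return (1.0 - text_weight) * region.visual_score + text_weight * text_score(region)


def filter_regions(regions, threshold: float = 0.6, text_weight: float = 0.5):
    """Keep only the regions whose fused score reaches the threshold."""
    return [r for r in regions if fused_score(r, text_weight) >= threshold]


# Two detections with the same visual confidence; only the first one has
# text that agrees with its predicted label.
regions = [
    Region("date_field", 0.8, "Date of birth"),
    Region("signature", 0.8, "Total amount due"),
]
kept = filter_regions(regions)
```

With these made-up inputs, the detection whose text contradicts its label falls below the threshold and is dropped, which is the kind of false positive a purely visual detector cannot catch on its own.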

PS: Would love to work on developing such a feature with you all.

For reference: a form like this.

@lolipopshock

nasheedyasin, Jun 20 '21 18:06

Thanks! As mentioned in the layout-parser paper, this is the direction we are working on right now. I'll share more information in this thread when there are updates.

lolipopshock, Jul 06 '21 15:07

Hello everyone, any update on this front? For some documents, it is really impossible to identify the correct layout without incorporating the semantic context of the text.

alejandrojcastaneira, Jul 05 '22 10:07

> Hello everyone, any update on this front? For some documents, it is really impossible to identify the correct layout without incorporating the semantic context of the text.

The best way forward would be to integrate Hugging Face-based models like LayoutLMv3 into the Layout Parser ecosystem. I believe there was some work in this direction. @lolipopshock will be able to tell you more.

nasheedyasin, Jul 05 '22 10:07