layout-parser

How can I extract titles, headers, photos, and the corresponding article text from a newspaper?

Open karndeepsingh opened this issue 2 years ago • 1 comments

Hi, I have been trying to use the Newspaper Navigator model in my application. It detects regions such as a title or a whole article, but for my use case I want to extract each title together with the paragraphs below it. How can I do that? Is there any tutorial available that covers this?

Thanks

karndeepsingh, Mar 11 '23 09:03

You are asking for a complete document layout task! This is not an issue, it's a task. Combine object detection (the bigger bboxes) with pdf_parser output (bboxes for every word or line), then filter the line/word output by the bigger boxes predicted by the vision model. You can leverage spatial correlation (sort by width, then height) to identify words on the same line, or a heading above a paragraph: a heading is usually a one-liner, identifiable as a bbox with a bigger area than the other lines, and the height of the heading is less than the height of the paragraph. Hope that helps 👯

nkoudounas, Oct 04 '23 09:10
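
The approach described in the reply above could be sketched roughly as follows. This is a minimal, library-free illustration and not layout-parser's API: the `Box` class, the helper names, and the sample coordinates are all hypothetical. It assumes the detector returns one region box per article and the parser returns one box per word, both in the same coordinate system with the origin at the top left. As a stand-in for the "bigger area" heuristic, it treats the line with the tallest words as the heading; it does not check the heading height against the paragraph height.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box:
    """Axis-aligned box (hypothetical type); origin assumed at top left."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str = ""

    @property
    def height(self) -> float:
        return self.y1 - self.y0

    @property
    def center(self) -> Tuple[float, float]:
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


def contains(region: Box, word: Box) -> bool:
    """True if the word's center falls inside the detected region."""
    cx, cy = word.center
    return region.x0 <= cx <= region.x1 and region.y0 <= cy <= region.y1


def group_into_lines(words: List[Box]) -> List[List[Box]]:
    """Group words into lines: sort top to bottom, then left to right,
    and start a new line when a word's center leaves the current line."""
    lines: List[List[Box]] = []
    for w in sorted(words, key=lambda b: (b.y0, b.x0)):
        cy = w.center[1]
        if lines and lines[-1][0].y0 <= cy <= lines[-1][0].y1:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [sorted(line, key=lambda b: b.x0) for line in lines]


def split_heading_and_body(region: Box, words: List[Box]) -> Tuple[str, str]:
    """Assumed heuristic: the line with the tallest words is the heading,
    and every line below it is body text."""
    lines = group_into_lines([w for w in words if contains(region, w)])
    if not lines:
        return "", ""
    heights = [max(w.height for w in line) for line in lines]
    idx = heights.index(max(heights))
    heading = " ".join(w.text for w in lines[idx])
    body = " ".join(w.text for line in lines[idx + 1:] for w in line)
    return heading, body


# Made-up sample data: one detected article region and parsed word boxes.
region = Box(0, 0, 200, 100)
words = [
    Box(10, 5, 80, 25, "Council"), Box(85, 5, 150, 25, "Votes"),
    Box(10, 30, 40, 40, "The"), Box(45, 30, 90, 40, "city"),
    Box(10, 45, 60, 55, "council"), Box(65, 45, 100, 55, "met."),
    Box(300, 5, 360, 25, "Other"),  # lies outside the region, so it is dropped
]
heading, body = split_heading_and_body(region, words)
print(heading)
print(body)
```

On real scans the line grouping and the heading rule would likely need tuning, for example a tolerance for skewed lines or a check that the heading spans a single line.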