[BUG] Images are extracted with padding

Open Harsh-Sensei opened this issue 11 months ago • 1 comments

I am using marker-pdf version 1.2.3. After extracting markdown and images from a pdf (link), the extracted images seem to be zero-padded, with the actual image in the bottom-left corner. (I didn't face this issue in earlier versions.)

Thanks!

Jan 07 '25 02:01 Harsh-Sensei

I've determined that it's an issue with pdftext.

The immediate issue seems to be that the figure's bbox is out of bounds. In marker.builder.document.DocumentBuilder, the figure bbox is normal-sized after layout_builder, but it goes far out of bounds after a call to line_builder. LineBuilder.merge_blocks seems to be the issue.

In particular, on the 2nd page, provider output objects at indexes 3 and 6 have large negative y values. For example, the string "(a) (b) (c) (d)\n" has the bbox [62.26817321777344, -175.2513427734375, 696.563720703125, 167.5281982421875]; the string "(3) 8 8\n" has the bbox [293.2364196777344, -177.6435546875, 733.03955078125, 165.135986328125]. These spans do indeed intersect the figure, and as a result the figure is resized to include these unusual text spans, pushing the figure's bbox out of bounds.
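To illustrate the mechanism, here is a minimal sketch of how a union-style merge drags a figure bbox out of bounds. The span bbox is the one quoted above; the figure bbox and the helper names (`intersects`, `union_bbox`) are hypothetical and are not marker's actual code.

```python
# Bboxes are [x0, y0, x1, y1] with the origin at the top-left of the page.

def intersects(a, b):
    """Return True if two bboxes overlap with non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def union_bbox(a, b):
    """Smallest bbox that contains both a and b."""
    return [min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3])]

# Hypothetical figure bbox as it might look after layout detection.
figure = [72.0, 80.0, 540.0, 300.0]

# Span bbox reported for "(a) (b) (c) (d)\n" (values from this issue).
span = [62.26817321777344, -175.2513427734375, 696.563720703125, 167.5281982421875]

# Merging an intersecting span expands the figure to cover the span too.
merged = union_bbox(figure, span) if intersects(figure, span) else figure
print(merged)
```

With these inputs the merged bbox inherits the span's negative y0, so the figure now extends above the top of the page.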

It looks like pypdfium2 reports normal bboxes, so the negative y values are being introduced within pdftext. Stepping inside PdfProvider.pdftext_extraction, the negative value is already present in pdftext's dictionary_output.

One way to address this in marker is as follows:

Modify PageGroup.add_initial_blocks so that it still merges lines, but only expands block bboxes as far as the page boundary.
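A rough sketch of that idea, assuming bboxes are [x0, y0, x1, y1] lists and the page size is known. The function names and the 612 x 792 page size are hypothetical, and this is not the actual body of PageGroup.add_initial_blocks.

```python
def clamp_bbox(bbox, page_width, page_height):
    """Clip a bbox so that it lies within [0, page_width] x [0, page_height]."""
    x0, y0, x1, y1 = bbox
    return [
        max(0.0, min(x0, page_width)),
        max(0.0, min(y0, page_height)),
        max(0.0, min(x1, page_width)),
        max(0.0, min(y1, page_height)),
    ]

def merge_within_page(block_bbox, line_bbox, page_width, page_height):
    """Expand block_bbox to cover line_bbox, but never past the page edges."""
    merged = [
        min(block_bbox[0], line_bbox[0]),
        min(block_bbox[1], line_bbox[1]),
        max(block_bbox[2], line_bbox[2]),
        max(block_bbox[3], line_bbox[3]),
    ]
    return clamp_bbox(merged, page_width, page_height)

# Hypothetical figure bbox and page size; the line bbox is taken from this issue.
figure = [72.0, 80.0, 540.0, 300.0]
line = [62.26817321777344, -175.2513427734375, 696.563720703125, 167.5281982421875]
print(merge_within_page(figure, line, 612.0, 792.0))
```

This only hides the symptom in marker; the out-of-range coordinates would still need to be fixed in pdftext.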

Feb 21 '25 03:02 conjuncts