[BUG] Images are extracted with padding
I am using marker-pdf version 1.2.3.
After extracting markdown and images from a PDF (link), the extracted images seem to be zero-padded, with the actual image in the bottom-left corner. (I didn't face this issue in earlier versions.)
Thanks!
I've determined that it's an issue with pdftext.
The immediate issue seems to be that the figure's bbox is out of bounds. In marker.builder.document.DocumentBuilder, the figure bbox is normal-sized after layout_builder, but it goes far out of bounds after a call to line_builder. LineBuilder.merge_blocks seems to be the issue.
In particular, on the 2nd page, provider output objects at indexes 3 and 6 have very negative y values. For example, the string "(a) (b) (c) (d)\n" has the bbox [62.26817321777344, -175.2513427734375, 696.563720703125, 167.5281982421875]; the string "(3) 8 8\n" has the bbox [293.2364196777344, -177.6435546875, 733.03955078125, 165.135986328125]. These spans do indeed intersect the figure, and as a result the figure is resized to include these unusual text spans, bringing the figure out of bounds.
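To illustrate the mechanism, here is a minimal sketch of how merging an intersecting span into a figure bbox drags it out of bounds. This is not marker's actual merge code: the helper names `intersects` and `union` and the figure bbox values are hypothetical; only the span bbox is taken from the output above.

```python
# Sketch only: bboxes are [x0, y0, x1, y1]. Helper names are hypothetical,
# not taken from marker or pdftext.

def intersects(a, b):
    """Return True if two bboxes overlap with non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def union(a, b):
    """Return the smallest bbox that contains both a and b."""
    return [min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3])]

# Hypothetical, in-bounds figure bbox (the real one is not quoted in this thread).
figure = [60.0, 80.0, 550.0, 300.0]

# Span bbox reported for "(a) (b) (c) (d)\n" on the 2nd page.
span = [62.26817321777344, -175.2513427734375, 696.563720703125, 167.5281982421875]

# Expanding the figure to cover the intersecting span pulls in the negative y0.
merged = union(figure, span) if intersects(figure, span) else figure
print(merged)
```

With these values the merged bbox inherits the span's negative y0 and its larger x1, which would be consistent with the padded image output.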
It looks like pypdfium2 reports normal bboxes. So negative y values are being introduced within pdftext. Stepping inside PdfProvider.pdftext_extraction, the negative value is present in pdftext's dictionary_output.
One way to address this in marker is as follows:
- Modify PageGroup.add_initial_blocks to merge lines, but only expand block bboxes up to the page boundary.
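A rough sketch of the clamping idea, not the actual change to PageGroup.add_initial_blocks: `clamp_bbox` is a hypothetical helper, and the page size used below (612 x 792) is an assumed value for illustration, not one read from the PDF in question.

```python
# Sketch only: clamp a merged bbox so it never extends past the page.
# clamp_bbox is a hypothetical helper, not part of marker's API.

def clamp_bbox(bbox, page_width, page_height):
    """Clamp an [x0, y0, x1, y1] bbox to the rectangle [0, 0, page_width, page_height]."""
    x0, y0, x1, y1 = bbox
    return [
        max(0.0, min(x0, page_width)),
        max(0.0, min(y0, page_height)),
        max(0.0, min(x1, page_width)),
        max(0.0, min(y1, page_height)),
    ]

# A bbox that was expanded by an out-of-bounds text span (values illustrative).
expanded = [60.0, -175.2513427734375, 696.563720703125, 300.0]

# Assumed page size for the example.
clamped = clamp_bbox(expanded, 612.0, 792.0)
print(clamped)
```

Clamping only hides the symptom in marker; the negative y values would still be present in pdftext's output.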