How to extract figures from a PDF?

Open Myfootnotsmelly opened this issue 2 years ago • 2 comments

After setup, I tried (1) doc.figures and (2) json.dump,

but the results showed only each figure box's position and its metadata. How can I get the figures themselves from the PDF?

Myfootnotsmelly, Dec 19 '23 08:12

Hey @Myfootnotsmelly, sorry, looks like a bug was introduced; addressing it in this pull request: https://github.com/allenai/papermage/pull/73

kyleclo, Mar 13 '24 22:03

Hihi please take a look at my response to this Issue https://github.com/allenai/papermage/issues/70

Yes, figures are represented by bounding boxes.

If you want the image crop of the figures, here's how you'd do it:

# get the bounding box of a figure
figure_box = doc.figures[0].boxes[0]

# get the image of the page that contains the figure, and its dimensions
page_id = figure_box.page
page_image = doc.images[page_id]
page_w, page_h = page_image.pilimage.size

# convert the relative box to absolute pixel coordinates (x1, y1, x2, y2)
figure_box_xy = figure_box.to_absolute(page_width=page_w, page_height=page_h).xy_coordinates

# crop the page image using PIL
figure_image = page_image.pilimage.crop(figure_box_xy)
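
For anyone curious what the conversion step amounts to, here is a minimal standalone sketch of it. It assumes the box is stored as fractions of the page size (left, top, width, height in the 0..1 range), which is what the to_absolute call above suggests. The function name and parameters are hypothetical and are not part of papermage.

```python
from typing import Tuple


def relative_box_to_pixels(
    left: float,
    top: float,
    width: float,
    height: float,
    page_width: int,
    page_height: int,
) -> Tuple[int, int, int, int]:
    """Scale a box given as page fractions to pixel corners (x1, y1, x2, y2).

    Hypothetical helper for illustration; assumes all box values are
    fractions of the page dimensions.
    """
    x1 = round(left * page_width)
    y1 = round(top * page_height)
    x2 = round((left + width) * page_width)
    y2 = round((top + height) * page_height)

    # Clamp to the page so a slightly oversized box cannot run past the edges.
    x1 = max(0, min(x1, page_width))
    y1 = max(0, min(y1, page_height))
    x2 = max(0, min(x2, page_width))
    y2 = max(0, min(y2, page_height))
    return (x1, y1, x2, y2)


if __name__ == "__main__":
    # A box covering the middle half of the width, starting halfway down the page.
    print(relative_box_to_pixels(0.25, 0.5, 0.5, 0.25, 800, 1000))
```

The returned tuple uses the (left, upper, right, lower) order that PIL's Image.crop expects, so it can be passed to crop directly.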

kyleclo, Mar 18 '24 17:03