pdfreader
How to extract vectorized figures?
Hello,
I'm trying to automate the extraction of figures from articles to easily integrate them in my reports, but I am not able to extract the vectorized figures.
For the rasterized figures, I can use the following script, although even then I am not extracting the inline images:
"""Extract all images from a PDF file."""
import argparse
import os

from pdfreader import SimplePDFViewer


def images_from_viewer(viewer) -> list:
    """Return all images from a PDF viewer.

    Args:
        viewer (SimplePDFViewer): A PDF viewer.

    Returns:
        list: A list of Pillow images.
    """
    images = []
    page_count = len(list(viewer.doc.pages()))
    # Iterating over the viewer renders one page canvas at a time
    for index, canvas in enumerate(viewer):
        print(f"On page {index + 1}/{page_count}", end="\r")
        page_images = canvas.images
        # print(f'Found {len(page_images)} images on page {index + 1}')
        for page_image in page_images.values():
            images.append(page_image.to_Pillow())
    print()
    return images


def save_images(images: list, path: str) -> None:
    """Save images to a path.

    Args:
        images (list): A list of Pillow images.
        path (str): Path prefix; files are saved as <path>_<index>.png.
    """
    for index, image in enumerate(images):
        image.save(f"{path}_{index}.png", format="png")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("pdf_path", help="Path to PDF file")
    parser.add_argument("image_path", help="Path to save images to")
    args = parser.parse_args()
    pdf_path = args.pdf_path
    image_path = args.image_path

    # Create the parent directory if needed; os.makedirs("") would raise,
    # so skip it when image_path has no directory component
    parent_dir = os.path.dirname(image_path)
    if parent_dir:
        os.makedirs(parent_dir, exist_ok=True)

    with open(pdf_path, "rb") as file:
        simple_viewer = SimplePDFViewer(file)
        extracted_images = images_from_viewer(simple_viewer)

    save_images(extracted_images, image_path)
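On the inline images mentioned above: as far as I can tell, the pdfreader canvas also exposes them as a list through `canvas.inline_images`, alongside the `canvas.images` dictionary used in the script. A minimal sketch, assuming both kinds of object provide `to_Pillow()`; the helper name `collect_canvas_images` is made up for illustration:

```python
def collect_canvas_images(canvas) -> list:
    """Return Pillow images for one page canvas.

    Assumes `canvas.images` is a dict of image XObjects and
    `canvas.inline_images` is a list of inline images, and that both
    kinds of object offer a `to_Pillow()` method.
    """
    sources = list(canvas.images.values()) + list(canvas.inline_images)
    return [source.to_Pillow() for source in sources]
```

Inside the page loop, `images.extend(collect_canvas_images(canvas))` would then replace the inner `for` loop.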
Any idea on how I could also extract the figures from a document like this one?
@Vincent-Stragier Well, a PDF can contain graphic objects that are technically "drawn" by commands in a coordinate system. PDF also supports various transformations of such graphics. At the moment, all graphic objects and commands are available as raw data (canvas.text_content; see, for example, https://pdfreader.readthedocs.io/en/latest/examples/extract_page_text.html).
I can't say off the top of my head how to convert those commands into pixelated images in an easy way.
Feel free to contribute to the project if you find anything.
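As a possible starting point, here is a rough sketch of what "drawn by commands in a coordinate system" means in practice. It is not part of pdfreader: it assumes you already have the raw content-stream commands as a plain string, and it converts only a small subset of the PDF path operators (`m`, `l`, `c`, `re`, `h`) into SVG path data, which keeps the figure vectorized instead of pixelating it. The function name `pdf_path_to_svg` is hypothetical, and transformations (`cm`), colours, clipping and the remaining operators are ignored.

```python
# Number of numeric operands taken by each supported path operator.
PATH_OPERATORS = {"m": 2, "l": 2, "c": 6, "re": 4, "h": 0}


def pdf_path_to_svg(content: str, page_height: float) -> str:
    """Convert a subset of PDF path operators to SVG path data.

    PDF puts the origin at the bottom-left and SVG at the top-left,
    so every y coordinate is flipped using `page_height`.
    """

    def point(x: float, y: float) -> str:
        return f"{x:g} {page_height - y:g}"

    operands = []  # numbers seen since the last operator
    parts = []     # SVG path segments
    for token in content.split():
        try:
            operands.append(float(token))
            continue
        except ValueError:
            pass  # not a number, so treat the token as an operator

        arity = PATH_OPERATORS.get(token)
        if arity is not None and len(operands) >= arity:
            args = operands[len(operands) - arity:]
            if token == "m":
                parts.append("M " + point(*args))
            elif token == "l":
                parts.append("L " + point(*args))
            elif token == "c":
                x1, y1, x2, y2, x3, y3 = args
                parts.append(
                    f"C {point(x1, y1)}, {point(x2, y2)}, {point(x3, y3)}"
                )
            elif token == "re":
                x, y, w, h = args
                parts.append(
                    f"M {point(x, y)} L {point(x + w, y)} "
                    f"L {point(x + w, y + h)} L {point(x, y + h)} Z"
                )
            elif token == "h":
                parts.append("Z")
        # Unsupported operators (painting, state, text) are skipped.
        operands.clear()
    return " ".join(parts)
```

For example, `pdf_path_to_svg("10 10 m 50 10 l 50 40 l h S", 100)` gives `"M 10 90 L 50 90 L 50 60 Z"`, which can be placed in the `d` attribute of an SVG `<path>` element. A real figure would also need the current transformation matrix applied to each point.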