pdfreader How to extract vectorized figures?

Open Vincent-Stragier opened this issue 11 months ago • 1 comment

Hello,

I'm trying to automate the extraction of figures from articles so that I can easily integrate them into my reports, but I am not able to extract the vectorized figures.

For the rasterized figures, I can use the following script, although even there I'm not extracting the inline images:

"""Extract all images from a PDF file."""
import argparse
import os

from pdfreader import SimplePDFViewer


def images_from_viewer(viewer) -> list:
    """Return all images from a PDF viewer.

    Args:
        viewer (SimplePDFViewer): A PDF viewer.

    Returns:
        list: A list of images.
    """
    images = []
    page_count = len(list(viewer.doc.pages()))

    for index, canvas in enumerate(viewer):
        print(f"On page {index + 1}/{page_count}", end="\r")
        page_images = canvas.images
        # print(f'Found {len(page_images)} images on page {index + 1}')

        for page_image in page_images.values():
            images.append(page_image.to_Pillow())

    print()

    return images


def save_images(images: list, path: str) -> None:
    """Save images to a path.

    Args:
        images (list): A list of images.
        path (str): A path to save images to.
    """
    for index, image in enumerate(images):
        image.save(f"{path}_{index}.png", format="png")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)

    parser.add_argument("pdf_path", help="Path to PDF file")
    parser.add_argument("image_path", help="Path to save images to")

    args = parser.parse_args()

    pdf_path = args.pdf_path
    image_path = args.image_path

    # Create the output directory if the image path includes one.
    # os.makedirs("") raises an error, so skip it for a bare file prefix.
    parent_dir = os.path.dirname(image_path)
    if parent_dir:
        os.makedirs(parent_dir, exist_ok=True)

    with open(pdf_path, "rb") as file:
        simple_viewer = SimplePDFViewer(file)
        extracted_images = images_from_viewer(simple_viewer)
        save_images(extracted_images, image_path)
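
The inline images that the script above skips could probably be collected in the same loop. This is only a minimal sketch: it assumes that each canvas exposes `inline_images` as a list and that its items offer `to_Pillow()` just like the values of `canvas.images`, which is my reading of the pdfreader documentation and not something verified on this document. The helper name `pillow_images_from_canvas` is made up for illustration.

```python
def pillow_images_from_canvas(canvas) -> list:
    """Collect XObject and inline images from one page canvas as Pillow images.

    Args:
        canvas: A page canvas as yielded by iterating a SimplePDFViewer.

    Returns:
        list: Pillow images, XObject images first, then inline images.
    """
    # Same access pattern as in images_from_viewer above.
    found = [image.to_Pillow() for image in canvas.images.values()]
    # Assumption: inline_images is a list whose items also offer to_Pillow().
    found.extend(image.to_Pillow() for image in canvas.inline_images)
    return found
```

In `images_from_viewer`, the inner `for` loop could then be replaced by `images.extend(pillow_images_from_canvas(canvas))`.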

Any idea how I could also extract the vectorized figures from a document like this one?
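
For context, vector figures are normally not stored as image objects at all: they are drawn by path operators inside the page content stream, which is presumably why `canvas.images` never lists them. The rough sketch below only detects such drawings by counting path-painting operators in an already decoded content stream. How to obtain that stream as a string is left out, the function names are hypothetical, and the tokenizer is deliberately naive (it skips literal strings but ignores hex strings, comments and inline image data).

```python
# Path-painting operators defined by the PDF content stream syntax.
PAINT_OPERATORS = {"S", "s", "f", "F", "f*", "B", "B*", "b", "b*"}


def strip_literal_strings(stream: str) -> str:
    """Drop (...) literal strings so their text is not mistaken for operators."""
    kept = []
    depth = 0  # nesting level of unescaped parentheses
    escaped = False  # True right after a backslash inside a string

    for char in stream:
        if depth:
            if escaped:
                escaped = False
            elif char == "\\":
                escaped = True
            elif char == "(":
                depth += 1
            elif char == ")":
                depth -= 1
            continue
        if char == "(":
            depth = 1
            kept.append(" ")  # keep neighbouring tokens separated
            continue
        kept.append(char)

    return "".join(kept)


def count_painted_paths(stream: str) -> int:
    """Count path-painting operators in a decoded content stream."""
    tokens = strip_literal_strings(stream).split()
    return sum(1 for token in tokens if token in PAINT_OPERATORS)
```

A page whose count is well above zero likely contains vector drawings. Turning those into standalone figures would still require rendering or cropping the page with another tool, which this sketch does not attempt.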

Aug 09 '23 08:08 Vincent-Stragier
