pdfreader icon indicating copy to clipboard operation
pdfreader copied to clipboard

How to extract vectorized figures?

Open Vincent-Stragier opened this issue 1 year ago • 1 comments

Hello,

I'm trying to automate the extraction of figures from articles to easily integrate them in my reports, but I am not able to extract the vectorized figures.

For the rasterized figures, I can use the following and even, I'm not extracting the inline images:

"""Extract all images from a PDF file."""
import argparse
import os

from pdfreader import SimplePDFViewer


def images_from_viewer(viewer) -> list:
    """Yield all images from a PDF viewer.

    Args:
        viewer (SimplePDFViewer): A PDF viewer.

    Returns:
        list: A list of images.
    """
    images = []
    page_count = len(list(viewer.doc.pages()))

    for index, canvas in enumerate(viewer):
        print(f"On page {index + 1}/{page_count}", end="\r")
        page_images = canvas.images
        # print(f'Found {len(page_images)} images on page {index + 1}')

        for page_image in page_images.values():
            images.append(page_image.to_Pillow())

    print()

    return images


def save_images(images: list, path: str) -> None:
    """Save images to a path.

    Args:
        images (list): A list of images.
        path (str): A path to save images to.
    """
    for index, image in enumerate(images):
        image.save(f"{path}_{index}.png", format="png")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)

    parser.add_argument("pdf_path", help="Path to PDF file")
    parser.add_argument("image_path", help="Path to save images to")

    args = parser.parse_args()

    pdf_path = args.pdf_path
    image_path = args.image_path

    # Ensure that the image path exists and create it if it doesn't
    parent_dir = os.path.dirname(image_path)
    os.makedirs(parent_dir, exist_ok=True)

    with open(pdf_path, "rb") as file:
        simple_viewer = SimplePDFViewer(file)
        extracted_images = images_from_viewer(simple_viewer)
        save_images(extracted_images, image_path)

Any idea on how I could also extract the figures from a document like this one?

Vincent-Stragier avatar Aug 09 '23 08:08 Vincent-Stragier

@Vincent-Stragier Well, PDF can contain graphic objects which are technically "drawn" by commands in a coordinate system. PDF also supports different transformations of such graphics. At this moment all graphic objects and commands are available as raw data (canvas.text_content, see for example https://pdfreader.readthedocs.io/en/latest/examples/extract_page_text.html)

I can't say from the top of my head now to convert those commands into pixelated images in an easy way.

Feel free to contribute to the project if you find anything.

maxpmaxp avatar Apr 29 '24 21:04 maxpmaxp