borb BUG: ImageExtraction not extracting all the images in pdf

Describe the bug

ImageExtraction does not extract all the images in a PDF.

To Reproduce

For a PDF file with 9 pages, there is one image on each of pages 6, 7, and 8 (page numbers start at 0).
ImageExtraction only detected the image on page 7 and ignored the images on pages 6 and 8.

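import typing

# Import paths assumed for borb 2.1.x; adjust them to your installed version.
from borb.pdf import Document, PDF
from borb.toolkit import ImageExtraction, SimpleTextExtraction

# file_path is the path to the input PDF (not shown in the report)

# read the Document
doc: typing.Optional[Document] = None
text_l: SimpleTextExtraction = SimpleTextExtraction()
image_l: ImageExtraction = ImageExtraction()

with open(file_path, "rb") as in_file_handle:
    doc = PDF.loads(in_file_handle, [text_l, image_l])

# check whether we have read a Document
assert doc is not None

images = []

# list the XObject names declared in each page's resources
for page in range(0, 9):
    resources = doc.get_page(page)["Resources"]
    if "XObject" in resources:
        for k, _ in resources["XObject"].items():
            print("%d\t%s" % (page, k))

# collect the images that ImageExtraction actually found, per page
for page, content in image_l.get_images().items():
    images += content
    print(f"image page: {page}")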

Expected behaviour

The ImageExtraction listener should return all the images.

Desktop

OS: Windows 10
borb version: 2.1.10

Apr 30 '23 08:04 luojunhui1

Please attach the input PDF

Apr 30 '23 19:04 jorisschellekens

@jorisschellekens I deleted some sensitive information from the original PDF, and the output is still not correct. The complete test code is below.

def test_pdf_with_borb(self):
    doc: typing.Optional[Document] = None
    text_l: SimpleTextExtraction = SimpleTextExtraction()
    image_l: ImageExtraction = ImageExtraction()

    file_path = PROJECT_DIR + "data/test/input_doc2.pdf"
    with open(file_path, "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [text_l, image_l])

    # check whether we have read a Document
    assert doc is not None

    images = []
    page_num = int(doc.get_document_info().get_number_of_pages())
    print(f"page num: {page_num}")

    # list the XObject names declared in each page's resources
    for page in range(0, page_num):
        resources = doc.get_page(page)["Resources"]
        if "XObject" in resources:
            for k, _ in resources["XObject"].items():
                print("%d\t%s" % (page, k))

    # collect the images that ImageExtraction actually found, per page
    for page, content in image_l.get_images().items():
        images += content
        print(f"image page: {page}")

the test output screenshot is

input_doc2.pdf
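
To make the mismatch explicit, the two loops in the test can be reduced to a set difference: pages that declare XObjects minus pages for which an image was returned. The sketch below uses plain Python data in place of borb objects; the function name and the sample data are hypothetical, and the dictionary shape (page number mapped to a list of images) is assumed from how get_images() is used in the test above. Note that an XObject can also be a form rather than an image, so a page in the result is only a candidate, not proof of a missed image.

```python
from typing import Dict, Iterable, List, Set


def pages_missing_images(
    pages_with_xobjects: Iterable[int],
    extracted: Dict[int, List[object]],
) -> Set[int]:
    """Return pages that declare XObjects but yielded no extracted image."""
    pages_with_images = {page for page, imgs in extracted.items() if imgs}
    return set(pages_with_xobjects) - pages_with_images


# Stand-in data shaped like the report: XObjects on pages 6, 7 and 8,
# but an image extracted only from page 7.
declared = [6, 7, 8]
extracted = {7: ["<image on page 7>"]}
missing = sorted(pages_missing_images(declared, extracted))
print(missing)  # [6, 8]
```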

May 01 '23 07:05 luojunhui1

I checked the images in your PDF. It turns out borb does not support them yet, which is why they are not extracted.

May 01 '23 08:05 jorisschellekens

What can I do to extract these images correctly? Could you give me any advice? Thanks a lot.

May 01 '23 15:05 luojunhui1

You would have to implement your own version of an ImageTransformer (package io and read).

Essentially you need to:

- identify when this transformer needs to be triggered
- work out what this transformer needs to do to convert the raw bytes to a PIL Image
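
As a rough illustration of those two steps, here is a minimal standard-library sketch. It is not borb code: the function names are hypothetical, and it assumes the raw bytes form a complete PNG file, which may not be true for the images in this PDF (the thread does not say which encoding they use). The first function is the trigger check; the second reads the image size from the header as a sanity check before conversion.

```python
import struct
from typing import Tuple

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(raw: bytes) -> bool:
    """Step 1: decide whether this handler should be triggered."""
    return raw.startswith(PNG_SIGNATURE)


def png_dimensions(raw: bytes) -> Tuple[int, int]:
    """Step 2 (partial): read width and height from the IHDR chunk."""
    if not looks_like_png(raw):
        raise ValueError("not a PNG stream")
    # Bytes 8-11 hold the chunk length and bytes 12-15 the chunk type;
    # width and height follow as two big-endian unsigned 32-bit integers.
    if len(raw) < 24 or raw[12:16] != b"IHDR":
        raise ValueError("missing IHDR chunk")
    width, height = struct.unpack(">II", raw[16:24])
    return width, height


# A hand-built PNG header (signature + start of IHDR) used as sample input.
sample = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 640, 480)
print(looks_like_png(sample), png_dimensions(sample))
```

With Pillow installed, the actual conversion of a complete PNG byte string would be PIL.Image.open(io.BytesIO(raw)); how that result is handed back to borb depends on its transformer interface, which is not shown here.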

May 02 '23 16:05 jorisschellekens

I also encountered this problem. My PDF contains some pictures in PNG format, and I found that they could not be extracted. I worked around it with the following steps:

1. Write a PngImageTransformer.
2. Write a new loads function modeled on PDF.loads().
3. Add code to insert a PngImageTransformer instance into ReadAnyObjectTransformer: readAnyObjectTransformer.get_children().insert(0, PngImageTransformer())
4. Get the images with the get_images function.

I should say that I am still learning the code, so this may not be the best solution.
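
The reason for inserting at index 0 can be shown with a small stand-in model. The sketch assumes the children are consulted in list order and the first one that accepts an object handles it; that is an assumption about borb's dispatch, not something confirmed in this thread. All class and function names below are hypothetical and are not borb's API.

```python
from typing import List, Optional

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


class Handler:
    """Stand-in for a transformer; not borb's actual base class."""

    def can_handle(self, obj: bytes) -> bool:
        raise NotImplementedError

    def handle(self, obj: bytes) -> str:
        raise NotImplementedError


class GenericHandler(Handler):
    """Accepts everything, like a catch-all at the end of the chain."""

    def can_handle(self, obj: bytes) -> bool:
        return True

    def handle(self, obj: bytes) -> str:
        return "generic"


class PngHandler(Handler):
    """Accepts only byte strings that start with the PNG signature."""

    def can_handle(self, obj: bytes) -> bool:
        return obj.startswith(PNG_SIGNATURE)

    def handle(self, obj: bytes) -> str:
        return "png"


def dispatch(children: List[Handler], obj: bytes) -> Optional[str]:
    """Return the result of the first child that accepts obj."""
    for child in children:
        if child.can_handle(obj):
            return child.handle(obj)
    return None


children: List[Handler] = [GenericHandler()]
# Inserting at the front gives the new handler first refusal.
children.insert(0, PngHandler())
print(dispatch(children, PNG_SIGNATURE + b"..."))
print(dispatch(children, b"something else"))
```

Had the PNG handler been appended instead, the catch-all handler would have claimed every object first under this model.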

Aug 23 '23 09:08 hdoer
