borb
BUG: ImageExtraction not extracting all the images in pdf
Describe the bug
ImageExtraction does not extract all the images in a PDF.
To Reproduce
- In a PDF file with 9 pages, pages 6, 7 and 8 (page numbers start at 0) each contain one image.
- ImageExtraction only detected the image on page 7 and ignored the images on pages 6 and 8.
# read the Document
doc: typing.Optional[Document] = None
text_l: SimpleTextExtraction = SimpleTextExtraction()
image_l: ImageExtraction = ImageExtraction()
with open(file_path, "rb") as in_file_handle:
    doc = PDF.loads(in_file_handle, [text_l, image_l])
# check whether we have read a Document
assert doc is not None
images = []
# list the XObject names found in each page's resources
for page in range(0, 9):
    if "XObject" in doc.get_page(page)["Resources"]:
        for k, v in doc.get_page(page)["Resources"]["XObject"].items():
            print("%d\t%s" % (page, k))
# collect the images that ImageExtraction actually returned
for page, content in image_l.get_images().items():
    images += content
    print(f"image page: {page}")
Expected behaviour
The ImageExtraction listener should return all the images.
Screenshots
Desktop (please complete the following information):
- OS: Windows10
- borb version 2.1.10
Please attach the input PDF
@jorisschellekens I deleted some sensitive information from the original PDF, and the output is still not correct. The complete test code is below:
def test_pdf_with_borb(self):
    doc: typing.Optional[Document] = None
    text_l: SimpleTextExtraction = SimpleTextExtraction()
    image_l: ImageExtraction = ImageExtraction()
    file_path = PROJECT_DIR + "data/test/input_doc2.pdf"
    with open(file_path, "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [text_l, image_l])
    # check whether we have read a Document
    assert doc is not None
    images = []
    page_num = int(doc.get_document_info().get_number_of_pages())
    print(f"page num: {page_num}")
    # list the XObject names found in each page's resources
    for page in range(0, page_num):
        if "XObject" in doc.get_page(page)["Resources"]:
            for k, v in doc.get_page(page)["Resources"]["XObject"].items():
                print("%d\t%s" % (page, k))
    # collect the images that ImageExtraction actually returned
    for page, content in image_l.get_images().items():
        images += content
        print(f"image page: {page}")
The test output screenshot is:
I checked the images in your PDF.
It turns out borb does not support them yet.
That's why they are not extracted.
What can I do to extract these images correctly? Could you give me any advice? Thanks a lot.
You would have to implement your own version of an ImageTransformer (package io and read).
Essentially you need to:
- identify when this transformer needs to be triggered
- determine what this transformer needs to do to convert the raw bytes to a PIL Image
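The two points above can be sketched roughly as follows. This is only an illustration and not borb's actual Transformer API: the function names `can_be_transformed` and `transform`, the plain-dict representation of the image stream dictionary, and the `_MODES` table are all hypothetical. It also assumes a FlateDecode image with 8 bits per component, a device colour space and no predictor. To stay within the standard library it stops short of building the PIL Image and returns the `(mode, size, data)` arguments that `PIL.Image.frombytes` expects.

```python
import zlib
from typing import Dict, Tuple

# Hypothetical lookup (not part of borb): maps (colour space, bits per
# component) to a Pillow mode string and the number of bytes per pixel.
_MODES = {
    ("/DeviceGray", 8): ("L", 1),
    ("/DeviceRGB", 8): ("RGB", 3),
    ("/DeviceCMYK", 8): ("CMYK", 4),
}


def can_be_transformed(stream_dict: Dict[str, object]) -> bool:
    """Step 1: decide whether this transformer should be triggered."""
    colour_space = stream_dict.get("ColorSpace")
    if not isinstance(colour_space, str):
        return False  # e.g. Indexed or ICCBased arrays are out of scope here
    return (
        stream_dict.get("Subtype") == "/Image"
        and stream_dict.get("Filter") == "/FlateDecode"
        and (colour_space, stream_dict.get("BitsPerComponent")) in _MODES
    )


def transform(
    stream_dict: Dict[str, object], raw: bytes
) -> Tuple[str, Tuple[int, int], bytes]:
    """Step 2: turn the raw stream bytes into PIL.Image.frombytes arguments."""
    mode, channels = _MODES[
        (stream_dict["ColorSpace"], stream_dict["BitsPerComponent"])
    ]
    size = (int(stream_dict["Width"]), int(stream_dict["Height"]))
    pixels = zlib.decompress(raw)  # assumes no PNG/TIFF predictor was applied
    expected = size[0] * size[1] * channels
    if len(pixels) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(pixels)}")
    return mode, size, pixels


# A made-up 2x2 red RGB image, compressed the way a FlateDecode stream would be.
demo_dict = {
    "Subtype": "/Image",
    "Filter": "/FlateDecode",
    "ColorSpace": "/DeviceRGB",
    "BitsPerComponent": 8,
    "Width": 2,
    "Height": 2,
}
demo_raw = zlib.compress(bytes([255, 0, 0] * 4))

if can_be_transformed(demo_dict):
    demo_mode, demo_size, demo_pixels = transform(demo_dict, demo_raw)
    # With Pillow installed: PIL.Image.frombytes(demo_mode, demo_size, demo_pixels)
```

Images that use other filters, indexed palettes or predictors would need extra handling in both steps.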
I also encountered this problem. My PDF contains some pictures in PNG format, and I found that they could not be extracted. I worked around it with the following steps:
- write a PngImageTransformer
- write a new loads function modeled on PDF.loads()
- in that function, insert a PngImageTransformer instance into the ReadAnyObjectTransformer: readAnyObjectTransformer.get_children().insert(0, PngImageTransformer())
- get the images using the get_images function
I have to say, I am still learning the code, so this may not be the best solution.
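Why insert at position 0 rather than append? A minimal sketch, assuming the parent transformer hands each object to the first child that accepts it. That dispatch rule is an assumption about how borb behaves, and every class below is a hypothetical stand-in rather than borb code:

```python
from typing import Any, List


class Transformer:
    """Hypothetical stand-in for a reader transformer; not borb's class."""

    def can_be_transformed(self, obj: Any) -> bool:
        raise NotImplementedError

    def transform(self, obj: Any) -> Any:
        raise NotImplementedError


class FallbackTransformer(Transformer):
    """Accepts everything, like a generic handler already in the chain."""

    def can_be_transformed(self, obj: Any) -> bool:
        return True

    def transform(self, obj: Any) -> Any:
        return ("fallback", obj)


class PngImageTransformer(Transformer):
    """Accepts only byte strings that start with the PNG file signature."""

    PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

    def can_be_transformed(self, obj: Any) -> bool:
        return isinstance(obj, bytes) and obj.startswith(self.PNG_SIGNATURE)

    def transform(self, obj: Any) -> Any:
        return ("png", obj)


class AnyObjectTransformer(Transformer):
    """Dispatches to the first child whose can_be_transformed returns True."""

    def __init__(self) -> None:
        self._children: List[Transformer] = [FallbackTransformer()]

    def get_children(self) -> List[Transformer]:
        return self._children

    def can_be_transformed(self, obj: Any) -> bool:
        return any(c.can_be_transformed(obj) for c in self._children)

    def transform(self, obj: Any) -> Any:
        for child in self._children:
            if child.can_be_transformed(obj):
                return child.transform(obj)
        raise ValueError("no child transformer accepts this object")


png_bytes = PngImageTransformer.PNG_SIGNATURE + b"rest-of-file"

# Appended: the catch-all child is asked first, so the PNG handler never runs.
appended = AnyObjectTransformer()
appended.get_children().append(PngImageTransformer())
appended_kind = appended.transform(png_bytes)[0]

# Inserted at index 0: the PNG handler gets the first chance to claim the object.
inserted = AnyObjectTransformer()
inserted.get_children().insert(0, PngImageTransformer())
inserted_kind = inserted.transform(png_bytes)[0]

print(appended_kind, inserted_kind)
```

Under that assumption, a more specific transformer has to sit ahead of any broader one that would otherwise claim the same object.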