docx: access images in document order
We want to include images in the docx partitioner's element stream.
- The stream should include an element for each (qualified) image embedded in the document.
- The image should be suitable for OCR and perhaps other purposes.
- Image elements should appear in document order, between text prior to the image and text following it.
python-docx doesn't yet provide ready access to embedded images the way python-pptx does.
Add a way to access these images that preserves their position in the document text, so that each image's element appears in the correct document order and carries accurate metadata such as page number.
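To make "document order" concrete, here is a minimal sketch that walks a .docx package with only the standard library. It is not python-docx's API and not how unstructured implements this; the function name iter_text_and_images is hypothetical. It assumes images are referenced by a:blip elements in word/document.xml and resolved through word/_rels/document.xml.rels, and it yields text per run rather than per paragraph.

```python
import io
import posixpath
import xml.etree.ElementTree as ET
import zipfile

# Namespaces used by WordprocessingML, DrawingML, and package relationships.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def iter_text_and_images(docx_bytes):
    """Yield ("text", str) and ("image", bytes) items in document order.

    Hypothetical helper: text is yielded one w:t fragment at a time.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        # Map relationship ids (rId1, ...) to part names inside the zip.
        rels = ET.fromstring(package.read("word/_rels/document.xml.rels"))
        targets = {}
        for rel in rels.iter(f"{{{PKG_REL}}}Relationship"):
            if rel.get("TargetMode") == "External":
                continue  # linked, not embedded; nothing to read from the zip
            target = rel.get("Target")
            if target.startswith("/"):
                targets[rel.get("Id")] = target.lstrip("/")
            else:
                targets[rel.get("Id")] = posixpath.normpath(
                    posixpath.join("word", target)
                )

        # iter() visits nodes depth-first, which matches reading order.
        document = ET.fromstring(package.read("word/document.xml"))
        for node in document.iter():
            if node.tag == f"{{{W}}}t" and node.text:
                yield "text", node.text
            elif node.tag == f"{{{A}}}blip":
                rel_id = node.get(f"{{{R}}}embed")
                if rel_id in targets:
                    yield "image", package.read(targets[rel_id])
```

Because text and images come out of a single traversal, an image always lands between the text before it and the text after it. Page numbers are not recoverable this way, since a .docx file does not store fixed page breaks for flowing text.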
I have this exact same need.
I've successfully managed to implement this manually using just python-docx and BytesIO, grabbing each image in order of appearance and processing them through our OCR.
Sadly, I can't share the code, as it belongs to the organization I work for.
The issue with my implementation is that it's really hard to grab both image data and tabular data in order at the same time. If unstructured supported both, I could just use it here, but it only extracts text and tabular data...
The solution (or rather, workaround) I have right now is to process each doc twice: once in my pipeline to get the text + images, and once in unstructured to get the text + tables. Then I join the two, ignoring the duplicated paragraphs.
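For anyone trying the same two-pass workaround, the join step could look roughly like the sketch below. This is not the commenter's code; merge_streams and the (kind, payload) tuple shape are hypothetical, and it assumes both passes produce the same paragraphs with identical text in the same order.

```python
def merge_streams(images_pass, tables_pass):
    """Merge two ordered streams that share the same text paragraphs.

    Each item is a (kind, payload) tuple. Items with kind == "text" are
    the shared paragraphs; anything else (image, table) is unique to
    one stream. Hypothetical helper for illustration only.
    """
    merged = []
    i = j = 0
    while i < len(images_pass) or j < len(tables_pass):
        # Emit the non-text items that precede the next paragraph.
        while i < len(images_pass) and images_pass[i][0] != "text":
            merged.append(images_pass[i])
            i += 1
        while j < len(tables_pass) and tables_pass[j][0] != "text":
            merged.append(tables_pass[j])
            j += 1

        if i < len(images_pass) and j < len(tables_pass):
            # Both streams now sit on a paragraph; keep a single copy.
            if images_pass[i][1] != tables_pass[j][1]:
                raise ValueError("paragraphs differ; streams cannot be aligned")
            merged.append(images_pass[i])
            i += 1
            j += 1
        elif i < len(images_pass):
            merged.append(images_pass[i])
            i += 1
        elif j < len(tables_pass):
            merged.append(tables_pass[j])
            j += 1
    return merged
```

One limitation: when an image and a table fall between the same two paragraphs, their relative order cannot be recovered from the two passes, so this sketch arbitrarily puts the image first.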
Closing, image extraction for Word docs is available in the API now.
@MthwRobinson, can you please provide a link to the documentation for the API that extracts images from docx? I cannot find it.
Thanks!
We need to add this to the docs, but if you use the hi_res strategy, the base64 encoding of the image will be available at metadata.image_base64. With the API client SDK, you can do something like the following. I think image extraction for docx may not have been promoted to our prod API yet, but it should be available within the next few weeks.
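from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

# UNSTRUCTURED_API_KEY must be defined before this line runs.
client = UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY)


def partition_docx(filename):
    # Read the document and wrap it in the SDK's file container.
    with open(filename, "rb") as f:
        files = shared.Files(
            content=f.read(),
            file_name=filename,
        )

    # The hi_res strategy is what populates metadata.image_base64.
    req = shared.PartitionParameters(
        files=files,
        strategy="hi_res",
    )

    try:
        resp = client.general.partition(req)
        return resp.elements
    except SDKError as e:
        # The error is returned rather than raised, so callers should check for it.
        return e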
And then get the image out with:
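import base64
from IPython.display import Image  # assumption: Image is IPython's notebook display class
# response is the list of elements returned by the partition call above
first_image = response[0].get('metadata').get('image_base64')
image_bytes = base64.b64decode(first_image)
Image(data=image_bytes)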
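Note that the first element is not necessarily an image. A small sketch for collecting every image instead, assuming the elements are plain dicts shaped like the ones in the snippet above (extract_images is a hypothetical helper, not part of the SDK):

```python
import base64


def extract_images(elements):
    """Return (index, image_bytes) for each element that carries image data.

    Assumes each element is a dict whose optional "metadata" dict may
    hold a base64 string under "image_base64".
    """
    images = []
    for index, element in enumerate(elements):
        encoded = (element.get("metadata") or {}).get("image_base64")
        if encoded:
            # Keep the index so the image can be placed back in order.
            images.append((index, base64.b64decode(encoded)))
    return images
```

Keeping each element's index alongside the decoded bytes preserves where the image sits relative to the surrounding text elements.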