docx: access images in document order

Open scanny opened this issue 2 years ago • 1 comments

We want to include images in the docx partitioner element stream.

  • The stream should include an element for each (qualified) image embedded in the document.
  • The image should be suitable for OCR and perhaps other purposes.
  • Image elements should appear in document order, between text prior to the image and text following it.

python-docx doesn't yet provide ready access to embedded images the way python-pptx does.

Add a way to access these images that preserves their location in the document text such that their corresponding document element can appear in correct document order and with accurate page number, etc.
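
As a rough illustration of what "document order" means here, the following is a minimal standard-library sketch, not the approach unstructured or python-docx takes. It assumes a transitional-namespace .docx whose pictures are DrawingML `a:blip` elements referenced from `word/document.xml`; VML images, headers, footers, and page numbers are ignored. The function name `iter_text_and_images` is hypothetical.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by WordprocessingML, DrawingML, and package relationships.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
A = "http://schemas.openxmlformats.org/drawingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG_REL = "http://schemas.openxmlformats.org/package/2006/relationships"


def _part_name(target):
    """Resolve a relationship Target to a member name inside the zip package."""
    if target.startswith("/"):
        return target.lstrip("/")  # absolute part name
    return posixpath.normpath(posixpath.join("word", target))  # relative to word/


def iter_text_and_images(docx_bytes):
    """Yield ("text", str) per text run and ("image", bytes) per picture, in document order."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        # Map relationship ids (rId1, ...) to the image parts they point at.
        rels = ET.fromstring(zf.read("word/_rels/document.xml.rels"))
        targets = {
            rel.get("Id"): _part_name(rel.get("Target"))
            for rel in rels.iter(f"{{{PKG_REL}}}Relationship")
            if rel.get("TargetMode") != "External"
        }
        root = ET.fromstring(zf.read("word/document.xml"))
        # Element.iter() walks depth-first, which matches document order.
        for node in root.iter():
            if node.tag == f"{{{W}}}t" and node.text:
                yield ("text", node.text)
            elif node.tag == f"{{{A}}}blip":
                rid = node.get(f"{{{R}}}embed")
                if rid in targets:
                    yield ("image", zf.read(targets[rid]))
```

Because text and images come out of a single walk, each image sits between the text that precedes and follows it, which is the ordering property this issue asks for.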

scanny avatar Oct 02 '23 18:10 scanny

I have this exact same need. I've managed to implement this manually using just python-docx and BytesIO, grabbing each image in order of appearance and running it through our OCR. Sadly, I can't share the code because it belongs to the organization I work for.

The issue with my implementation is that it's really hard to grab both image data and tabular data in order at the same time. If unstructured supported both, I could just use it in this case, but it only extracts text and tabular data...

The solution (or rather, workaround) I have right now is to process each doc twice: once in my pipeline to get the text + images, and once in unstructured to get the text + tables. Then I join the two, ignoring the duplicated paragraphs.
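
For anyone trying the same two-pass workaround, the join step could look roughly like the sketch below. This is not the code referred to above. It assumes each stream is a list of hashable `(kind, value)` tuples and that a paragraph shared by both streams has exactly the same text in each. The name `merge_streams` is hypothetical.

```python
from difflib import SequenceMatcher


def merge_streams(with_images, with_tables):
    """Merge two ordered element lists, emitting paragraphs common to both only once."""
    # autojunk=False stops difflib from discarding frequently repeated items.
    matcher = SequenceMatcher(None, with_images, with_tables, autojunk=False)
    merged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Shared paragraphs act as sync points and are kept once.
            merged.extend(with_images[i1:i2])
        else:
            # Items present in only one stream (images or tables). When both
            # streams differ at the same spot, their relative order is a guess.
            merged.extend(with_images[i1:i2])
            merged.extend(with_tables[j1:j2])
    return merged
```

Exact matching is fragile: if the two pipelines normalize whitespace or split paragraphs differently, the text would need to be normalized before comparison.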

GustavoSept avatar Mar 19 '24 10:03 GustavoSept

Closing, image extraction for Word docs is available in the API now.

MthwRobinson avatar Jun 13 '24 13:06 MthwRobinson

@MthwRobinson, can you please provide a link to the documentation with the API to extract images from docx? I can not find it.

Thanks!

dalessioluca avatar Jun 13 '24 14:06 dalessioluca

We need to add this to the docs, but if you use the hi_res strategy, the base64 encoding for the image will be available at metadata.image_base64. With the API client SDK, you can do something like the following. I think image extraction for docx may not have been promoted to our prod API yet, but it should be available within the next few weeks.


    from unstructured_client import UnstructuredClient
    from unstructured_client.models import shared
    from unstructured_client.models.errors import SDKError

    # UNSTRUCTURED_API_KEY and filename must be defined by the caller.
    client = UnstructuredClient(api_key_auth=UNSTRUCTURED_API_KEY)

    with open(filename, "rb") as f:
        files = shared.Files(
            content=f.read(),
            file_name=filename,
        )

    # The hi_res strategy is what makes metadata.image_base64 available.
    req = shared.PartitionParameters(
        files=files,
        strategy="hi_res",
    )

    try:
        resp = client.general.partition(req)
        response = resp.elements
    except SDKError as e:
        # Report the failure and fall back to an empty element list.
        print(e)
        response = []

And then get the image out with:

    import base64
    from IPython.display import Image  # assumes a Jupyter/IPython session

    first_image = response[0].get('metadata').get('image_base64')
    image_bytes = base64.b64decode(first_image)
    Image(data=image_bytes)
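
The snippet above assumes the first element is an image. A minimal sketch for collecting every image in the response, assuming the elements are plain dicts as in the snippet and that only image elements carry metadata.image_base64, might look like this (the helper name `decode_images` is hypothetical):

```python
import base64


def decode_images(elements):
    """Return (element_index, image_bytes) for each element that has metadata.image_base64."""
    images = []
    for index, element in enumerate(elements):
        # Elements without metadata, or without an image payload, are skipped.
        encoded = (element.get("metadata") or {}).get("image_base64")
        if encoded:
            images.append((index, base64.b64decode(encoded)))
    return images
```

Keeping the element index alongside the bytes preserves each image's position in the element stream, so it can be lined up with the surrounding text.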

MthwRobinson avatar Jun 13 '24 14:06 MthwRobinson