
PPTX: Extract images

Open pstoeckle opened this issue 1 year ago • 4 comments

Currently, the script only extracts placeholders, i.e.,

![Content Placeholder 16](ContentPlaceholder16.jpg)

It would be nice if the tool exported the images (to the current folder or to a folder passed as an argument), so that they also show up in the markdown preview.


pstoeckle avatar Dec 16 '24 09:12 pstoeckle

Note: This issue also exists for Word (.docx) documents.

A potential improvement would be to export images in a format like ![](media/image1.png). This would allow users to simply unzip their Word document and retrieve the image from where Word stores it: media/image1.png.
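To illustrate the unzip idea, here is a minimal sketch using only the Python standard library. It assumes the usual Office Open XML layout, where embedded media sits under `word/media/` in a .docx and `ppt/media/` in a .pptx. The `extract_media` helper and its `output_dir` parameter are hypothetical names for this example and are not part of markitdown.

```python
import zipfile
from pathlib import Path, PurePosixPath


def extract_media(document_path, output_dir="media"):
    """Copy embedded media files out of a .docx or .pptx archive.

    Hypothetical helper, not part of markitdown.
    Returns the list of written paths, in archive order.
    """
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(document_path) as archive:
        for name in archive.namelist():
            parts = PurePosixPath(name).parts
            # Assumption: media lives directly under word/media/ (Word)
            # or ppt/media/ (PowerPoint); anything else is skipped.
            if len(parts) == 3 and parts[0] in ("word", "ppt") and parts[1] == "media":
                # Use only the final path component so nothing is
                # written outside output_dir.
                target = output / parts[2]
                target.write_bytes(archive.read(name))
                written.append(target)
    return written
```

Calling `extract_media("report.docx")` would write the files into a local `media/` folder, so a reference such as `![](media/image1.png)` in the generated markdown could resolve without unzipping by hand.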

It would also make it possible to automate this example: https://github.com/microsoft/markitdown/blob/81e3f24acd0049a59cd2dcb2d01d0a98cc57c734/README.md?plain=1#L50

AlbanOtt2 avatar Dec 16 '24 14:12 AlbanOtt2

Yes, better handling of images is on my to-do list. The original purpose of the library was to support text-only LLMs, so the original focus was on extracting image metadata (e.g., tags, xmp, iptc, captions, etc.). But there's clear value in saving the images to disk and supporting them directly.

afourney avatar Dec 16 '24 18:12 afourney

It would be good to have an option to export media (images, etc.) to a specific folder, as pandoc does.

dradoudine avatar Jan 07 '25 09:01 dradoudine

FYI and related to this issue: I've created PR #306, which describes images in PPTX files using LLMs

masquare avatar Jan 29 '25 08:01 masquare