
Images in docx files cannot be converted to md documents

Open keller31 opened this issue 8 months ago • 4 comments

Images in the document are converted to markup like the following, but the output is incomplete: the base64 content is missing. ![](data:image/jpeg;base64...)
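To check how many images in a converted file are affected, a small script can help. This is only a sketch, assuming the truncated form looks like the sample above (no comma and no payload after `base64`); the function name `count_image_uris` is made up for illustration and is not part of markitdown.

```python
import re

# Markdown image whose target is a data URI, e.g. ![alt](data:image/jpeg;base64,AAAA)
# Group 2 is the optional comma after "base64", group 3 is whatever follows it.
DATA_URI_IMAGE = re.compile(
    r"!\[[^\]]*\]\((data:image/[\w.+-]+;base64(,?)([^)]*))\)"
)


def count_image_uris(markdown: str) -> tuple[int, int]:
    """Return (complete, truncated) counts of data-URI images in the text."""
    complete = truncated = 0
    for match in DATA_URI_IMAGE.finditer(markdown):
        payload = match.group(3).strip()
        # Assumption: a complete URI has a comma followed by a real payload,
        # while a truncated one ends in "..." or has nothing after "base64".
        if match.group(2) == "," and payload and set(payload) != {"."}:
            complete += 1
        else:
            truncated += 1
    return complete, truncated
```

Running it over the converted Markdown text returns two counts, so a non-zero second value means some images lost their data.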

keller31 avatar Apr 28 '25 07:04 keller31

After reading the documentation, I found a solution: the keep_data_uris parameter makes the Markdown output retain the base64 content of each image.

keller31 avatar Apr 28 '25 07:04 keller31

Example: markitdown xxx.docx --keep-data-uris > xxx.md

keller31 avatar Apr 28 '25 07:04 keller31

There is a PR, https://github.com/microsoft/markitdown/pull/277, that looks to address this. I'm keen to get this functionality merged; it seems pretty important to me. I'll try to look at contributing code for it this week. If you can provide any further review on PR #277, I'll try to fork it and address the issues.

joshjm avatar Apr 28 '25 11:04 joshjm

Personally, I have a post-processing step that greps for the base64 data, generates a description, and then replaces the binary data with that description. It's a little fast and loose right now, but it has potential.
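A post-processing step along these lines might look roughly like the sketch below. It is not the actual script described above. The function name `replace_images_with_descriptions`, the `describe` callback (which would wrap whatever captioning model or heuristic you use), and the `[Image: ...]` output format are all assumptions made for illustration.

```python
import base64
import re
from typing import Callable

# Markdown image with a complete base64 data URI as its target.
DATA_URI_IMAGE = re.compile(
    r"!\[(?P<alt>[^\]]*)\]"
    r"\(data:(?P<mime>image/[\w.+-]+);base64,(?P<data>[A-Za-z0-9+/=\s]+)\)"
)


def replace_images_with_descriptions(
    markdown: str,
    describe: Callable[[bytes, str], str],
) -> str:
    """Replace each embedded image with a short text description.

    `describe` is a hypothetical callback that receives the decoded image
    bytes and the MIME type and returns a description string.
    """

    def _swap(match: re.Match) -> str:
        raw = base64.b64decode(match.group("data"))
        description = describe(raw, match.group("mime"))
        # Assumed output format: plain bracketed text instead of an image link.
        return f"[Image: {description}]"

    return DATA_URI_IMAGE.sub(_swap, markdown)
```

For a quick check, `describe` can be a stub such as `lambda raw, mime: f"{mime}, {len(raw)} bytes"`. Swapping in a real captioning call later should not require changing the regex or the replacement logic.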

joshjm avatar Apr 28 '25 11:04 joshjm