Enhance _markitdown.py to support embedding images in Markdown
This update enhances the DocxConverter class by adding functionality to extract and embed images when converting DOCX files to Markdown. The changes include:
- convert_image method: Handles image extraction, sanitizes the filename, and saves the image as a PNG in the specified output directory.
- convert method: Integrates the convert_image method with the Mammoth library's HTML conversion process, ensuring images are extracted and included in the final output.
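For readers unfamiliar with Mammoth's image hook, here is a minimal sketch of how such a callback could be wired in. It is not the code from this PR: make_image_handler and docx_to_html are hypothetical names, the image_N naming scheme is an assumption, and the sketch writes the original bytes with an extension guessed from the content type rather than converting to PNG. The only Mammoth calls used are mammoth.convert_to_html and mammoth.images.img_element.

```python
import itertools
import mimetypes
from pathlib import Path


def make_image_handler(output_dir):
    """Return a callback that saves each embedded image and returns its attributes.

    Hypothetical helper: the name and the image_N naming scheme are assumptions.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    counter = itertools.count(1)

    def handle(image):
        # Mammoth passes an object exposing .content_type and .open().
        ext = mimetypes.guess_extension(image.content_type or "") or ".bin"
        target = out / f"image_{next(counter)}{ext}"
        with image.open() as stream:
            target.write_bytes(stream.read())
        # The returned dict becomes the attributes of the <img> element.
        return {"src": target.as_posix()}

    return handle


def docx_to_html(docx_path, output_dir):
    """Convert a DOCX file to HTML, saving embedded images to output_dir."""
    import mammoth  # third-party dependency, imported lazily

    with open(docx_path, "rb") as docx_file:
        result = mammoth.convert_to_html(
            docx_file,
            convert_image=mammoth.images.img_element(make_image_handler(output_dir)),
        )
    return result.value
```

The resulting HTML can then be passed to whatever HTML-to-Markdown step the converter already uses, so that each img element becomes a Markdown image reference pointing at the saved file.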
@microsoft-github-policy-service agree
Looks promising. @MauroDruwel Can you please add some test cases?
Also, sanitizing filenames is a task that will come up a lot (and may already be implemented). I'm going to ask around for advice on how to handle this broadly and robustly (without regular expressions). Filenames may also have other restrictions (e.g., length) on some OSs.
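As one possible starting point for that discussion, here is a regex-free sketch based on a character allowlist. The function name, the allowlist, the 255-character default, and the "image" fallback are all assumptions, not an existing helper in this repository. It does not handle every OS-specific rule (for example, reserved device names on Windows).

```python
import string
from pathlib import PurePosixPath

# Assumed allowlist: ASCII letters, digits, hyphen, underscore, and dot.
SAFE_CHARS = set(string.ascii_letters + string.digits + "-_.")


def sanitize_filename(name, max_length=255, fallback="image"):
    """Return a conservative, filesystem-friendly version of name."""
    # Keep only the final path component so the result cannot escape a directory.
    name = PurePosixPath(name.replace("\\", "/")).name
    # Replace every character outside the allowlist (including spaces) with "_".
    cleaned = "".join(ch if ch in SAFE_CHARS else "_" for ch in name)
    # Strip leading and trailing dots and underscores (avoids "..", hidden files).
    cleaned = cleaned.strip("._") or fallback

    stem, dot, ext = cleaned.rpartition(".")
    if not dot:
        stem, ext = cleaned, ""
    else:
        ext = "." + ext
    if len(ext) >= max_length:
        # Pathological extension: fall back to a plain truncation.
        return cleaned[:max_length]
    # Truncate the stem so the extension survives the length limit.
    return stem[: max_length - len(ext)] + ext
```

An allowlist is deliberately stricter than any single OS requires; the trade-off is that non-ASCII names are flattened to underscores.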
Hi @afourney,
I've added the following improvements:
- Test cases: I've included some test cases, which you can view in the result here.
- Filename formatting: I've replaced spaces with underscores in filenames to ensure compatibility.
- Length limit checker: A checker has been added to enforce a length limit for filenames.
- File existence check: Now the script checks if the file already exists and appends a counter to avoid overwriting.
- Alt text formatting: I've fixed an issue where newlines in the image alt_text prevented images from rendering; the newlines are now removed.
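The file existence check and the alt text fix could look roughly like the sketch below. It is illustrative only: unique_path and clean_alt_text are hypothetical names, the _N suffix format is an assumption, and this version collapses newlines into single spaces rather than deleting them outright.

```python
from pathlib import Path


def unique_path(directory, filename):
    """Return a path in directory that does not exist yet, appending _1, _2, ..."""
    directory = Path(directory)
    candidate = directory / filename
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    while candidate.exists():
        candidate = directory / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate


def clean_alt_text(alt_text):
    """Collapse newlines and other whitespace runs so the alt text fits on one line."""
    # str.split() with no arguments splits on any run of whitespace.
    return " ".join((alt_text or "").split())
```

Note that checking for existence and then writing is not atomic, so two concurrent conversions into the same directory could still collide.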
Let me know if you need any further adjustments!
Hi @afourney, is there anything left that I need to do?