
Converting Embedded image from Documents

Open FeuRicardo opened this issue 1 year ago • 3 comments

Pull Request

Description

This PR introduces the following changes:

  1. Initialization of New Attributes:

    • Added _mlm_client and _mlm_model attributes to the PptxConverter class, initialized using the kwargs dictionary.
  2. Handling of Image Shapes:

    • Integrated a new method _convert_image_to_markdown to handle the conversion of image shapes to markdown within the presentation slides processing loop.
  3. Handling of Images within Data URIs:

    • Added a check that identifies image-type data URIs and, if an LLM model has been defined, converts the image to markdown.
  4. Addition of _convert_image_to_markdown Method:

    • Added a new method _convert_image_to_markdown to the PptxConverter class to convert image shapes to markdown format.
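
A minimal sketch of the data-URI check described in item 3, using only the standard library. The function names, the `describe` callback, and the shortened link target are assumptions made for illustration; they are not taken from the PR's actual code.

```python
import base64
import re
from typing import Callable, Optional, Tuple

# Matches base64-encoded image data URIs such as "data:image/png;base64,...."
_IMAGE_DATA_URI = re.compile(r"^data:(image/[\w.+-]+);base64,(.*)$", re.DOTALL)


def parse_image_data_uri(src: str) -> Optional[Tuple[str, bytes]]:
    """Return (mime_type, raw_bytes) for an image data URI, else None."""
    match = _IMAGE_DATA_URI.match(src.strip())
    if match is None:
        return None
    try:
        # validate=True rejects characters outside the base64 alphabet
        return match.group(1), base64.b64decode(match.group(2), validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return None


def image_src_to_markdown(
    src: str,
    alt: str,
    describe: Optional[Callable[[str, bytes], str]] = None,
) -> str:
    """Render an <img> source as markdown.

    `describe` is a hypothetical stand-in for the LLM call: it receives the
    MIME type and the decoded bytes and returns a text description.
    """
    parsed = parse_image_data_uri(src)
    if parsed is None or describe is None:
        # Not an image data URI, or no model configured: leave it untouched
        return f"![{alt}]({src})"
    mime_type, data = parsed
    description = " ".join(describe(mime_type, data).split())
    # Assumption of this sketch: drop the payload so the output stays short
    return f"![{description}](data:{mime_type};base64...)"
```

With no `describe` callback the source is passed through unchanged, which mirrors the "if the LLM model has been defined" condition above.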

Related Issue

Link to the related issue (if any).

Motivation and Context

  • The new attributes _mlm_client and _mlm_model are required for additional functionality.
  • The _convert_image_to_markdown method improves the handling of image shapes by converting them to markdown format, enhancing the overall functionality of the PptxConverter class.
  • Identifying and converting image-type data URIs improves the handling of documents (such as .docx) that contain embedded images, which benefits the _CustomMarkdownify class and the converters that depend on it.

How Has This Been Tested?

  • [ ] Unit tests
  • [ ] Integration tests
  • [x] Manual testing

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce.

Screenshots (if appropriate):

Types of changes

  • [ ] Bug fix
  • [x] New feature
  • [ ] Breaking change
  • [ ] Documentation update

Checklist:

  • [x] My code follows the code style of this project.
  • [ ] My change requires a change to the documentation.
  • [ ] I have updated the documentation accordingly.
  • [x] I have added tests to cover my changes.
  • [x] All new and existing tests passed.
  • [x] The title of my pull request is a short description of the requested changes.

Additional Notes

This new feature affects the conversion of .pptx, .docx, and .html files (including classes that extend these converters).

FeuRicardo, Dec 19 '24 21:12

please expand the pr description.

gagb, Dec 19 '24 22:12

please expand the pr description.

Pull Request

Description

This PR introduces the following changes:

  1. Initialization of New Attributes:

    • Added _mlm_client and _mlm_model attributes to the PptxConverter class, initialized using the kwargs dictionary.
  2. Handling of Image Shapes:

    • Integrated a new method _convert_image_to_markdown to handle the conversion of image shapes to markdown within the presentation slides processing loop.
  3. Addition of _convert_image_to_markdown Method:

    • Added a new method _convert_image_to_markdown to the PptxConverter class to convert image shapes to markdown format.
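
The three changes above could look roughly like the following sketch. It is self-contained and does not use python-pptx: `FakePicture` is a hypothetical stand-in for a slide picture shape, the `mlm_client` and `mlm_model` kwargs keys are assumed from the attribute names, and the client is modelled as a plain callable because the PR description does not say how the model is invoked.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class FakePicture:
    """Hypothetical stand-in for a picture shape on a slide."""
    name: str
    blob: bytes
    content_type: str = "image/png"


class PptxConverterSketch:
    """Illustrative only; not the PptxConverter class from the PR."""

    def __init__(self, **kwargs: Any) -> None:
        # Both attributes default to None so conversion works without a model
        self._mlm_client: Optional[Callable[..., str]] = kwargs.get("mlm_client")
        self._mlm_model: Optional[str] = kwargs.get("mlm_model")

    def _convert_image_to_markdown(self, shape: FakePicture) -> str:
        alt_text = shape.name
        if self._mlm_client is not None and self._mlm_model is not None:
            # Assumed call shape: the client returns a text description
            alt_text = self._mlm_client(
                model=self._mlm_model,
                image=shape.blob,
                mime_type=shape.content_type,
            )
        # Collapse whitespace so the alt text stays on a single line
        alt_text = " ".join(alt_text.split())
        # Strip non-word characters from the shape name to build a link target
        filename = re.sub(r"\W", "", shape.name) + ".png"
        return f"![{alt_text}]({filename})"
```

Inside the slide-processing loop, each picture shape would be passed to `_convert_image_to_markdown` and the returned string appended to the slide's markdown output.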

Related Issue

Link to the related issue (if any).

Motivation and Context

  • The new attributes _mlm_client and _mlm_model are required for additional functionality.
  • The _convert_image_to_markdown method improves the handling of image shapes by converting them to markdown format, enhancing the overall functionality of the PptxConverter class.

How Has This Been Tested?

  • [ ] Unit tests
  • [ ] Integration tests
  • [x] Manual testing

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce.

Screenshots (if appropriate):

Types of changes

  • [ ] Bug fix
  • [x] New feature
  • [ ] Breaking change
  • [ ] Documentation update

Checklist:

  • [x] My code follows the code style of this project.
  • [ ] My change requires a change to the documentation.
  • [ ] I have updated the documentation accordingly.
  • [x] I have added tests to cover my changes.
  • [x] All new and existing tests passed.
  • [x] The title of my pull request is a short description of the requested changes.

Additional Notes

Add any additional information or context.

FeuRicardo, Dec 20 '24 13:12

This is a mandatory feature. When can this PR be merged?

thiner, Feb 05 '25 09:02