haystack-core-integrations

New Converter based on Markitdown from Microsoft

Open paulmartrencharpro opened this issue 11 months ago • 3 comments

Summary and motivation

Markitdown is a new open-source library (MIT license) from Microsoft (https://github.com/microsoft/markitdown). It is a utility for converting various file types to Markdown, e.g. for indexing or text analysis.

It presently supports:

  • PDF (.pdf)
  • PowerPoint (.pptx)
  • Word (.docx)
  • Excel (.xlsx)
  • Images (EXIF metadata, and OCR)
  • Audio (EXIF metadata, and speech transcription)
  • HTML (special handling of Wikipedia, etc.)
  • Various other text-based formats (csv, json, xml, etc.)

It runs locally. It tries to describe images locally, or it can call an AI model to describe them.

In my initial testing, it works quite well on pdf, pptx, xlsx and docx files.

We could use it to build a converter: (pdf, pptx, xlsx or docx) -> Markdown -> Document.

Detailed design

We can do something similar to PDFMinerToDocument.

  • Take a file (pdf, pptx, xlsx or docx).
  • Call markitdown to convert it to Markdown.
  • Use MarkdownToDocument to convert the Markdown to a Document.
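
The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a finished integration: `convert_files`, `markitdown_to_markdown`, `markdown_to_documents`, `SUPPORTED_SUFFIXES` and the `source_path` meta key are hypothetical names made up for this sketch, and silently skipping unsupported file types is just one possible choice. The only library calls assumed are `MarkItDown().convert(...).text_content` from markitdown and `MarkdownToDocument` / `ByteStream.from_string` from Haystack 2.x. Both packages are imported lazily, so the two conversion steps can be swapped out.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, List

# File types proposed in this issue (assumption: matched by extension only).
SUPPORTED_SUFFIXES = {".pdf", ".pptx", ".xlsx", ".docx"}


def markitdown_to_markdown(path: Path) -> str:
    """Step 2: convert one file to Markdown text with markitdown."""
    # Lazy import: requires `pip install markitdown`.
    from markitdown import MarkItDown

    return MarkItDown().convert(str(path)).text_content


def markdown_to_documents(markdown_texts: List[str], metas: List[Dict]) -> list:
    """Step 3: pass the Markdown to Haystack's MarkdownToDocument."""
    # Lazy import: requires haystack-ai plus MarkdownToDocument's own dependencies.
    from haystack.components.converters import MarkdownToDocument
    from haystack.dataclasses import ByteStream

    # MarkdownToDocument reads from sources, so wrap each string in a ByteStream.
    streams = [ByteStream.from_string(text) for text in markdown_texts]
    return MarkdownToDocument().run(sources=streams, meta=metas)["documents"]


def convert_files(
    sources: Iterable[str],
    to_markdown: Callable[[Path], str] = markitdown_to_markdown,
    to_documents: Callable[[List[str], List[Dict]], list] = markdown_to_documents,
) -> list:
    """Run the pipeline: file -> Markdown -> Document."""
    # Step 1: keep only the supported file types; others are skipped.
    accepted = [p for p in map(Path, sources) if p.suffix.lower() in SUPPORTED_SUFFIXES]
    texts = [to_markdown(p) for p in accepted]
    # Record where each document came from (hypothetical meta key).
    metas = [{"source_path": str(p)} for p in accepted]
    return to_documents(texts, metas)
```

With both packages installed, `convert_files(["report.docx", "slides.pptx"])` would return a list of Haystack `Document` objects. A real integration would more likely wrap this logic in a Haystack `@component` class, as PDFMinerToDocument does.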

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

### Tasks
- [ ] The code is documented with docstrings and was merged in the `main` branch
- [ ] Docs are published at https://docs.haystack.deepset.ai/
- [ ] There is a GitHub workflow running the tests for the integration nightly and at every PR
- [ ] A label named like `integration:<your integration name>` has been added to this repo
- [ ] The [labeler.yml](https://github.com/deepset-ai/haystack-core-integrations/blob/main/.github/labeler.yml) file has been updated
- [ ] The package has been released on PyPI
- [ ] An integration tile has been added to https://github.com/deepset-ai/haystack-integrations
- [ ] The integration has been listed in the [Inventory section](https://github.com/deepset-ai/haystack-core-integrations#inventory) of this repo README
- [ ] There is an example available to demonstrate the feature
- [ ] The feature was announced through social media

paulmartrencharpro, Dec 17 '24 10:12

Thanks for sharing this idea, @paulmartrencharpro! It looks very interesting. As the Markitdown repo is only a month old and there is only a 0.0.1a2 pre-release, we'll need to see how much interest there is from the community in an integration for it. In the meantime, maybe someone in the community wants to build one. An alternative might be Docling, which was recently brought up in this discussion: https://github.com/deepset-ai/haystack/discussions/8614

julian-risch, Dec 17 '24 12:12

There was a 0.0.2 release of markitdown last week.

julian-risch, Mar 14 '25 16:03

Hello @julian-risch, I am interested in contributing to this. My understanding so far is that I need to create a repo of my own that implements the Markitdown converter. Is that correct? I'd appreciate your thoughts before I start building this. :)

srishti-git1110, Jul 13 '25 15:07