markitdown Add OCR fallback for scanned/non-searchable PDFs (#1156)

Description

Added OCR support to the PDF converter to handle scanned and non-searchable PDF files. When a PDF does not contain extractable text, the converter will now use OCR (via pytesseract and pdf2image) to extract text content from the PDF images.

Changes

Updated PdfConverter to first attempt text extraction with pdfminer as before.
If no text is found, the converter falls back to OCR using pytesseract and pdf2image.
Added clear error messages if OCR dependencies are missing.
Updated documentation/comments to include installation instructions for new dependencies.
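
The fallback order in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: the names `pdfminer_text`, `ocr_text`, `convert_pdf`, and `MissingOcrDependency` are hypothetical. Only `pdfminer.high_level.extract_text`, `pdf2image.convert_from_path`, and `pytesseract.image_to_string` are real library calls. The OCR imports are deferred so that PDFs with a text layer never require the optional dependencies.

```python
from typing import Callable


class MissingOcrDependency(RuntimeError):
    """Hypothetical error type; the exception class used in the PR is not shown."""


def pdfminer_text(pdf_path: str) -> str:
    # Imported lazily so this sketch can be loaded without pdfminer installed.
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path)


def ocr_text(pdf_path: str) -> str:
    # Import the optional OCR dependencies only when the fallback is needed.
    try:
        import pytesseract
        from pdf2image import convert_from_path
    except ImportError as exc:
        raise MissingOcrDependency(
            "OCR fallback requires pytesseract and pdf2image "
            "(pip install pytesseract pdf2image), plus the Tesseract "
            "and Poppler binaries on the system."
        ) from exc

    # Render each page to a PIL image, then run Tesseract on every page.
    pages = convert_from_path(pdf_path)
    return "\n\n".join(pytesseract.image_to_string(page) for page in pages)


def convert_pdf(
    pdf_path: str,
    extract: Callable[[str], str] = pdfminer_text,
    ocr: Callable[[str], str] = ocr_text,
) -> str:
    # First attempt: ordinary text extraction.
    text = extract(pdf_path)
    if text and text.strip():
        return text
    # Nothing but whitespace (for example, only form feeds): fall back to OCR.
    return ocr(pdf_path)
```

Passing `extract` and `ocr` as parameters is a choice made for this sketch so the branch logic can be exercised without real PDFs or OCR binaries.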

Example Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("scanned-document.pdf")
print(result.text_content)  # Will show OCR-extracted text if the PDF was not searchable

Related Issues

Closes #1156 — Pdf file conversion not working when pdf file is non scanable

May 25 '25 15:05 Sghosh1999

@microsoft-github-policy-service agree

May 25 '25 15:05 Sghosh1999

Thanks for the contribution. This looks promising. Let me do some testing.

NOTE: I'm not sure we should throw a dependency error if no text is found. What if the PDF just doesn't have text?
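
One possible way to handle that case, sketched here purely as an assumption and not as what the PR does: when no text is found and the OCR packages are not installed, warn and return empty text instead of raising. The helper names `ocr_available` and `text_or_ocr` are hypothetical.

```python
import importlib.util
import warnings
from typing import Callable, Sequence


def ocr_available(modules: Sequence[str] = ("pytesseract", "pdf2image")) -> bool:
    # find_spec returns None for a top-level module that is not installed.
    return all(importlib.util.find_spec(name) is not None for name in modules)


def text_or_ocr(
    extracted: str,
    run_ocr: Callable[[], str],
    available: Callable[[], bool] = ocr_available,
) -> str:
    # Text was found: OCR is not needed at all.
    if extracted.strip():
        return extracted
    # No text and no OCR packages: degrade gracefully instead of raising.
    if not available():
        warnings.warn(
            "No extractable text found and OCR dependencies are not installed; "
            "returning empty text."
        )
        return ""
    # No text, but OCR is installed: try it.
    return run_ocr()
```

With this shape, a PDF that genuinely has no text produces an empty result plus a warning, and a dependency error is never raised just because extraction came back empty.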

May 28 '25 16:05 afourney

I think this scenario will be rare, mostly something like the last page of a PDF. In 99% of cases, a PDF with no extractable text contains non-extractable content such as images or charts.

May 30 '25 21:05 Sghosh1999

I had thought about a very similar feature, but leveraging an optional llm_client instead.

https://github.com/microsoft/markitdown/pull/1285

Jun 06 '25 23:06 gjmveloso