Add OCR fallback for scanned/non-searchable PDFs (#1156)

Open Sghosh1999 opened this issue 7 months ago • 3 comments

Description

Added OCR support to the PDF converter to handle scanned and non-searchable PDF files. When a PDF does not contain extractable text, the converter will now use OCR (via pytesseract and pdf2image) to extract text content from the PDF images.

Changes

  • Updated PdfConverter to first attempt text extraction with pdfminer as before.
  • If no text is found, the converter falls back to OCR using pytesseract and pdf2image.
  • Added clear error messages if OCR dependencies are missing.
  • Updated documentation/comments to include installation instructions for new dependencies.
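
The fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the actual PdfConverter code from the PR: the function names `extract_with_pdfminer`, `extract_with_ocr`, `pdf_to_text` and the `MissingOcrDependencyError` class are hypothetical, and the OCR imports are deferred so that they are only needed when the fallback actually runs.

```python
from typing import Callable


class MissingOcrDependencyError(RuntimeError):
    """Hypothetical error type: OCR was needed but its packages are absent."""


def extract_with_pdfminer(path: str) -> str:
    # Imported lazily so this module loads even without pdfminer.six.
    from pdfminer.high_level import extract_text

    return extract_text(path) or ""


def extract_with_ocr(path: str) -> str:
    try:
        import pytesseract
        from pdf2image import convert_from_path
    except ImportError as exc:
        raise MissingOcrDependencyError(
            "OCR fallback requires the pytesseract and pdf2image packages "
            "(and the Tesseract and Poppler binaries they wrap)."
        ) from exc

    pages = convert_from_path(path)  # one PIL image per PDF page
    return "\n\n".join(pytesseract.image_to_string(page) for page in pages)


def pdf_to_text(
    path: str,
    text_extractor: Callable[[str], str] = extract_with_pdfminer,
    ocr_extractor: Callable[[str], str] = extract_with_ocr,
) -> str:
    """Try normal text extraction first; use OCR only if nothing was found."""
    text = text_extractor(path)
    if text.strip():
        return text
    return ocr_extractor(path)
```

Passing the extractors as parameters is only a convenience for this sketch: it lets the fallback decision be exercised without a real PDF or an installed OCR stack.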

Example Usage

from markitdown import MarkItDown

md = MarkItDown()
result = md.convert("scanned-document.pdf")
print(result.text_content)  # Will show OCR-extracted text if the PDF was not searchable

Related Issues

Closes #1156 — Pdf file conversion not working when pdf file is non scanable

Sghosh1999 avatar May 25 '25 15:05 Sghosh1999

@Sghosh1999 please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree

Sghosh1999 avatar May 25 '25 15:05 Sghosh1999

Thanks for the contribution. This looks promising. Let me do some testing.

NOTE: I'm not sure we should throw a dependency error if no text is found. What if the PDF just doesn't have text?
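
One possible way to address this, as a sketch only: treat missing OCR packages as a reason to warn and return empty text, rather than as an error. The helper names `ocr_available` and `resolve_empty_text` are hypothetical and are not part of the PR.

```python
import importlib.util
import warnings
from typing import Callable


def ocr_available() -> bool:
    """Return True if both optional OCR packages can be found."""
    return all(
        importlib.util.find_spec(name) is not None
        for name in ("pytesseract", "pdf2image")
    )


def resolve_empty_text(text: str, run_ocr: Callable[[], str], ocr_ready: bool) -> str:
    """Decide what to return once normal text extraction has finished."""
    if text.strip():
        return text  # normal extraction worked; OCR is never touched
    if not ocr_ready:
        # A PDF may simply contain no text, so warn instead of raising.
        warnings.warn(
            "No extractable text found and OCR dependencies are not "
            "installed; returning empty text."
        )
        return ""
    return run_ocr()
```

A caller would pass `ocr_ready=ocr_available()` and a zero-argument callable that performs the OCR pass.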

afourney avatar May 28 '25 16:05 afourney

> Thanks for the contribution. This looks promising. Let me do some testing.
>
> NOTE: I'm not sure we should throw a dependency error if no text is found. What if the PDF just doesn't have text?

I think this scenario will be rare, mostly something like the last page of a PDF. In 99% of cases, a PDF with no extractable text still has content in a non-extractable form, such as images or charts.

Sghosh1999 avatar May 30 '25 21:05 Sghosh1999

I had thought of a very similar feature, but leveraging an optional llm_client instead.

https://github.com/microsoft/markitdown/pull/1285

gjmveloso avatar Jun 06 '25 23:06 gjmveloso