Add OCR fallback for scanned/non-searchable PDFs (#1156)
Description
Added OCR support to the PDF converter to handle scanned and non-searchable PDF files. When a PDF does not contain extractable text, the converter will now use OCR (via pytesseract and pdf2image) to extract text content from the PDF images.
Changes
- Updated PdfConverter to first attempt text extraction with pdfminer as before.
- If no text is found, the converter falls back to OCR using pytesseract and pdf2image.
- Added clear error messages if OCR dependencies are missing.
- Updated documentation/comments to include installation instructions for new dependencies.
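The fallback flow listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names `extract_pdf_text`, `extract_with_pdfminer`, `extract_with_ocr`, and `MissingOcrDependencyError` are hypothetical, and the OCR libraries are imported lazily so the module still loads when they are not installed.

```python
from typing import Callable


class MissingOcrDependencyError(RuntimeError):
    """Hypothetical error raised when the OCR libraries are not installed."""


def extract_with_pdfminer(path: str) -> str:
    # Lazy import: pdfminer.six provides pdfminer.high_level.extract_text.
    from pdfminer.high_level import extract_text

    return extract_text(path)


def extract_with_ocr(path: str) -> str:
    try:
        import pytesseract
        from pdf2image import convert_from_path
    except ImportError as exc:
        raise MissingOcrDependencyError(
            "OCR fallback needs pytesseract and pdf2image "
            "(pip install pytesseract pdf2image), plus the Tesseract "
            "and Poppler system packages."
        ) from exc

    # Render each page to an image, then OCR the pages one by one.
    pages = convert_from_path(path)
    return "\n\n".join(pytesseract.image_to_string(page) for page in pages)


def extract_pdf_text(
    path: str,
    text_extractor: Callable[[str], str] = extract_with_pdfminer,
    ocr_extractor: Callable[[str], str] = extract_with_ocr,
) -> str:
    """Try regular text extraction first; use OCR only if it yields nothing."""
    text = text_extractor(path)
    if text and text.strip():
        return text
    return ocr_extractor(path)
```

Passing the extractors as parameters is only a convenience for this sketch: it lets the fallback decision be exercised with stand-in functions, without a real PDF or an OCR install.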
Example Usage
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("scanned-document.pdf")
print(result.text_content) # Will show OCR-extracted text if the PDF was not searchable
Related Issues
Closes #1156 — Pdf file conversion not working when pdf file is non scanable
Thanks for the contribution. This looks promising. Let me do some testing.
NOTE: I'm not sure we should throw a dependency error if no text is found. What if the PDF just doesn't have text?
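One possible way to handle the concern in the note above, sketched under assumptions: when no text is found and the OCR libraries are absent, warn and return empty text instead of raising. The helper names `ocr_available` and `text_or_ocr` are hypothetical and are not part of this PR.

```python
import importlib.util
import warnings
from typing import Callable, Sequence


def ocr_available(modules: Sequence[str] = ("pytesseract", "pdf2image")) -> bool:
    """Return True if every named top-level module can be found."""
    return all(importlib.util.find_spec(name) is not None for name in modules)


def text_or_ocr(extracted_text: str, run_ocr: Callable[[], str], has_ocr: bool) -> str:
    """Pick the result without treating an empty PDF as a dependency error."""
    if extracted_text.strip():
        return extracted_text
    if not has_ocr:
        # The PDF may simply contain no text, so warn rather than raise.
        warnings.warn(
            "No extractable text found and OCR dependencies are not installed; "
            "returning empty text."
        )
        return ""
    return run_ocr()
```

With this shape, a PDF that genuinely has no text still converts to an empty result, and the missing-dependency message becomes a hint rather than a failure.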
I think this scenario will be rare, mostly the last page of a PDF. In 99% of cases, a PDF with no extractable text holds non-extractable content such as images or charts.
I had thought about a very similar feature, but leveraging an optional llm_client instead.
https://github.com/microsoft/markitdown/pull/1285