[QUESTION] How to enable OCR in private RAG?
Hello, I would like to enable OCR for PDF files which contain only scans of text data. Is there any simple way to do this? I am using a private RAG based on your example (private-rag). Thanks in advance for your help.
Hey @rjakomin,
The quickest way may be to enable OCR for the Docling pipeline. You can try it on a single PDF page to see if the OCR quality is sufficient and to play around with the setup: https://ds4sd.github.io/docling/examples/full_page_ocr/ Then pass the correct Docling option to Pathway (let us know if this step causes any issues): https://pathway.com/developers/api-docs/pathway-xpacks-llm/parsers/#pathway.xpacks.llm.parsers.DoclingParser
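To try that first step in isolation, here is a minimal sketch following the full-page OCR example from the Docling docs linked above (the path `scan.pdf` is a placeholder, and you should double-check the option names against your installed Docling version):

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TesseractOcrOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Force OCR over the whole page, even if the PDF claims to have a text layer.
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = TesseractOcrOptions(force_full_page_ocr=True)

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)

result = converter.convert("scan.pdf")  # placeholder: one scanned PDF page
print(result.document.export_to_markdown())  # inspect the OCR quality here
```

Once the output looks good on a sample page, the same `PdfPipelineOptions` values can be passed to `DoclingParser` in Pathway.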
One possible alternative is to perform OCR with a multimodal LLM. You can set this up in several ways, including passing it as a DoclingParser option to Pathway. You'd have to benchmark local multimodal LLMs to pick one for your use case. I'm leaving just a relevant Google search link here, since category winners get outdated within weeks at the rate things are moving now: https://www.google.com/search?q=best+vision+language+model+ollama+for+ocr (the model currently in the spotlight still seems to be "Llama 3.2 Vision").
Whatever OCR library or local multimodal LLM service works best for your documents, integrating it into the Pathway pipeline is bound to be far easier than trying to repair low-quality OCR output after the fact.
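For the multimodal-LLM route, a hedged sketch using the `ollama` Python client (the model name and prompt are assumptions to adapt after benchmarking, `page.png` is a placeholder for one page rendered to an image, and a local Ollama server with the model pulled is required):

```python
import ollama  # pip install ollama; assumes a local Ollama server is running

response = ollama.chat(
    model="llama3.2-vision",  # assumption: swap in whichever vision model benchmarks best
    messages=[{
        "role": "user",
        "content": "Transcribe all text visible in this scanned page, verbatim.",
        "images": ["page.png"],  # placeholder: one PDF page rendered to an image
    }],
)
print(response["message"]["content"])  # the OCR-style transcription
```

The transcription step can then be wrapped in a Pathway UDF or parser so it slots into the same pipeline as the Docling route.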
Thanks. I used this config in `app.yaml`:

```yaml
$parser: !pw.xpacks.llm.parsers.DoclingParser
  cache_strategy: !pw.udfs.DefaultCache
  pdf_pipeline_options:
    do_ocr: True
    do_table_structure: True
    do_cell_matching: True
```
It is a bit slow with large or many files, but it works.
Hi, just wanted to offer a quick insight here.
Your setup looks correct YAML-wise, but OCR in private RAG setups often fails silently when:
- the Docling parser never triggers OCR (due to a file-format fallback or a missing MIME hint);
- the PDF renders correctly, but no embedded text layer is detected, meaning the parser returns empty chunks;
- no post-verification step checks whether the OCR fallback succeeded (which is common in both Docling and LangChain pipelines).
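A basic version of that post-verification idea can be sketched in pure Python; the helper name and thresholds below are hypothetical, not part of Docling or Pathway:

```python
def ocr_output_suspicious(text: str, min_chars: int = 50, min_alnum_ratio: float = 0.5) -> bool:
    """Return True if extracted text looks like a failed or empty OCR pass."""
    stripped = text.strip()
    if len(stripped) < min_chars:  # nearly empty extraction
        return True
    alnum = sum(ch.isalnum() for ch in stripped)
    return alnum / len(stripped) < min_alnum_ratio  # mostly noise characters

# Example: flag parsed chunks that should be retried with full-page OCR.
chunks = [
    "",                                          # empty extraction
    "l1|| .. ~~ #@!",                            # garbage characters
    "This page describes the quarterly results in detail." * 3,
]
flagged = [i for i, c in enumerate(chunks) if ocr_output_suspicious(c)]
print(flagged)  # → [0, 1]
```

Chunks flagged this way can be routed back through a forced full-page OCR pass instead of silently entering the index.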
This is actually part of a larger issue we documented (ProblemMap No. 1): OCR not being enabled is one thing, but a pipeline that never realizes the OCR failed is worse.
We ended up writing a full fallback module with visual feedback + PDF-to-image quality grading, and even hooked it to a lightweight in-context auto-correction mechanism when OCR fails. Bonus: it's fully pluggable into YAML workflows (like yours).
If you're curious, the solution repo recently got a star from the creator of Tesseract.js (yes, the real one): https://github.com/bijection?tab=stars (we're on top).
Let me know if you’d like a drop-in hook — MIT licensed, production-tested.