
Docling parsing for scanned PDFs won't detect whitespace between words.

Open hisan-ideamaker opened this issue 1 month ago • 9 comments

Bug

When I use Docling with RapidOCR to parse PDFs, there are no spaces between words: instead of `this is the parsed response`, I get something like `thisistheparsed response`. At first I thought it was just one PDF, but it happens for basically any or most of the PDFs that are scanned, even though the scan quality is quite clear. I don't want to switch to EasyOCR, because it doesn't handle complex layouts as well as RapidOCR.

financial_reports_8.pdf

Steps to reproduce

Nothing crazy here, just the standard code:

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
import time

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = RapidOcrOptions(
    backend="openvino",
)

accelerator_options = AcceleratorOptions(
    num_threads=6, device=AcceleratorDevice.CUDA
)
pipeline_options.accelerator_options = accelerator_options

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
        )
    }
)

start = time.time()
result = converter.convert("dataset/ocr/record1.pdf")
markdown = result.document.export_to_markdown()
duration = time.time() - start
print(duration)

Docling version

Docling Version: 2.62.0

Python version

Python Version: 3.9.12

sample parsed output:

notestoconsolidatedfinancialstatements

Note23:ContingentLiabilities

We have been named in various lawsuits and intend to vigorously defend ourselves. While we cannot predict the outcome of these lawsuits, we believe thesematterswillnot have amaterial adverse effecton ourfinancial position, resutts ofoperations or cash flows.

Cosmetics.We were originally named as a defendant along with other department store and specialty retailers in nine separate but virtually identical classactionlawsuitsfiled invariousSuperior CourtsoftheStateof Californiain May.June and July 1998thathave nowbeen consolidated in Marin County state court.InMay 2000,plaintiffs filed an amended complaint naming a number of manufacturers of cosmetics and fragrances and two other retailers as additional defendants.Plaintiffs'amended Store"cosmetics sold in department and specialty storeswas collusively controlledbytheretailerandmanufacturerdefendants inviolation of theCartwrightActand theCaliforniaUnfairCompetitionAct.

Plaintiffs seek treble damages and restitution in an unspecified amount, attorneys'feesandprejudgmentinterest,on behalfofaclassofall California residentswho purchased cosmetics and fragrancesforpersonal usefromanyofthedefendantsduring theperiodfouryearsprior to the filing of the amended complaint.Defendants,including us,have answered the amended complaint denying the allegations. The defendants have produced documents and responded to plaintiffs' other discovery requests, including providing witnesses for depositions.

Weentered intoasettlementagreementwiththeplaintiffsand theother defendants on July 16,2003.In furtherance of the settlementagreement, thecasewasrefiled in theUnitedStatesDistrict Courtfor the Northern District of California on behalf of a class of all persons who currently reside in theUnitedStatesand whopurchased"DepartmentStorecosmetics fromthedefendantsduringtheperiodMay29.1994throughJuly16, 2003.The Court hasgivenpreliminary approval to the settlement.A summarynoticeofclasscertificationand thetermsof thesettlementhas been disseminated to class members.Ahearing on whether the Court willgrantfinalapprovalof thesettlementisscheduledforJune8,2004. IfapprovedbytheCourt,thesettlementwillresultintheplaintiffs'claims and the claims of all class members being dismissed, with prejudice,in theirentirety.In connectionwith thesettlement agreement, the defendants attormeys'fees. Our share of the cost of the settlement will not have a material adverseeffectonourfinancialcondition.

WashingtonPublicTrustAdvocates.Inearly 2002,wewerenamed asone of 30 defendants in Washington Public Trust Advocates, ex rel., et al. v. City ofSpokane,etal.,filed intheSpokaneCountySuperiorCourt,Stateof Washington.Plaintiff is a not-for-profit corporation bringing claims on Authority.Theclaimsrelate to theRiverParkSquare Malland Garage Project in Spokane,Washington (the"Project"],which includes a Nordstrom store.The portion ofthe complaint applicable to us seeks to recover from us the amount of a Department of Housing and Urban Development (HUD") loanmade to the developerof theProject.Damagesare sought in theamountof$22.75million,oralesseramounttotheextentthatthe HuDloanproceedswereusedfor theconstructionofthestoreandnot as tenant improvements.Other portions of the complaint seek to invalidate bonds issued to finance the public parking garage serving the Project, terminate agreements between the City of Spokane and the developer of the Project, amounts.The Complaintalsoallegesbreach of fiduciary dutiesbyvarious defendants,including us, to the people of the City of Spokane regarding lack of disclosures concerning the developer and the Project. By order datedAugust 9,2002,the court granted our motion todismissus from that lawsuit.Plaintiff attempted to obtain directreviewby theWashingtonSupreme Courtwhich declined tohear thecase andreferredit to theWashington Court of Appeals. On May 20, 2003, the Washington Court of Appeals affirmed our dismissal.

Bug

I've heard very good reviews about Docling, and I'm wondering if I'm doing something wrong. I looked through the existing issues and didn't find anything; either I'm unlucky enough to run into this issue or naive enough to get it wrong...

Please help, I really need this sorted for my project. (Also attaching the PDFs so that you can run them and verify.)

hisan-ideamaker avatar Nov 22 '25 20:11 hisan-ideamaker

This is a known issue with RapidOCR in Docling, especially when using the ONNX or OpenVINO backends, which can merge words due to poor text-line segmentation. Docling does not post-process OCR output to fix whitespace; it simply uses the text returned by RapidOCR as-is. Here's how you can improve word separation:

  • Switch to the torch backend for RapidOCR if possible. This backend is known to resolve the word-merging issue and supports GPU acceleration. Example:
    pipeline_options.ocr_options = RapidOcrOptions(
        backend="torch",
    )
    
  • Explicitly specify the detection, recognition, and classification model paths in RapidOcrOptions using the latest ONNX or Torch models (downloadable from Hugging Face or ModelScope). This can resolve segmentation issues with the ONNX/OpenVINO backends. Example:
    ocr_options = RapidOcrOptions(
        det_model_path="path/to/det.onnx",
        rec_model_path="path/to/rec.onnx",
        cls_model_path="path/to/cls.onnx",
        backend="onnxruntime",  # or "torch" if you have Torch models
    )
    
  • Set force_full_page_ocr=True and both bitmap_area_threshold=0 and picture_area_threshold=0 in PdfPipelineOptions to ensure OCR runs on all images:
    pipeline_options.force_full_page_ocr = True
    pipeline_options.bitmap_area_threshold = 0
    pipeline_options.picture_area_threshold = 0
    
  • For better compatibility with problematic PDFs, set the PDF backend to 'pypdfium2':
    pipeline_options.pdf_backend = 'pypdfium2'
    
  • If you must use OpenVINO, try updating to the latest models and check if RapidOCR exposes any segmentation/whitespace flags you can pass via rapidocr_params.

If you still see merged words, check the logs for errors or warnings during extraction for clues about misconfiguration or missing dependencies.

Docling does not perform any extra whitespace post-processing; word separation is determined by the OCR engine and model used. Using the recommended settings above should resolve or significantly reduce the word-merging problem.

To reply, just mention @dosu.



dosubot[bot] avatar Nov 22 '25 20:11 dosubot[bot]

@dosu the backend option didn't work. got an error: ValueError: "PdfPipelineOptions" object has no field "pdf_backend"

hisan-ideamaker avatar Nov 22 '25 20:11 hisan-ideamaker

The error happens because pdf_backend is not a valid field in PdfPipelineOptions for Docling v2.62.0. To set the PDF backend (like pypdfium2), you need to pass the backend class to the PdfFormatOption when you create your DocumentConverter, not as a field in PdfPipelineOptions.

Here’s how you can do it:

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend  # adjust import if needed

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = RapidOcrOptions(backend="openvino")
pipeline_options.force_full_page_ocr = True
pipeline_options.bitmap_area_threshold = 0
pipeline_options.picture_area_threshold = 0

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_options=pipeline_options,
            backend=PyPdfiumDocumentBackend  # <-- set backend here
        )
    }
)

For best OCR results on scanned PDFs, keep do_ocr=True, set force_full_page_ocr=True, and set both area thresholds to zero. If you still see merged words, try switching the RapidOCR backend to "torch" (if you have Torch models), or explicitly set the model paths in RapidOcrOptions.

Let me know if you need help with the import path for PyPdfiumDocumentBackend or with model selection for RapidOCR.


dosubot[bot] avatar Nov 22 '25 20:11 dosubot[bot]

I tried all of that. None of it worked. @dosu

hisan-ideamaker avatar Nov 22 '25 21:11 hisan-ideamaker

This is a tough one: your experience matches a known limitation. Docling takes the OCR output verbatim, and RapidOCR (especially with the ONNX/OpenVINO backends) sometimes merges words due to poor segmentation. There's no built-in whitespace restoration or post-processing in Docling, so tweaking backends, models, and area thresholds only helps if the underlying OCR engine segments words correctly, which in your case it doesn't with any config you've tried.

You have two practical options:

  1. Try SuryaOCR as an alternative backend. It was recently integrated and may handle whitespace/segmentation better for your PDFs. You’ll need to install the plugin (pip install docling-surya) and use it like this:

    from docling_surya import SuryaOcrOptions
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    pipeline_options = PdfPipelineOptions(
        do_ocr=True,
        ocr_model="suryaocr",
        allow_external_plugins=True,
        ocr_options=SuryaOcrOptions(lang=["en"]),
    )
    # ... rest of your DocumentConverter setup
    


  2. Add a custom post-processing step to restore whitespace after extraction. Since Docling doesn't do this for you, you can process the Markdown/text output with a simple heuristic (e.g., a word segmentation model or regular expressions). If you want to integrate this into Docling's pipeline, you can subclass BaseOcrModel and register your own OCR engine that wraps RapidOCR and applies your fix.

If you want a code template for post-processing or help wiring up SuryaOCR, let me know. If you try SuryaOCR or another engine and still see the same issue, it’s likely something about the PDF images themselves, and a custom fix will be needed.
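To illustrate the post-processing route in option 2, here is a minimal sketch of a dictionary-based word segmenter using dynamic programming. The tiny VOCAB below is an illustrative assumption; a real fix would load a large frequency-ranked word list (for example from the wordsegment or wordninja packages) and score splits by word frequency rather than by unmatched characters.

```python
# Minimal dictionary-based word segmentation to restore lost spaces.
# The tiny VOCAB is illustrative only; a real implementation would load
# a large frequency-ranked word list (e.g. from the wordsegment package).

VOCAB = {
    "notes", "to", "consolidated", "financial", "statements",
    "these", "matters", "will", "not", "results", "of", "operations",
}
MAX_WORD_LEN = max(len(w) for w in VOCAB)

def restore_spaces(text: str) -> str:
    """Split a run of characters into dictionary words via dynamic programming.

    best[i] tracks (fewest unmatched characters, fewest words) needed to
    segment text[:i]; unknown substrings pass through as single chunks,
    so OCR typos, punctuation, and digits survive unsplit.
    """
    n = len(text)
    # (unmatched_chars, word_count, previous_split_index)
    best = [(0, 0, 0)] + [(n + 1, n + 1, 0)] * n
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_WORD_LEN), i):
            piece = text[j:i].lower()
            penalty = 0 if piece in VOCAB else len(piece)
            cand = (best[j][0] + penalty, best[j][1] + 1, j)
            if cand[:2] < best[i][:2]:
                best[i] = cand
    # Walk back through the chosen split points.
    words, i = [], n
    while i > 0:
        j = best[i][2]
        words.append(text[j:i])
        i = j
    return " ".join(reversed(words))
```

Because unknown chunks are kept whole, an OCR typo like `resutts` still comes out as one token (`resutts of operations`) instead of being shredded into letters.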


dosubot[bot] avatar Nov 22 '25 21:11 dosubot[bot]

@dolfim-ibm i'm gonna cc you, just in case it got lost in between issues :)

hisan-ideamaker avatar Nov 22 '25 21:11 hisan-ideamaker

@dosu I tried the same PDF with the official RapidOCR repo (not via Docling), and it had no such issues. So the issue clearly has something to do with Docling...

hisan-ideamaker avatar Nov 22 '25 22:11 hisan-ideamaker

You're right that this is Docling-specific. The root cause is that Docling preprocesses PDF pages before passing them to RapidOCR: it crops and rescales images (using a 1.5× render-then-resize step), merges bitmap regions, and then hands those crops to RapidOCR. This preprocessing can change how text lines are segmented, which sometimes causes word merging or loss of whitespace, even if the same PDF works fine with standalone RapidOCR where you control the image input directly. Docling also does not post-process OCR output to restore whitespace; it takes the OCR results as-is, so any segmentation issues in the crops will show up in your output.

To debug, you can enable Docling's debug image output to compare the actual crops it sends to RapidOCR with what you use in your standalone script. If you want a workaround, you can post-process the OCR output with a word segmentation model or a simple regex to heuristically restore spaces. For a more robust fix, you could subclass Docling's OCR model and add your own post-processing step, or use the plugin system to register a custom OCR engine; see the plugin docs: https://docling-project.github.io/docling/concepts/plugins/
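As a concrete example of the regex route, the sketch below inserts spaces only at case, digit, and punctuation boundaries (the kind visible in your sample output, such as `Note23:ContingentLiabilities`). It is a rough stopgap: it cannot separate two merged lowercase words, so it complements rather than replaces proper OCR segmentation.

```python
import re

def split_obvious_boundaries(text: str) -> str:
    """Insert spaces at boundaries that are almost always word breaks in
    merged OCR output. Purely heuristic: it cannot separate two merged
    lowercase words, and it may over-split abbreviations like U.S.
    """
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)      # ContingentLiabilities -> Contingent Liabilities
    text = re.sub(r"(?<=[A-Za-z])(?=\d)", " ", text)      # Note23 -> Note 23
    text = re.sub(r"(?<=[.,:;])(?=[A-Za-z])", " ", text)  # 2003.In -> 2003. In
    return text
```

Note the limitation: an all-lowercase merge like `thisistheparsed` passes through unchanged, which is where the dictionary-based approach is needed.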

If you want help with a post-processing script or want to share your standalone RapidOCR code for comparison, let me know!


dosubot[bot] avatar Nov 22 '25 22:11 dosubot[bot]

@cau-git @PeterStaar-IBM (mentioning the maintainers, because I didn't hear back from anyone yet): I have tried all the solutions listed, and it still won't work. I have this issue in production and want to get it solved soon. Right now it's not working on any of the scanned documents; most of the words are merged together.

Big thanks in advance.

hisan-ideamaker avatar Nov 28 '25 12:11 hisan-ideamaker

@hisan-ideamaker it is likely a limitation of RapidOCR's performance on English/Latin material with the PP-OCR v5 models. You have the choice of going back to EasyOCR, which was the previous default. Neither EasyOCR nor RapidOCR will affect layout detection performance, since that is an entirely different model's responsibility. The only thing Docling takes from OCR engines is word-level or line-level text cells. It also does not change if you use a different PDF backend, since PDF backends only provide programmatic text layers and OCR complements them.
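For anyone wanting to try the fallback suggested above, swapping RapidOCR for EasyOCR is a small change to the pipeline options. A minimal sketch, assuming Docling v2's import path for EasyOcrOptions:

```python
from docling.datamodel.pipeline_options import PdfPipelineOptions, EasyOcrOptions

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
# EasyOCR was Docling's previous default OCR engine; per the comment above,
# layout detection is handled by a separate model and is unaffected by this swap.
pipeline_options.ocr_options = EasyOcrOptions(lang=["en"])
```

The rest of the DocumentConverter setup stays the same as in the original repro script.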

cau-git avatar Dec 02 '25 14:12 cau-git

@cau-git I ran the exact same OCR models without Docling and ran into no issues; the output is fine, with no spacing problems.

hisan-ideamaker avatar Dec 04 '25 21:12 hisan-ideamaker