
RapidOcr causes merging of text while parsing.

Open vishaldasnewtide opened this issue 7 months ago • 14 comments

Bug

While extracting text from several PDFs using Docling, most words are merged into a single string without spaces. ...
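A quick way to detect this failure mode programmatically is to check the space density of the extracted text, which collapses toward zero when words are merged. This is a hypothetical helper for illustration, not part of Docling:

```python
def looks_merged(text: str, min_space_ratio: float = 0.05) -> bool:
    """Heuristic: flag OCR output whose space-to-letter ratio is
    abnormally low for prose (a symptom of the word-merging bug)."""
    letters = sum(ch.isalpha() for ch in text)
    if letters < 40:  # too short to judge reliably
        return False
    return text.count(" ") / letters < min_space_ratio

print(looks_merged("Thequickbrownfoxjumpsoverthelazydogandkeepsrunningdownhill"))  # True
print(looks_merged("The quick brown fox jumps over the lazy dog and keeps running"))  # False
```

The 0.05 threshold is a guess; normal English prose sits around 0.2 spaces per letter.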

Steps to reproduce

Image

Extracted text:

Image

Docling version

2.32.0 ...

Python version

3.11.9 ...

vishaldasnewtide avatar May 22 '25 07:05 vishaldasnewtide

Hi @vishaldasnewtide, to allow proper reproduction please provide the original input file and conversion settings.

cau-git avatar May 22 '25 17:05 cau-git

AlexandriaObol.pdf

Hi @cau-git, sure:

# Imports added for completeness (module paths as of Docling 2.x; they may
# shift slightly between releases)
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
    RapidOcrOptions,
)
from docling.document_converter import (
    DocumentConverter as DoclingConverter,
    ImageFormatOption,
    PdfFormatOption,
    WordFormatOption,
)
from docling.pipeline.simple_pipeline import SimplePipeline
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline


def _initialize_converter(self) -> None:
    """Initialize the document converter with appropriate settings."""
    pdf_pipeline_options = None
    if self.optimize_pdf:
        # Configure RapidOCR as the OCR engine
        pdf_pipeline_options = PdfPipelineOptions(ocr_options=RapidOcrOptions())
        pdf_pipeline_options.do_ocr = self.enable_ocr
        pdf_pipeline_options.do_table_structure = True
        pdf_pipeline_options.table_structure_options.do_cell_matching = True

        # AUTO picks CUDA when available and falls back to CPU otherwise
        pdf_pipeline_options.accelerator_options = AcceleratorOptions(
            num_threads=self.pdf_threads,
            device=AcceleratorDevice.AUTO,
            cuda_use_flash_attention2=True,
        )

    # Create converter instance with separate PDF and image options
    # (pipeline_options=None falls back to Docling's defaults)
    self.converter = DoclingConverter(
        allowed_formats=[
            InputFormat.PDF,
            InputFormat.IMAGE,
            InputFormat.DOCX,
            InputFormat.HTML,
            InputFormat.PPTX,
            InputFormat.ASCIIDOC,
            InputFormat.CSV,
            InputFormat.MD,
            InputFormat.XLSX,
        ],
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=StandardPdfPipeline,
                backend=PyPdfiumDocumentBackend,
                pipeline_options=pdf_pipeline_options,
            ),
            InputFormat.DOCX: WordFormatOption(
                pipeline_cls=SimplePipeline,
            ),
            InputFormat.IMAGE: ImageFormatOption(
                pipeline_cls=StandardPdfPipeline,
                backend=PyPdfiumDocumentBackend,
                pipeline_options=pdf_pipeline_options,
            ),
        },
    )

vishaldasnewtide avatar May 23 '25 05:05 vishaldasnewtide

@cau-git Now I am facing this issue:

File "C:\Users\Admin\AppData\Local\pypoetry\Cache\virtualenvs\risingtidevectorsearch-ryepBKyc-py3.11\Lib\site-packages\docling_ibm_models\layoutmodel\layout_predictor.py", line 92, in __init__
    ).to(self._device)
File "C:\Users\Admin\AppData\Local\pypoetry\Cache\virtualenvs\risingtidevectorsearch-ryepBKyc-py3.11\Lib\site-packages\transformers\modeling_utils.py", line 3850, in to
    return super().to(*args, **kwargs)
File "C:\Users\Admin\AppData\Local\pypoetry\Cache\virtualenvs\risingtidevectorsearch-ryepBKyc-py3.11\Lib\site-packages\torch\nn\modules\module.py", line 1343, in to
    return self._apply(convert)
File "C:\Users\Admin\AppData\Local\pypoetry\Cache\virtualenvs\risingtidevectorsearch-ryepBKyc-py3.11\Lib\site-packages\torch\nn\modules\module.py", line 903, in _apply
    module._apply(fn)
File "C:\Users\Admin\AppData\Local\pypoetry\Cache\virtualenvs\risingtidevectorsearch-ryepBKyc-py3.11\Lib\site-packages\torch\nn\modules\module.py", line 903, in _apply
    module._apply(fn)
File "C:\Users\Admin\AppData\Local\pypoetry\Cache\virtualenvs\risingtidevectorsearch-ryepBKyc-py3.11\Lib\site-packages\torch\nn\modules\module.py", line 903, in _apply
    module._apply(fn)
[Previous line repeated 4 more times]
File "C:\Users\Admin\AppData\Local\pypoetry\Cache\virtualenvs\risingtidevectorsearch-ryepBKyc-py3.11\Lib\site-packages\torch\nn\modules\module.py", line 930, in _apply
    param_applied = fn(param)
File "C:\Users\Admin\AppData\Local\pypoetry\Cache\virtualenvs\risingtidevectorsearch-ryepBKyc-py3.11\Lib\site-packages\torch\nn\modules\module.py", line 1336, in convert
    raise NotImplementedError(
NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.

vishaldasnewtide avatar May 23 '25 07:05 vishaldasnewtide
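Editor's note: the NotImplementedError in the traceback above usually means model weights were materialized on PyTorch's meta device and then moved with `.to()`, which can happen when the accelerator device is mis-detected. A common first step is to pin the device explicitly (e.g. `AcceleratorDevice.CPU` or `AcceleratorDevice.CUDA`) instead of `AUTO`. The fallback logic involved can be sketched with a hypothetical helper (this is not Docling's actual implementation):

```python
def resolve_device(requested: str, cuda_available: bool) -> str:
    """Map a requested accelerator to a concrete torch device string,
    mirroring the intent of AcceleratorDevice.AUTO."""
    if requested == "auto":
        # AUTO degrades gracefully to CPU when no CUDA device exists
        return "cuda" if cuda_available else "cpu"
    if requested == "cuda" and not cuda_available:
        raise RuntimeError("CUDA requested but no CUDA device is available")
    return requested

print(resolve_device("auto", cuda_available=False))  # cpu
print(resolve_device("auto", cuda_available=True))   # cuda
```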

I am facing the exact same issue with RapidOCR. The words are merged together without any space between them. Haven't been able to figure out a fix yet.

vikasr111 avatar Jun 24 '25 03:06 vikasr111

@cau-git Any update on this?

vishaldasnewtide avatar Jun 24 '25 04:06 vishaldasnewtide

Several members of my team also noticed this issue, with a more recent Docling version (2.36.1). We haven't yet tested with the latest release (2.51.0 as of this writing).

anovstrup avatar Sep 05 '25 22:09 anovstrup

Have you tested the new 3.x version with the V5 models?

geoHeil avatar Sep 06 '25 09:09 geoHeil

@geoHeil It was resolved in earlier versions; the issue was specific to the RapidOCR V4 models. Now that V5 has arrived, it still occurs with the default configuration unless the detection model is specified via an ONNX path.

vishaldasnewtide avatar Sep 08 '25 06:09 vishaldasnewtide

I've got the same issue. Did anyone find a proper solution, or even a workaround?

hisan-ideamaker avatar Nov 19 '25 10:11 hisan-ideamaker

@dosu @dosubot, do you have any solution?

hisan-ideamaker avatar Nov 19 '25 10:11 hisan-ideamaker

Yes, there are a few workarounds for the RapidOCR word merging issue in Docling:

  • The problem is often caused by the ONNX backend (especially on GPU) producing poor segmentation, which merges words together. You can fix this by explicitly specifying the detection, recognition, and classification model paths using ONNX files in your RapidOcrOptions. This forces Docling to use the correct models and can resolve the word merging issue. See the example below for how to set this up in code:
from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions

ocr_options = RapidOcrOptions(
    det_model_path="path/to/your/det.onnx",
    rec_model_path="path/to/your/rec.onnx",
    cls_model_path="path/to/your/cls.onnx",
    force_full_page_ocr=True,   # OCR the whole page; can help with extraction issues
    bitmap_area_threshold=0.0,  # OCR even the smallest bitmap regions
    # Optionally, try backend="torch" if you have the Torch models
)

# Note: force_full_page_ocr and bitmap_area_threshold are OCR-engine options,
# so they belong on RapidOcrOptions rather than on PdfPipelineOptions.
pipeline_options = PdfPipelineOptions(ocr_options=ocr_options)

You can download the latest RapidOCR ONNX models from Hugging Face or ModelScope and point to them as shown above. This approach is documented in the official example.

  • Some users report that switching the backend to "torch" (if you have the Torch models) or using EasyOCR instead of RapidOCR also resolves the issue, since EasyOCR defaults to Torch and does not exhibit the merging problem.

  • Setting force_full_page_ocr=True and bitmap_area_threshold=0 on your OCR options can help ensure OCR runs on all images and may improve results.

  • The issue is reportedly fixed in Paddle 3.4 with V5 models and RapidOCR, so using those models may also help.
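Once the ONNX files are downloaded, the paths for RapidOcrOptions can be resolved automatically. A hypothetical helper (not part of Docling or RapidOCR) that picks the detection, recognition, and classification models out of a download directory by filename keyword, demonstrated here with empty dummy files standing in for real models:

```python
import tempfile
from pathlib import Path

def find_rapidocr_models(model_dir: str) -> dict:
    """Return {'det': path, 'rec': path, 'cls': path} for the ONNX files
    in model_dir, matching each model role by filename keyword."""
    found = {}
    for onnx in Path(model_dir).glob("*.onnx"):
        for kind in ("det", "rec", "cls"):
            if kind in onnx.name.lower():
                found[kind] = str(onnx)
    missing = {"det", "rec", "cls"} - found.keys()
    if missing:
        raise FileNotFoundError(f"missing RapidOCR models: {sorted(missing)}")
    return found

# Demo with dummy files named like the published PP-OCR ONNX releases:
demo_dir = tempfile.mkdtemp()
for name in ("ch_PP-OCRv4_det_infer.onnx",
             "ch_PP-OCRv4_rec_infer.onnx",
             "ch_ppocr_mobile_v2.0_cls_infer.onnx"):
    Path(demo_dir, name).touch()
models = find_rapidocr_models(demo_dir)
print(sorted(models))  # ['cls', 'det', 'rec']
```

The resulting paths would then feed `det_model_path`, `rec_model_path`, and `cls_model_path` in the RapidOcrOptions snippet above.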

Let me know if you need a more detailed code example or help finding the right model files.

To reply, just mention @dosu.



dosubot[bot] avatar Nov 19 '25 10:11 dosubot[bot]

@dosu i have the following questions:

  1. What's the difference between using Torch vs ONNX models? Answer for both GPU and CPU.
  2. Could you give me the precise model versions that would solve the problem?
  3. I used ONNX; if I switch to Torch, how would that help? Wouldn't it make the system slow? (Again, answer for both GPU and CPU.)

Give me the code with the best settings for GPU and CPU (separately).

Big thanks

hisan-ideamaker avatar Nov 19 '25 10:11 hisan-ideamaker

what model version do you use? the new one or the old one?

geoHeil avatar Nov 19 '25 11:11 geoHeil

@geoHeil I'm using the new ones, v5 for rapidocr.

hisan-ideamaker avatar Nov 26 '25 11:11 hisan-ideamaker