RapidOCR causes merging of text while parsing
Bug
While extracting text from several PDFs using Docling, most words are merged into a single entity without spaces. ...
Steps to reproduce
Extracted text:
Docling version
2.32.0 ...
Python version
3.11.9 ...
Hi @vishaldasnewtide, to allow proper reproduction please provide the original input file and conversion settings.
Hi @cau-git, Sure
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    PdfPipelineOptions,
    RapidOcrOptions,
)
from docling.document_converter import (
    DocumentConverter as DoclingConverter,
    ImageFormatOption,
    PdfFormatOption,
    WordFormatOption,
)
from docling.pipeline.simple_pipeline import SimplePipeline
from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline


def _initialize_converter(self) -> None:
    """Initialize the document converter with appropriate settings."""
    pdf_pipeline_options = None
    if self.optimize_pdf:
        # Configure RapidOCR as the OCR engine
        pdf_pipeline_options = PdfPipelineOptions(ocr_options=RapidOcrOptions())
        pdf_pipeline_options.do_ocr = self.enable_ocr
        pdf_pipeline_options.do_table_structure = True
        pdf_pipeline_options.table_structure_options.do_cell_matching = True
        # Accelerator settings (AcceleratorDevice.AUTO picks CUDA when available)
        pdf_pipeline_options.accelerator_options = AcceleratorOptions(
            num_threads=self.pdf_threads,
            device=AcceleratorDevice.AUTO,
            cuda_use_flash_attention2=True,
        )
    # Create converter instance with separate PDF and image options
    self.converter = DoclingConverter(
        allowed_formats=[
            InputFormat.PDF,
            InputFormat.IMAGE,
            InputFormat.DOCX,
            InputFormat.HTML,
            InputFormat.PPTX,
            InputFormat.ASCIIDOC,
            InputFormat.CSV,
            InputFormat.MD,
            InputFormat.XLSX,
        ],
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=StandardPdfPipeline,
                backend=PyPdfiumDocumentBackend,
                pipeline_options=pdf_pipeline_options,
            ),
            InputFormat.DOCX: WordFormatOption(
                pipeline_cls=SimplePipeline,
            ),
            InputFormat.IMAGE: ImageFormatOption(
                pipeline_cls=StandardPdfPipeline,
                backend=PyPdfiumDocumentBackend,
                pipeline_options=pdf_pipeline_options,
            ),
        },
    )
@cau-git Now I am facing this issue:

File "C:\Users\Admin\AppData\Local\pypoetry\Cache\virtualenvs\risingtidevectorsearch-ryepBKyc-py3.11\Lib\site-packages\docling_ibm_models\layoutmodel\layout_predictor.py", line 92, in __init__
    ).to(self._device)
File "C:\Users\Admin\AppData\Local\pypoetry\Cache\virtualenvs\risingtidevectorsearch-ryepBKyc-py3.11\Lib\site-packages\transformers\modeling_utils.py", line 3850, in to
    return super().to(*args, **kwargs)
File "C:\Users\Admin\AppData\Local\pypoetry\Cache\virtualenvs\risingtidevectorsearch-ryepBKyc-py3.11\Lib\site-packages\torch\nn\modules\module.py", line 1343, in to
    return self._apply(convert)
File "C:\Users\Admin\AppData\Local\pypoetry\Cache\virtualenvs\risingtidevectorsearch-ryepBKyc-py3.11\Lib\site-packages\torch\nn\modules\module.py", line 903, in _apply
    module._apply(fn)
File "C:\Users\Admin\AppData\Local\pypoetry\Cache\virtualenvs\risingtidevectorsearch-ryepBKyc-py3.11\Lib\site-packages\torch\nn\modules\module.py", line 903, in _apply
    module._apply(fn)
File "C:\Users\Admin\AppData\Local\pypoetry\Cache\virtualenvs\risingtidevectorsearch-ryepBKyc-py3.11\Lib\site-packages\torch\nn\modules\module.py", line 903, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
File "C:\Users\Admin\AppData\Local\pypoetry\Cache\virtualenvs\risingtidevectorsearch-ryepBKyc-py3.11\Lib\site-packages\torch\nn\modules\module.py", line 930, in _apply
    param_applied = fn(param)
File "C:\Users\Admin\AppData\Local\pypoetry\Cache\virtualenvs\risingtidevectorsearch-ryepBKyc-py3.11\Lib\site-packages\torch\nn\modules\module.py", line 1336, in convert
    raise NotImplementedError(
NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.
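For context, that NotImplementedError is PyTorch's generic error when a module whose parameters live on the "meta" device (shapes only, no storage) is moved with .to(). A minimal, Docling-independent sketch of the failure and the to_empty() alternative the message suggests (assuming only a recent PyTorch install):

```python
import torch

# Parameters created on the "meta" device have shapes but no data.
m = torch.nn.Linear(2, 2, device="meta")

try:
    m.to("cpu")  # fails: there is no data to copy off the meta device
except NotImplementedError as e:
    print("to() failed:", e)

# to_empty() allocates fresh, uninitialized storage on the target device;
# real weights must then be restored, e.g. via load_state_dict().
m = m.to_empty(device="cpu")
print(m.weight.shape)
```

In Docling's case this happens inside the layout model's own loading code, so it usually points at a version mismatch between docling-ibm-models, transformers, and torch rather than something fixable in user code.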
I am facing the exact same issue with RapidOCR. The words are merged together without any space between them. Haven't been able to figure out a fix yet.
@cau-git Any update on this?
Several members of my team also noticed this issue, with a more recent Docling version (2.36.1). We haven't yet tested with the latest release (2.51.0 as of this writing).
Have you tested the new 3.x version with the V5 models?
@geoHeil Well, it was resolved in the earlier versions. The issue was specific to the RapidOCR V4 models. Now that V5 has arrived, the issue still exists for the default configuration, unless we specify the detection model with the ONNX path.
I've got the same issue. Did anyone find a proper solution or even a workaround?
@dosu, @dosubot do you have any solution?
Yes, there are a few workarounds for the RapidOCR word merging issue in Docling:
- The problem is often caused by the ONNX backend (especially on GPU) producing poor segmentation, which merges words together. You can fix this by explicitly specifying the detection, recognition, and classification model paths using ONNX files in your RapidOcrOptions. This forces Docling to use the correct models and can resolve the word merging issue. See the example below for how to set this up in code:
from docling.datamodel.pipeline_options import PdfPipelineOptions, RapidOcrOptions

ocr_options = RapidOcrOptions(
    det_model_path="path/to/your/det.onnx",
    rec_model_path="path/to/your/rec.onnx",
    cls_model_path="path/to/your/cls.onnx",
    # Optionally, try backend="torch" if you have the Torch models
)

pipeline_options = PdfPipelineOptions(
    ocr_options=ocr_options,
    force_full_page_ocr=True,  # This can help with extraction issues
    bitmap_area_threshold=0,
    picture_area_threshold=0,
)
You can download the latest RapidOCR ONNX models from Hugging Face or ModelScope and point to them as shown above. This approach is documented in the official example.
- Some users report that switching the backend to "torch" (if you have the Torch models) or using EasyOCR instead of RapidOCR also resolves the issue, as EasyOCR defaults to Torch and does not have this merging problem.
- Setting `force_full_page_ocr=True` and both `bitmap_area_threshold` and `picture_area_threshold` to 0 in your `PdfPipelineOptions` can help ensure OCR runs on all images and may improve results.
- The issue is reportedly fixed in Paddle 3.4 with V5 models and RapidOCR, so using those models may also help.
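The EasyOCR fallback from the list above can be sketched as a minimal configuration (a sketch only; import paths follow the current Docling API and may differ across versions):

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# EasyOCR runs on Torch by default, sidestepping the ONNX segmentation
# behavior that merges words together.
pipeline_options = PdfPipelineOptions(
    do_ocr=True,
    ocr_options=EasyOcrOptions(),
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
    }
)
# result = converter.convert("document.pdf")
```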
Let me know if you need a more detailed code example or help finding the right model files.
@dosu I have the following questions:
- What's the difference between using Torch vs ONNX models? Answer for both GPU and CPU.
- Could you give me the precise model versions that would solve the problem?
- I used ONNX; if I switch to Torch, how would that help? Wouldn't it make the system slow? (Again, answer for both GPU and CPU.)

Give me the code with the best settings for GPU and CPU (separately).
Big thanks
what model version do you use? the new one or the old one?
@geoHeil I'm using the new ones, v5 for rapidocr.