Memory leak caused by EasyOCR
Bug
Hello. I have been experimenting with Docling for a while and am impressed by its performance. Everything runs well in my local environment. However, when I ran the same code in a container, CPU memory kept growing until the process ran out of memory and the container was killed. I traced the problem to a memory leak in EasyOCR's reader.readtext function. A similar issue, https://github.com/JaidedAI/EasyOCR/issues/815, was reported in EasyOCR's repo but remains unresolved.
Steps to reproduce
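The code I used is:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    AcceleratorDevice,
    AcceleratorOptions,
    EasyOcrOptions,
    PdfPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

# Enable OCR (EasyOCR, full page, Spanish) and table structure recognition
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
pipeline_options.ocr_options = EasyOcrOptions(force_full_page_ocr=True, lang=['es'])
pipeline_options.accelerator_options = AcceleratorOptions(
    num_threads=8, device=AcceleratorDevice.CPU
)
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
# input_doc_path is the path to the PDF being converted
conv_result = doc_converter.convert(input_doc_path)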
My local environment is macOS (M3) with 18 GB of RAM. My container environment is Linux with an 18 GB memory limit.
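To make the growth easier to see, the peak resident memory of the process can be sampled after each repeated call. The following is only a minimal sketch using the Python standard library (Unix only). The helper names `peak_rss_mib` and `track_peak_rss` are hypothetical, and the `leaky` workload is a stand-in for a call such as `doc_converter.convert(input_doc_path)`; it is not part of Docling or EasyOCR.

```python
import gc
import resource
import sys
from typing import Callable, List


def peak_rss_mib() -> float:
    """Return the peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def track_peak_rss(work: Callable[[], object], iterations: int) -> List[float]:
    """Call work() repeatedly and record the peak RSS after each call."""
    samples = []
    for _ in range(iterations):
        work()
        gc.collect()  # rule out garbage that Python could still reclaim
        samples.append(peak_rss_mib())
    return samples


if __name__ == "__main__":
    # Stand-in workload (assumption): retains 20 MiB per call to mimic a leak.
    # Replace with e.g. lambda: doc_converter.convert(input_doc_path).
    retained = []

    def leaky() -> None:
        retained.append(b"x" * (20 * 1024 * 1024))

    for i, mib in enumerate(track_peak_rss(leaky, 5), start=1):
        print(f"iteration {i}: peak RSS {mib:.1f} MiB")
```

If the peak keeps climbing by a similar amount on every iteration over the same document, that points to memory being retained across calls rather than to a one-time model load.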
Docling version
2.25.2
Python version
3.12
Did you try the latest version of docling?
@ColeDrain Yes. The problem persists with 2.28.4
Hello, I am noticing a similar issue. When processing a lot of documents, memory usage often climbs progressively and sometimes spikes suddenly, killing the container.
Same here. On macOS 15.5 with a clean install of the latest version, the EasyOCR engine causes a memory leak on a single 5-page scanned PDF.
Same issues here.