Large PDF files - getting stuck
Hello All,
Is there any solution for processing large PDF files? Big files appear to get stuck (in reality the processing just takes an extremely long time), and the processing time seems to grow exponentially as the file size increases.
Since the Celery worker is still processing, just very slowly, it stays occupied with the task even after the timeout is triggered, so no worker is available for the next request.
Has anyone solved this issue?
Thanks
same issue
Can you give me some more details?
- OS
- Device you're using for the models (CPU, MPS, GPU)
- RAM available
- how many pages the PDF is (please share if you can)
Also anything you noticed around CPU usage.
Memory Issue with PDF to Markdown Conversion
Hi,
I’m experiencing the same memory issue. Here are my system details:
- OS: Ubuntu
- CPU: 16 vCore @ 2.3 GHz
- RAM: 32 GB
- Instance: OVHCloud c3-32
Command Used
marker tempfile --workers 1 --output_dir markdown --output_format markdown --disable_image_extraction --languages "fr"
Memory Usage After Processing 280 PDFs
free -h
               total        used        free      shared  buff/cache   available
Mem:            30Gi        27Gi       3.0Gi       4.1Gi       4.4Gi       2.9Gi
I need to process 400 PDFs (max 15 pages per PDF), but there seems to be a major memory leak. The issue persists even when using my Python script.
Python Script Extract
from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

def convert_pdf_to_markdown(pdf_path, source_url):
    config = {
        "output_format": "markdown",
        "languages": "fr",
        "disable_image_extraction": True,
    }
    config_parser = ConfigParser(config)
    converter = PdfConverter(
        config=config_parser.generate_config_dict(),
        artifact_dict=create_model_dict(),  # loads the models on every call
        processor_list=config_parser.get_processors(),
        renderer=config_parser.get_renderer(),
    )
    rendered = converter(pdf_path)
    text, _, _ = text_from_rendered(rendered)
    return text
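For reference, the surrounding batch loop is essentially this (simplified sketch; directory names are illustrative):

from pathlib import Path

for pdf_path in sorted(Path("pdfs").glob("*.pdf")):   # ~400 files, max 15 pages each
    text = convert_pdf_to_markdown(str(pdf_path), source_url=None)
    Path("markdown", pdf_path.stem + ".md").write_text(text)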
Issue
After processing 4-5 PDFs, the script crashes due to an out-of-memory error, even though the machine has sufficient resources. This suggests a memory leak in the conversion process.
Any insights would be greatly appreciated!
I can send the archive privately if needed (190 MB raw PDF)
Thanks
Will look into it this week
@VikParuchuri, I've encountered a similar issue with a single large PDF file (over 150 pages) using the CPU configuration and default Gemini Flash model. Also, could we add a discussion section to the GitHub repo? It would be helpful for asking questions and sharing recommendations, like lightweight Ollama models compatible with Marker. Many would find this useful. Thanks for your support!
The multiple file conversion issue is fixed if you set maxtasksperchild=1 in convert.py:
with mp.Pool(processes=total_processes, initializer=worker_init, initargs=(model_dict,), maxtasksperchild=1) as pool:
I'll push in the next release
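The key is maxtasksperchild=1: the pool replaces each worker process after it finishes a single task, so whatever memory a conversion leaks is returned to the OS when that child exits. A self-contained sketch of the pattern (the names here are illustrative, not marker's internals):

import multiprocessing as mp

def worker_init(model_dict):
    # store the shared models once per worker process
    global models
    models = model_dict

def process_pdf(path):
    # placeholder for the per-file conversion that would use the global models
    return path

if __name__ == "__main__":
    model_dict = {}          # stand-in for the real model dict
    pdf_paths = ["a.pdf", "b.pdf", "c.pdf"]
    with mp.Pool(
        processes=2,
        initializer=worker_init,
        initargs=(model_dict,),
        maxtasksperchild=1,  # fresh worker per task, so leaked memory is released on exit
    ) as pool:
        pool.map(process_pdf, pdf_paths)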
On single long files, marker can use a lot of memory because it needs to render images for every page, and possibly because of pydantic. I need to look more into this.
OK, thanks. Also, please do consider opening up Discussions for this repository. Thanks for the help!
I keep getting crashes as well.
My logs from my marker wrapper:
2025-04-02 07:36:14,982 - INFO - startup - Starting Marker-PDF API Service v1.0.0
2025-04-02 07:36:14,982 - INFO - startup - Environment: production
2025-04-02 07:36:14,982 - INFO - startup - Python version: 3.13.2
2025-04-02 07:36:14,999 - INFO - startup - Platform: macOS-15.3.2-arm64-arm-64bit-Mach-O
2025-04-02 07:36:14,999 - INFO - startup - Processor: arm
2025-04-02 07:36:14,999 - INFO - startup - Total memory: 48.0GB
2025-04-02 07:36:14,999 - INFO - startup - Available memory: 11.5GB
2025-04-02 07:36:15,011 - INFO - startup - GPU Available: True
2025-04-02 07:36:15,012 - INFO - startup - MPS (Apple Silicon) available: True
2025-04-02 07:36:15,012 - INFO - startup - MPS built: True
2025-04-02 07:36:15,012 - INFO - startup - Using Apple Silicon GPU acceleration
2025-04-02 07:36:15,013 - INFO - startup - Using device: mps
2025-04-02 07:36:15,013 - INFO - startup - Worker count: 2
2025-04-02 07:36:15,013 - INFO - startup - Worker timeout: 240s
2025-04-02 07:36:15,013 - INFO - startup - Force CPU: False
2025-04-02 07:36:15,013 - INFO - startup - Explicit torch device: mps
...
2025-04-02 07:38:54,452 - INFO - processor - Processing PDF with marker_single (Device: mps)
2025-04-02 07:38:54,452 - INFO - processor - Using marker_single at: .../marker-service-venv/bin/marker_single
2025-04-02 07:38:54,453 - INFO - processor - Starting PDF processing: 1743575934438-2021.08 - OVH Group - Consolidated Financial Statements_0.pdf, Size: 0.8MB
2025-04-02 07:38:54,454 - INFO - processor - Memory at start: RSS: 111.4MB, %: 0.2%, System: 53.9%
2025-04-02 07:38:54,454 - INFO - processor - PDF saved to temp file: /var/folders/4d/6hgb8vxn5fn9vgp6zt67255m0000gn/T/tmpy6bx7d60/upload.pdf, Size: 0.8MB
2025-04-02 07:38:54,455 - INFO - processor - Memory at before_marker: RSS: 111.5MB, %: 0.2%, System: 53.9%
2025-04-02 07:38:54,455 - INFO - processor - Starting marker_single with command: .../marker-service-venv/bin/marker_single /var/folders/4d/6hgb8vxn5fn9vgp6zt67255m0000gn/T/tmpy6bx7d60/upload.pdf --output_dir /var/folders/4d/6hgb8vxn5fn9vgp6zt67255m0000gn/T/tmp4ufnxspl --output_format markdown --debug --languages en --extract_images true --force_ocr
2025-04-02 07:38:54,456 - INFO - processor - Memory at before_marker_exec: RSS: 111.5MB, %: 0.2%, System: 53.9%
and then within 3-4 minutes it crashes.
The PDF in question: https://corporate.ovhcloud.com/sites/default/files/2021-11/2021.08%20-%20OVH%20Group%20-%20Consolidated%20Financial%20Statements_0.pdf
PDFs with fewer pages work, and this same PDF works if I remove --force_ocr, but then the quality is bad, so I need the OCR for the tables.
Any ideas?
I might try splitting the pages into chunks and see if that works.
Facing a similar issue. Running on Colab with a T4 GPU.
!marker --output_format markdown --disable_image_extraction --extract_images False --paginate_output --page_separator '++++++++++' /content/jp --output_dir /content/output --workers 3
It got stuck on some PDF, which I'm guessing is a 185-page one, but when I ran that PDF with marker_single, it worked fine.
Note that the last success was at the 10th minute; the screenshot was taken at the 30th minute with no progress.
I have found that the process will silently terminate if the PDF contains more than 1,000 pages. I resolved this by processing the document in page ranges of 1,000 and concatenating the results.
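Roughly what that looks like, as a sketch against marker's Python API; the page_range config key, its "start-end" string format, and the pypdfium2 page count are assumptions to verify against your marker version:

import pypdfium2 as pdfium  # ships as a marker dependency
from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

CHUNK = 1000
models = create_model_dict()  # load the models once and reuse them for every chunk

def convert_in_chunks(pdf_path):
    num_pages = len(pdfium.PdfDocument(pdf_path))
    parts = []
    for start in range(0, num_pages, CHUNK):
        end = min(start + CHUNK, num_pages) - 1
        config_parser = ConfigParser({
            "output_format": "markdown",
            "page_range": f"{start}-{end}",  # e.g. "0-999", "1000-1999", ...
        })
        converter = PdfConverter(
            config=config_parser.generate_config_dict(),
            artifact_dict=models,
            processor_list=config_parser.get_processors(),
            renderer=config_parser.get_renderer(),
        )
        text, _, _ = text_from_rendered(converter(pdf_path))
        parts.append(text)
    return "\n\n".join(parts)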
Can you please give me the full Colab code? I have a Groq API key but I don't have a Gemini API key.