Large PDF files - getting stuck

Open rwilkhu opened this issue 1 year ago • 12 comments

Hello All,

Is there any solution for processing large PDF files? Big files seem to get stuck; in reality the processing just takes an extremely long time, and the processing time appears to grow roughly exponentially with file size.

Since the Celery worker is still processing, just very slowly, it stays occupied with the task even after the timeout is triggered, and so it is not available for the next request.
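
What I would want is something like the sketch below, where a hard time limit kills the stuck task so the worker is freed (a minimal sketch assuming Celery's prefork pool; the broker URL, the limits, and the run_marker wrapper are all illustrative, not from marker itself):

    from celery import Celery
    from celery.exceptions import SoftTimeLimitExceeded

    # Broker URL is illustrative.
    app = Celery("pdf_tasks", broker="redis://localhost:6379/0")

    # Limits are illustrative: the soft limit raises inside the task so it
    # can clean up; the hard limit kills the worker process 60s later,
    # which frees the slot for the next request.
    @app.task(bind=True, soft_time_limit=600, time_limit=660)
    def convert_pdf(self, pdf_path):
        try:
            return run_marker(pdf_path)  # hypothetical wrapper around marker
        except SoftTimeLimitExceeded:
            raise

Even with that, the underlying slowness remains, so a real fix would still be needed.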

Has anyone solved this issue?

Thanks

rwilkhu avatar Jan 15 '25 21:01 rwilkhu

same issue

huelsgp27 avatar Jan 20 '25 07:01 huelsgp27

Can you give me some more details?

  • OS
  • Device you're using for the models (CPU, MPS, GPU)
  • RAM available
  • How many pages the PDF is (please share the file if you can)

Also anything you noticed around CPU usage.
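
If it helps, here is a quick snippet to collect those details (a sketch; it assumes psutil and torch are installed):

    import platform
    import psutil
    import torch

    print("OS:", platform.platform())
    print("CPU cores:", psutil.cpu_count(logical=True))
    mem = psutil.virtual_memory()
    print("RAM: %.1f GiB total, %.1f GiB available"
          % (mem.total / 2**30, mem.available / 2**30))
    print("CUDA available:", torch.cuda.is_available())
    print("MPS available:", torch.backends.mps.is_available())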

VikParuchuri avatar Jan 24 '25 03:01 VikParuchuri

Memory Issue with PDF to Markdown Conversion

Hi,

I’m experiencing the same memory issue. Here are my system details:

  • OS: Ubuntu
  • CPU: 16 vCore @ 2.3 GHz
  • RAM: 32 GB
  • Instance: OVHCloud c3-32

Command Used

    marker tempfile --workers 1 --output_dir markdown --output_format markdown --disable_image_extraction --languages "fr"

Memory Usage After Processing 280 PDFs

    $ free -h
                   total        used        free      shared  buff/cache   available
    Mem:            30Gi        27Gi       3.0Gi       4.1Gi       4.4Gi        2.9Gi

I need to process 400 PDFs (max 15 pages per PDF), but there seems to be a major memory leak. The issue persists even when using my Python script.

Python Script Extract


from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered

def convert_pdf_to_markdown(pdf_path, source_url):
    config = {
        "output_format": "markdown",
        "languages": "fr",
        "disable_image_extraction": True,
    }

    config_parser = ConfigParser(config)
    converter = PdfConverter(
        config=config_parser.generate_config_dict(),
        artifact_dict=create_model_dict(),
        processor_list=config_parser.get_processors(),
        renderer=config_parser.get_renderer()
    )

    rendered = converter(pdf_path)
    text, _, _ = text_from_rendered(rendered)
    return text

Issue

After processing 4-5 PDFs, the script crashes due to an out-of-memory error, even though the machine has sufficient resources. This suggests a memory leak in the conversion process.
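
One workaround to test (a sketch reusing the convert_pdf_to_markdown function above): run each conversion in a throwaway child process, so whatever the converter leaks is returned to the OS when the child exits:

    import multiprocessing as mp

    def _worker(pdf_path, source_url, queue):
        # Models are loaded inside the child, so all conversion state
        # (including anything leaked) dies with the process.
        queue.put(convert_pdf_to_markdown(pdf_path, source_url))

    def convert_isolated(pdf_path, source_url=None):
        ctx = mp.get_context("spawn")
        queue = ctx.Queue()
        proc = ctx.Process(target=_worker, args=(pdf_path, source_url, queue))
        proc.start()
        text = queue.get()  # blocks until the child finishes the conversion
        proc.join()
        return text

The trade-off is that the models are reloaded for every PDF, so each conversion is slower, but memory should stay flat across the 400 files.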

Any insights would be greatly appreciated!

I can send the archive privately if needed (190 MB of raw PDFs).

Thanks

FazCodeFR avatar Feb 14 '25 16:02 FazCodeFR

Will look into it this week

VikParuchuri avatar Feb 16 '25 13:02 VikParuchuri

@VikParuchuri, I've encountered a similar issue with a single large PDF file (over 150 pages) using the CPU configuration and default Gemini Flash model. Also, could we add a discussion section to the GitHub repo? It would be helpful for asking questions and sharing recommendations, like lightweight Ollama models compatible with Marker. Many would find this useful. Thanks for your support!

TeomanEgeSelcuk avatar Feb 19 '25 01:02 TeomanEgeSelcuk

The multiple-file conversion issue is fixed if you set maxtasksperchild=1 in convert.py; each worker process is then recycled after a single task, so any leaked memory is released when the child exits:

    with mp.Pool(processes=total_processes, initializer=worker_init, initargs=(model_dict,), maxtasksperchild=1) as pool:

I'll push in the next release

VikParuchuri avatar Feb 19 '25 02:02 VikParuchuri

On single long files, marker can use a lot of memory because it needs to render images for every page, and possibly also due to pydantic. I need to look more into this.
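
Back-of-the-envelope arithmetic (the resolution is an assumption, not marker's actual setting): a page rendered as 24-bit RGB at roughly 2048x2896 px is about 17 MiB, so holding 1000 rendered pages in memory would take on the order of 16 GiB:

    # Rough memory estimate for keeping every rendered page image in RAM.
    # The resolution below is an assumption, not marker's actual setting.
    width, height = 2048, 2896           # roughly A4 aspect ratio
    bytes_per_page = width * height * 3  # 24-bit RGB
    pages = 1000
    print(f"{bytes_per_page / 2**20:.1f} MiB per page")                   # ~17.0 MiB
    print(f"{pages * bytes_per_page / 2**30:.1f} GiB for {pages} pages")  # ~16.6 GiB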

VikParuchuri avatar Feb 19 '25 02:02 VikParuchuri

OK, thanks. Also, please do consider opening up Discussions for this repository. Thanks for the help!

TeomanEgeSelcuk avatar Feb 19 '25 03:02 TeomanEgeSelcuk

I keep getting crashes as well.

My logs from my marker wrapper:

2025-04-02 07:36:14,982 - INFO - startup - Starting Marker-PDF API Service v1.0.0
2025-04-02 07:36:14,982 - INFO - startup - Environment: production
2025-04-02 07:36:14,982 - INFO - startup - Python version: 3.13.2
2025-04-02 07:36:14,999 - INFO - startup - Platform: macOS-15.3.2-arm64-arm-64bit-Mach-O
2025-04-02 07:36:14,999 - INFO - startup - Processor: arm
2025-04-02 07:36:14,999 - INFO - startup - Total memory: 48.0GB
2025-04-02 07:36:14,999 - INFO - startup - Available memory: 11.5GB
2025-04-02 07:36:15,011 - INFO - startup - GPU Available: True
2025-04-02 07:36:15,012 - INFO - startup - MPS (Apple Silicon) available: True
2025-04-02 07:36:15,012 - INFO - startup - MPS built: True
2025-04-02 07:36:15,012 - INFO - startup - Using Apple Silicon GPU acceleration
2025-04-02 07:36:15,013 - INFO - startup - Using device: mps
2025-04-02 07:36:15,013 - INFO - startup - Worker count: 2
2025-04-02 07:36:15,013 - INFO - startup - Worker timeout: 240s
2025-04-02 07:36:15,013 - INFO - startup - Force CPU: False
2025-04-02 07:36:15,013 - INFO - startup - Explicit torch device: mps
...

2025-04-02 07:38:54,452 - INFO - processor - Processing PDF with marker_single (Device: mps)
2025-04-02 07:38:54,452 - INFO - processor - Using marker_single at: .../marker-service-venv/bin/marker_single
2025-04-02 07:38:54,453 - INFO - processor - Starting PDF processing: 1743575934438-2021.08 - OVH Group - Consolidated Financial Statements_0.pdf, Size: 0.8MB
2025-04-02 07:38:54,454 - INFO - processor - Memory at start: RSS: 111.4MB, %: 0.2%, System: 53.9%
2025-04-02 07:38:54,454 - INFO - processor - PDF saved to temp file: /var/folders/4d/6hgb8vxn5fn9vgp6zt67255m0000gn/T/tmpy6bx7d60/upload.pdf, Size: 0.8MB
2025-04-02 07:38:54,455 - INFO - processor - Memory at before_marker: RSS: 111.5MB, %: 0.2%, System: 53.9%
2025-04-02 07:38:54,455 - INFO - processor - Starting marker_single with command: .../marker-service-venv/bin/marker_single /var/folders/4d/6hgb8vxn5fn9vgp6zt67255m0000gn/T/tmpy6bx7d60/upload.pdf --output_dir /var/folders/4d/6hgb8vxn5fn9vgp6zt67255m0000gn/T/tmp4ufnxspl --output_format markdown --debug --languages en --extract_images true --force_ocr
2025-04-02 07:38:54,456 - INFO - processor - Memory at before_marker_exec: RSS: 111.5MB, %: 0.2%, System: 53.9%

And then within 3-4 minutes it crashes.

The PDF in question: https://corporate.ovhcloud.com/sites/default/files/2021-11/2021.08%20-%20OVH%20Group%20-%20Consolidated%20Financial%20Statements_0.pdf

For PDFs with fewer pages it works. The same PDF also works if I remove --force_ocr, but then the quality is bad, so I need the OCR for the tables.

Any ideas?

I might try to split the pages into chunks and see if that works; something like the sketch below.
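
A sketch of that splitting step (it assumes pypdf; the chunk size and file names are illustrative), producing smaller PDFs that marker_single can then process one at a time:

    from pathlib import Path
    from pypdf import PdfReader, PdfWriter

    def split_pdf(pdf_path, chunk_size=25, out_dir="chunks"):
        # Write each run of chunk_size pages to its own file so that
        # marker_single only ever sees a small document.
        reader = PdfReader(pdf_path)
        Path(out_dir).mkdir(exist_ok=True)
        paths = []
        for start in range(0, len(reader.pages), chunk_size):
            writer = PdfWriter()
            for page in reader.pages[start:start + chunk_size]:
                writer.add_page(page)
            out_path = Path(out_dir) / f"chunk_{start // chunk_size:03d}.pdf"
            with open(out_path, "wb") as f:
                writer.write(f)
            paths.append(out_path)
        return paths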

AntouanK avatar Apr 02 '25 06:04 AntouanK

Facing a similar issue. Running on Colab with a T4 GPU:

    !marker --output_format markdown --disable_image_extraction --extract_images False --paginate_output --page_separator '++++++++++' /content/jp --output_dir /content/output --workers 3

It got stuck on some PDF, which I am guessing is a 185-page one, but when I ran that PDF with marker_single on its own, it worked fine.

[Screenshot: conversion progress log]

Notice that the last success was at the 10-minute mark; the screenshot was taken at the 30-minute mark with no further progress.

dilshans2k avatar Apr 14 '25 10:04 dilshans2k

I have found that the process will silently terminate if the PDF contains more than 1000 pages. I resolved this by restricting each run to a page range of at most 1000 pages and then concatenating the results.
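
A sketch of that range-and-concatenate approach (it assumes marker's page_range config option and uses pypdf only to count pages; the chunk size follows the 1000-page observation above):

    from pypdf import PdfReader
    from marker.config.parser import ConfigParser
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict
    from marker.output import text_from_rendered

    CHUNK = 1000  # pages per run, per the observation above

    def convert_in_chunks(pdf_path):
        num_pages = len(PdfReader(pdf_path).pages)
        models = create_model_dict()  # load the models once and reuse them
        parts = []
        for start in range(0, num_pages, CHUNK):
            end = min(start + CHUNK, num_pages) - 1
            parser = ConfigParser({"page_range": f"{start}-{end}"})
            converter = PdfConverter(
                config=parser.generate_config_dict(),
                artifact_dict=models,
            )
            text, _, _ = text_from_rendered(converter(pdf_path))
            parts.append(text)
        return "\n\n".join(parts)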

Can you please give me the full Colab code? I have a Groq API key; I don't have a Gemini API key.

ankit8347 avatar Nov 10 '25 09:11 ankit8347