Images Missing in Markdown Output When Using `marker` Command for Multiple PDFs

Open Saketh-Chandra opened this issue 9 months ago • 2 comments

Description

When using the marker command to convert multiple PDFs to markdown in batch mode, the output markdown files include the extracted text but do not include images. However, when using the marker_single command to convert a single PDF, both text and images are included correctly in the output. This indicates a bug specific to the batch processing functionality of the marker command.

Environment

  • Operating System: Windows 11 Home Single Language
  • OS Version: 10.0.26100 Build 26100
  • System Model: ROG Strix G16 G614JIR
  • CPU: Intel(R) Core(TM) i9-14900HX (24 cores, 32 logical processors, 3.66 GHz)
  • GPU: NVIDIA GeForce RTX 4070 Laptop GPU
  • VRAM: 8.0 GB dedicated, 15.8 GB total
  • CUDA Version: 11.8 (as indicated by torch 2.6.0+cu118)
  • PyTorch Version: 2.6.0+cu118
  • Marker Version: v1.6.1

Steps to Reproduce

  1. Create a folder (e.g., input_folder) containing multiple PDFs with images (e.g., pdf1.pdf, pdf2.pdf).
  2. Run the marker command for batch conversion:
    marker --output_dir .\output_folder .\input_folder  --workers 4 
    
  3. Inspect the markdown files in output_folder. The text is present, but images are missing.
  4. For comparison, run the marker_single command on one of the PDFs:
    marker_single .\input_folder\pdf1.pdf --output_dir output_folder_single 
    
  5. The output from marker_single includes both text and images as expected.
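
Steps 3 and 5 can be checked without opening every file by hand. The helper below is a minimal sketch, assuming the converted images are referenced with standard Markdown image syntax (`![alt](target)`); `count_image_refs` is a hypothetical name, not part of marker.

```python
import re
from pathlib import Path

# Matches Markdown image syntax: ![alt](target) or ![alt](target "title")
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)[^)]*\)")


def count_image_refs(output_dir):
    """Map each .md file under output_dir (relative path) to its image-reference count."""
    root = Path(output_dir)
    counts = {}
    for md_file in sorted(root.rglob("*.md")):
        text = md_file.read_text(encoding="utf-8", errors="replace")
        rel_path = md_file.relative_to(root).as_posix()
        counts[rel_path] = len(IMAGE_REF.findall(text))
    return counts
```

Calling `count_image_refs("output_folder")` and `count_image_refs("output_folder_single")` and comparing the counts for the same PDF should make the difference between the two commands visible at a glance.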

Expected Behavior

  • The marker command should generate markdown files that include both text and images from the PDFs, consistent with the behavior of marker_single.

Actual Behavior

  • When using marker for batch processing, the markdown files contain only text, with no images included. In contrast, marker_single correctly includes both text and images when processing a single PDF.

Possible Cause

  • The issue might stem from how the marker command handles multiprocessing with CUDA-enabled systems. In the convert.py script, models are loaded in the main process and shared across worker processes, which may not properly support image extraction due to CUDA context requirements. The marker_single command, running in a single process, avoids this problem by loading and using models directly.

Additional Information

  • No error messages appear during the conversion; the process completes successfully but omits images.
  • The issue occurs consistently, regardless of the number of PDFs processed.
  • Hardware and software details (listed above) may help identify if this is specific to certain GPU or CUDA configurations.

Saketh-Chandra • Mar 14 '25 21:03

Not sure where the bug is yet, but I observe that it disappears when I modify marker/scripts/convert.py to stop using the global model_refs variable and build the model dict directly:

        converter = converter_cls(
            config=config_dict,
            # artifact_dict=model_refs,
            artifact_dict=create_model_dict(),
            processor_list=config_parser.get_processors(),
            renderer=config_parser.get_renderer(),
            llm_service=config_parser.get_llm_service()
        )

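Conclusion: sharing the model_dict across torch multiprocessing workers seems not to work on certain machines (possibly platform-dependent?)

Edit: this change fixed it for me (model_dict is now None in both branches, with the share_memory() calls commented out), and it also explains the observed platform dependence.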

    if settings.TORCH_DEVICE == "mps" or settings.TORCH_DEVICE_MODEL == "mps":
        model_dict = None
    else:
        model_dict = None
        # create_model_dict()
        # for k, v in model_dict.items():
        #     v.model.share_memory()
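
One way to read this workaround, assuming the worker code builds its own models whenever no shared dictionary is handed to it (not confirmed in this thread), is the lazy fallback below. It is a sketch with hypothetical names (`build_models`, `get_models`, `_shared_models`), not marker's implementation.

```python
import os

# What the parent process would hand down; None means sharing is disabled.
_shared_models = None
# Built on first use inside whichever process calls get_models().
_local_models = None


def build_models():
    """Hypothetical stand-in for an expensive model loader."""
    return {"pid": os.getpid()}


def get_models():
    """Prefer the shared models; otherwise build a private copy once per process."""
    global _local_models
    if _shared_models is not None:
        return _shared_models
    if _local_models is None:
        _local_models = build_models()
    return _local_models
```

With sharing disabled, the first call in each process pays the loading cost and later calls reuse the cached copy.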

conjuncts • Mar 25 '25 04:03

I see exactly the same behavior described by @Saketh-Chandra when using the marker CLI on multiple PDFs. The images are not processed and the final rendering is very bad. It works fine with the marker_single command. I tried some of the fixes from @conjuncts, but they did not solve the problem.

  • Operating System: Windows 11 Professional
  • OS Version: 24H2 (1000.26100.54.0)
  • System Model: Asus ROG Strix Z890-E
  • CPU: Intel(R) Core(TM) Ultra 9 285K 3.70 GHz (24 cores)
  • GPU: NVIDIA GeForce RTX 4060TI Laptop GPU
  • VRAM: 16 GB
  • CUDA Version: 11.8 (as indicated by torch 2.6.0+cu118)
  • PyTorch Version: 2.6.0+cu118
  • Marker Version: v1.6.1

clevesim • Apr 06 '25 08:04