
Error when using marker_single with parameter "--disable_links"

Open fatualux opened this issue 10 months ago • 0 comments

Hi and thanks for your work.

Description of the process

  • Conversion of PDF to markdown.
  • Command used: marker_single [pdf_file] --disable_image_extraction --output_format markdown --disable_links --output_dir .
  • Alternative method tried: the Python function below, which calls the converter directly. The KeyError is raised while extracting text from the PDF, when the code attempts to access a key named 'refs' in a dictionary that doesn't contain it, so error handling was added to deal gracefully with the missing data.

Here is the Python version with that error handling:

from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict


def pdf_to_markdown(file: str) -> str:
    config = {
        "output_format": "markdown",
        "disable_image_extraction": "true",
        "disable_links": "true",
    }
    config_parser = ConfigParser(config)

    converter = PdfConverter(
        config=config_parser.generate_config_dict(),
        artifact_dict=create_model_dict(),
        renderer=config_parser.get_renderer(),
    )

    try:
        # The KeyError is raised inside the converter call.
        rendered = converter(file)
        return rendered.markdown
    except KeyError as e:
        print(f"Error occurred during PDF processing: {e}")
        return ""

if __name__ == "__main__":
    file = "file_location"
    output = pdf_to_markdown(file)

    # Print the result once; an empty string signals a conversion error.
    if output:
        print(output)
    else:
        print("The output is empty due to an error in PDF processing.")

Expected behaviour: I expected to get the markdown file (bash command) or string (Python function) without HTML tags (e.g. ,

Current behaviour: I ran into this error:

File "/root/Debian/test/.venv/lib/python3.11/site-packages/marker/providers/pdf.py", line 235, in pdftext_extraction
    self.page_refs[page_id] = page["refs"]
                              ~~~~^^^^^^^^
KeyError: 'refs'
COMPLETE LOG

root-IHT/testing-.venv-~/Debian/test - marker_single tests/resources/test_pdf/fox.pdf --disable_image_extraction --output_format markdown --disable_links --output_dir .

Loaded layout model s3://layout/2025_02_18 on device cuda with dtype torch.float16
Loaded texify model s3://texify/2025_02_18 on device cuda with dtype torch.float16
Loaded recognition model s3://text_recognition/2025_02_18 on device cuda with dtype torch.float16
Loaded table recognition model s3://table_recognition/2025_02_18 on device cuda with dtype torch.float16
Loaded detection model s3://text_detection/2025_02_18 on device cuda with dtype torch.float16
Loaded detection model s3://inline_math_detection/2025_02_24 on device cuda with dtype torch.float16
Traceback (most recent call last):
  File "/root/Debian/test/.venv/bin/marker_single", line 8, in <module>
    sys.exit(convert_single_cli())
             ^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/marker/scripts/convert_single.py", line 35, in convert_single_cli
    rendered = converter(fpath)
               ^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/marker/converters/pdf.py", line 151, in __call__
    document = self.build_document(filepath)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/marker/converters/pdf.py", line 140, in build_document
    provider = provider_cls(filepath, self.config)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/marker/providers/pdf.py", line 94, in __init__
    self.page_lines = self.pdftext_extraction(doc)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/marker/providers/pdf.py", line 235, in pdftext_extraction
    self.page_refs[page_id] = page["refs"]
                              ~~~~^^^^^^^^
KeyError: 'refs'

FIX PROPOSAL Here is a proposal to fix this issue.
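
One possible direction is sketched below against stand-in dictionaries, not marker's real code: the failing line reads page["refs"] directly, so a guarded lookup would avoid the KeyError when a page carries no "refs" entry. The helper name collect_page_refs and the shape of the page dictionaries are assumptions made for illustration, and it is only a guess that --disable_links is what causes the key to be absent.

```python
from typing import Any


def collect_page_refs(pages: list[dict[str, Any]]) -> dict[int, list[Any]]:
    """Map page index to its refs, tolerating pages without a "refs" key.

    Hypothetical helper: it mimics the assignment seen in the traceback
    (self.page_refs[page_id] = page["refs"]) on plain dictionaries.
    """
    page_refs: dict[int, list[Any]] = {}
    for page_id, page in enumerate(pages):
        # dict.get returns the fallback instead of raising KeyError.
        page_refs[page_id] = page.get("refs", [])
    return page_refs


if __name__ == "__main__":
    # Stand-in data: the second page has no "refs" entry at all.
    sample_pages = [
        {"blocks": [], "refs": ["ref-a"]},
        {"blocks": []},
    ]
    print(collect_page_refs(sample_pages))
```

Whether an empty list is the right fallback depends on how the rest of the provider consumes page_refs, which this sketch does not cover.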

Thanks in advance for your time.

fatualux, Feb 26 '25 11:02