Error when using marker_single with parameter "--disable_links"
Hi and thanks for your work.
Description of the process
- Conversion of PDF to markdown.
- Command used:
marker_single [pdf_file] --disable_image_extraction --output_format markdown --disable_links --output_dir .
- Alternative method tried:
Based on the provided information, it appears that you are encountering a KeyError while extracting text from a PDF file, specifically when the code attempts to access a key named 'refs' in a dictionary that doesn't contain it. To address this issue, we can add error handling to ensure that we gracefully deal with situations where the expected data is not present.
Here’s a revised version of your original code to enhance its robustness:
from marker.config.parser import ConfigParser
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict


def pdf_to_markdown(file: str) -> str:
    """Convert a PDF to markdown; return "" if the conversion raises KeyError."""
    config = {
        "output_format": "markdown",
        "disable_image_extraction": "true",
        "disable_links": "true",
    }
    config_parser = ConfigParser(config)
    converter = PdfConverter(
        config=config_parser.generate_config_dict(),
        artifact_dict=create_model_dict(),
        renderer=config_parser.get_renderer(),
    )
    try:
        rendered = converter(file)
        return rendered.markdown
    except KeyError as e:
        # This is where the missing 'refs' key surfaces.
        print(f"Error occurred during PDF processing: {e}")
        return ""


if __name__ == "__main__":
    file = "file_location"
    output = pdf_to_markdown(file)
    # Print the result once, or report the failure.
    if output:
        print(output)
    else:
        print("The output is empty due to an error in PDF processing.")
Expected behaviour
I expected to get the markdown file (bash command) or string (Python function) without HTML tags (e.g. ,
Current behaviour
I ran into this error:
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/marker/providers/pdf.py", line 235, in pdftext_extraction
    self.page_refs[page_id] = page["refs"]
                              ~~~~^^^^^^^^
KeyError: 'refs'
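The failing line indexes `page["refs"]` unconditionally. Below is a minimal sketch of a defensive lookup, assuming (this is not confirmed against the marker or pdftext source) that the extracted page dictionaries simply omit the `refs` key when links are disabled. The function `collect_page_refs` and the sample page dictionaries are hypothetical and not part of marker:

```python
from typing import Any


def collect_page_refs(pages: list[dict[str, Any]]) -> dict[int, list[Any]]:
    """Map each page index to its references, tolerating a missing 'refs' key."""
    page_refs: dict[int, list[Any]] = {}
    for page_id, page in enumerate(pages):
        # page["refs"] raises KeyError when the key is absent;
        # dict.get() falls back to an empty list instead.
        page_refs[page_id] = page.get("refs", [])
    return page_refs


# Hypothetical page dicts: the second one has no "refs" entry.
pages = [
    {"blocks": [], "refs": ["ref-a"]},
    {"blocks": []},
]
print(collect_page_refs(pages))
```

Whether an empty list is the right fallback depends on how `page_refs` is consumed later in the pipeline, which the traceback does not show.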
COMPLETE LOG
root-IHT/testing-.venv-~/Debian/test - marker_single tests/resources/test_pdf/fox.pdf --disable_image_extraction --output_format markdown --disable_links --output_dir .
Loaded layout model s3://layout/2025_02_18 on device cuda with dtype torch.float16
Loaded texify model s3://texify/2025_02_18 on device cuda with dtype torch.float16
Loaded recognition model s3://text_recognition/2025_02_18 on device cuda with dtype torch.float16
Loaded table recognition model s3://table_recognition/2025_02_18 on device cuda with dtype torch.float16
Loaded detection model s3://text_detection/2025_02_18 on device cuda with dtype torch.float16
Loaded detection model s3://inline_math_detection/2025_02_24 on device cuda with dtype torch.float16
Traceback (most recent call last):
  File "/root/Debian/test/.venv/bin/marker_single", line 8, in <module>
    sys.exit(convert_single_cli())
             ^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/marker/scripts/convert_single.py", line 35, in convert_single_cli
    rendered = converter(fpath)
               ^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/marker/converters/pdf.py", line 151, in __call__
    document = self.build_document(filepath)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/marker/converters/pdf.py", line 140, in build_document
    provider = provider_cls(filepath, self.config)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/marker/providers/pdf.py", line 94, in __init__
    self.page_lines = self.pdftext_extraction(doc)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Debian/test/.venv/lib/python3.11/site-packages/marker/providers/pdf.py", line 235, in pdftext_extraction
    self.page_refs[page_id] = page["refs"]
                              ~~~~^^^^^^^^
KeyError: 'refs'
FIX PROPOSAL
Here is a proposal to fix this issue.
Thanks in advance for your time.