marker
[BUG: Breaking] Marker cannot process files with signature lines
Describe the Bug
Marker fails on error1.pdf and error2.pdf. Both contain long runs of repeated dots, either as leader dots in the table of contents of one of the documents or as a placeholder line for a signature.
Input Document
error1.pdf error2.pdf success1.pdf
Output Trace / Stack Trace
marker_single --disable_image_extraction ~/dev/projects/asterias/seshat/test.pdf
Recognizing layout: 100%|██████████| 1/1 [00:00<00:00, 1.65it/s]
Running OCR Error Detection: 100%|██████████| 1/1 [00:00<00:00, 7.20it/s]
Detecting bboxes: 0it [00:00, ?it/s]
Detecting bboxes: 0it [00:00, ?it/s]
Recognizing tables: 0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Users/jacquesg/tmp/marker-test/.venv/bin/marker_single", line 10, in <module>
    sys.exit(convert_single_cli())
             ~~~~~~~~~~~~~~~~~~^^
  File "/Users/jacquesg/tmp/marker-test/.venv/lib/python3.13/site-packages/click/core.py", line 1442, in __call__
    return self.main(*args, **kwargs)
           ~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/Users/jacquesg/tmp/marker-test/.venv/lib/python3.13/site-packages/click/core.py", line 1363, in main
    rv = self.invoke(ctx)
  File "/Users/jacquesg/tmp/marker-test/.venv/lib/python3.13/site-packages/click/core.py", line 1226, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jacquesg/tmp/marker-test/.venv/lib/python3.13/site-packages/click/core.py", line 794, in invoke
    return callback(*args, **kwargs)
  File "/Users/jacquesg/tmp/marker-test/.venv/lib/python3.13/site-packages/marker/scripts/convert_single.py", line 38, in convert_single_cli
    rendered = converter(fpath)
  File "/Users/jacquesg/tmp/marker-test/.venv/lib/python3.13/site-packages/marker/converters/pdf.py", line 193, in __call__
    document = self.build_document(temp_path)
  File "/Users/jacquesg/tmp/marker-test/.venv/lib/python3.13/site-packages/marker/converters/pdf.py", line 187, in build_document
    processor(document)
    ~~~~~~~~~^^^^^^^^^^
  File "/Users/jacquesg/tmp/marker-test/.venv/lib/python3.13/site-packages/marker/processors/table.py", line 126, in __call__
    tables: List[TableResult] = self.table_rec_model(
                                ~~~~~~~~~~~~~~~~~~~~^
        [t["table_image"] for t in table_data],
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        batch_size=self.get_table_rec_batch_size(),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/jacquesg/tmp/marker-test/.venv/lib/python3.13/site-packages/surya/table_rec/__init__.py", line 32, in __call__
    return self.batch_table_recognition(images, batch_size)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/Users/jacquesg/tmp/marker-test/.venv/lib/python3.13/site-packages/surya/table_rec/__init__.py", line 220, in batch_table_recognition
    row_encoder_hidden_states = torch.stack(row_encoder_hidden_states)
RuntimeError: stack expects a non-empty TensorList
Environment
- Marker version: 1.9.3
- Surya version: 0.16.7
- Python version: 3.13
- PyTorch version: 2.8.0
- Transformers version: 4.56.1
- Operating System: macOS and Linux
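The version details above can also be gathered programmatically, which helps when comparing a failing and a working install. The following is a minimal sketch using only the standard library. The helper names `collect_versions` and `environment_report` are hypothetical, and the distribution names `marker-pdf` and `surya-ocr` are an assumption about how the packages are published; adjust them if your install uses different names.

```python
import platform
from importlib import metadata


def collect_versions(packages):
    """Return {distribution name: version string or 'not installed'}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


def environment_report(
    # Assumed distribution names; not taken from the issue itself.
    packages=("marker-pdf", "surya-ocr", "torch", "transformers"),
):
    """Build a plain-text report similar to the Environment list above."""
    lines = [
        f"Python version: {platform.python_version()}",
        f"Operating System: {platform.system()} {platform.release()} "
        f"({platform.machine()})",
    ]
    for name, version in collect_versions(packages).items():
        lines.append(f"{name}: {version}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(environment_report())
```

The machine architecture is included because the same operating system can behave differently on different hardware.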
Expected Behavior
Marker converts both files without raising an exception.
Command or Code Used
marker_single --disable_image_extraction ~/dev/projects/asterias/seshat/test.pdf
Additional Context
Same result with and without --disable_image_extraction.
Hi @jacquesg, thanks for providing the detailed info and the PDFs. Unfortunately, I'm unable to reproduce this issue locally on my Mac with marker 1.9.3 and surya 0.16.7. Could you share more detail on the hardware, in particular whether this is running on a GPU or on macOS?
Hi, I'm unable to reproduce this locally after reinstalling the latest version.
However, I've found that footnotes.pdf takes a disproportionately long time to process, given that it is only 4 pages long.
Great to hear, @jacquesg! This was an issue with torch on Apple silicon. In later versions of surya, we pinned the table model (where the issue was occurring) to the CPU on Apple devices.
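For readers wondering what pinning a model to the CPU on Apple devices might look like, here is a minimal sketch of the idea. It is not surya's actual code: `table_model_device` is a hypothetical helper, and the device strings are only illustrative.

```python
import platform


def table_model_device(requested, system=None):
    """Pick a device string for the table model.

    Hypothetical helper, not part of surya: on macOS the table model is
    forced onto the CPU, and every other platform keeps the requested device.
    """
    system = platform.system() if system is None else system
    if system == "Darwin":  # platform.system() reports "Darwin" on macOS
        return "cpu"
    return requested


if __name__ == "__main__":
    print(table_model_device("mps", system="Darwin"))
    print(table_model_device("cuda", system="Linux"))
```

Only the affected model is moved in this sketch, so the other models can still use the accelerator.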
As for the runtime, can you share the logs and more hardware details? I just tried on an H100, and the processing time is in line with the other documents we generally test.
Any updates on the slow processing time, @jacquesg?
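One way to produce the requested logs and a comparable runtime figure is to wrap the conversion command in a small timing script. The sketch below uses only the standard library. `time_command` is a hypothetical helper, and the script name in the usage comment is arbitrary.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv and return (elapsed_seconds, return_code, combined_output)."""
    start = time.perf_counter()
    proc = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # capture progress bars and tracebacks too
        text=True,
    )
    elapsed = time.perf_counter() - start
    return elapsed, proc.returncode, proc.stdout


if __name__ == "__main__":
    # Usage: python time_marker.py marker_single footnotes.pdf
    if len(sys.argv) > 1:
        elapsed, code, output = time_command(sys.argv[1:])
        print(output)
        print(f"exit code {code}, wall time {elapsed:.1f}s")
    else:
        print("usage: python time_marker.py <command> [args...]")
```

Running the same script against footnotes.pdf and a document of similar length gives two wall-clock times and logs that can be pasted into the thread.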