
[BUG: Breaking] Marker cannot process files with signature lines

Open jacquesg opened this issue 2 months ago • 4 comments

🧨 Describe the Bug

Marker fails on error1.pdf and error2.pdf. Both contain runs of repeated dots, either as leader dots in a table of contents or as a placeholder line for a signature.
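To narrow down which lines of a document carry such dot runs, the extracted text can be scanned with a regular expression. This is only a rough sketch: the helper name `lines_with_dot_runs` and the threshold of four dots are assumptions made for illustration, and nothing here is part of Marker or Surya.

```python
import re

# Assumption: a "run" is four or more dots, optionally separated by single
# spaces, so both "........" and ". . . . ." are matched.
DOT_RUN = re.compile(r"(?:\.\s?){4,}")


def lines_with_dot_runs(text):
    """Return (line_number, line) pairs for lines containing a run of dots."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if DOT_RUN.search(line)
    ]


sample = "1 Introduction ........ 3\nSigned: . . . . . .\nPlain sentence.\nWait..."
for number, line in lines_with_dot_runs(sample):
    print(number, line)
```

Signature lines drawn with underscores or vector rules would not be caught by this pattern.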

📄 Input Document

error1.pdf error2.pdf success1.pdf

footnotes.pdf

📤 Output Trace / Stack Trace

Paste the complete stack trace or error output, if available.

marker_single --disable_image_extraction ~/dev/projects/asterias/seshat/test.pdf
Recognizing layout: 100%|██████████| 1/1 [00:00<00:00,  1.65it/s]
Running OCR Error Detection: 100%|██████████| 1/1 [00:00<00:00,  7.20it/s]
Detecting bboxes: 0it [00:00, ?it/s]
Detecting bboxes: 0it [00:00, ?it/s]
Recognizing tables:   0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/Users/jacquesg/tmp/marker-test/.venv/bin/marker_single", line 10, in <module>
    sys.exit(convert_single_cli())
             ~~~~~~~~~~~~~~~~~~^^
  File "/Users/jacquesg/tmp/marker-test/.venv/lib/python3.13/site-packages/click/core.py", line 1442, in __call__
    return self.main(*args, **kwargs)
           ~~~~~~~~~^^^^^^^^^^^^^^^^^
  File "/Users/jacquesg/tmp/marker-test/.venv/lib/python3.13/site-packages/click/core.py", line 1363, in main
    rv = self.invoke(ctx)
  File "/Users/jacquesg/tmp/marker-test/.venv/lib/python3.13/site-packages/click/core.py", line 1226, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/jacquesg/tmp/marker-test/.venv/lib/python3.13/site-packages/click/core.py", line 794, in invoke
    return callback(*args, **kwargs)
  File "/Users/jacquesg/tmp/marker-test/.venv/lib/python3.13/site-packages/marker/scripts/convert_single.py", line 38, in convert_single_cli
    rendered = converter(fpath)
  File "/Users/jacquesg/tmp/marker-test/.venv/lib/python3.13/site-packages/marker/converters/pdf.py", line 193, in __call__
    document = self.build_document(temp_path)
  File "/Users/jacquesg/tmp/marker-test/.venv/lib/python3.13/site-packages/marker/converters/pdf.py", line 187, in build_document
    processor(document)
    ~~~~~~~~~^^^^^^^^^^
  File "/Users/jacquesg/tmp/marker-test/.venv/lib/python3.13/site-packages/marker/processors/table.py", line 126, in __call__
    tables: List[TableResult] = self.table_rec_model(
                                ~~~~~~~~~~~~~~~~~~~~^
        [t["table_image"] for t in table_data],
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        batch_size=self.get_table_rec_batch_size(),
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/Users/jacquesg/tmp/marker-test/.venv/lib/python3.13/site-packages/surya/table_rec/__init__.py", line 32, in __call__
    return self.batch_table_recognition(images, batch_size)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "/Users/jacquesg/tmp/marker-test/.venv/lib/python3.13/site-packages/surya/table_rec/__init__.py", line 220, in batch_table_recognition
    row_encoder_hidden_states = torch.stack(row_encoder_hidden_states)
RuntimeError: stack expects a non-empty TensorList

โš™๏ธ Environment

Please fill in all relevant details:

  • Marker version: 1.9.3
  • Surya version: 0.16.7
  • Python version: 3.13
  • PyTorch version: 2.8.0
  • Transformers version: 4.56.1
  • Operating System: macOS and Linux
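The details listed above can be gathered with a short standard-library script. This is a minimal sketch, assuming the distributions are installed under the names `marker-pdf` and `surya-ocr`; adjust the list if your environment uses different names. The function name `collect_environment` is hypothetical.

```python
import platform
from importlib import metadata

# Assumption: these are the installed distribution names; edit as needed.
PACKAGES = ["marker-pdf", "surya-ocr", "torch", "transformers"]


def collect_environment(packages=PACKAGES):
    """Return a dict of interpreter, OS, and package version details."""
    report = {
        "python": platform.python_version(),
        "os": platform.system(),
        "machine": platform.machine(),
    }
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Keep the key so that missing packages are visible in the report.
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

Including the `machine` field helps distinguish Apple silicon (`arm64`) from Intel Macs, which matters for device-specific bugs.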

✅ Expected Behavior

No exception.

📟 Command or Code Used

Paste the exact bash command or Python code you used to run Marker:

marker_single --disable_image_extraction ~/dev/projects/asterias/seshat/test.pdf

📎 Additional Context

Same result with and without --disable_image_extraction

jacquesg, Sep 15 '25 08:09

Hi @jacquesg, thanks for providing the detailed info and the PDFs. Unfortunately, I'm unable to reproduce this issue locally on my Mac with marker 1.9.3 and surya 0.16.7. Can you share whether this is running on GPU or on macOS?

zanussbaum, Sep 18 '25 02:09

Hi, I'm unable to reproduce this locally after reinstalling the latest version.

However, I've found that footnotes.pdf takes a disproportionate amount of time to process, given that it is only 4 pages long.

jacquesg, Sep 20 '25 11:09

Great to hear, @jacquesg! This was an issue with torch on Apple silicon. In later versions of surya, we pinned the table model (where the issue was occurring) to CPU on Apple devices.
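For readers on an older version, the idea of pinning a model to CPU on Apple hosts can be sketched as below. This is not Surya's actual code: the function `pick_table_device` and its parameters are hypothetical, and the check assumes that "Apple devices" can be approximated by `platform.system()` returning `"Darwin"`.

```python
import platform
from typing import Optional


def pick_table_device(preferred: str = "cuda", system: Optional[str] = None) -> str:
    """Return "cpu" on Apple (Darwin) hosts, otherwise the preferred device.

    `system` is injectable so the choice can be checked without a Mac;
    by default it is read from platform.system().
    """
    current = system if system is not None else platform.system()
    if current == "Darwin":
        # Hypothetical workaround: avoid the accelerator for this one model.
        return "cpu"
    return preferred


print(pick_table_device("mps", system="Darwin"))
print(pick_table_device("cuda", system="Linux"))
```

The returned string would then be passed to whatever device argument the model loader accepts; upgrading surya remains the simpler fix.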

As for the runtime, can you share the logs and more hardware details? I just tried on an H100, and the processing time is in line with the other docs we generally test.

tarun-menta, Sep 23 '25 13:09

Any updates on the slow processing time, @jacquesg?

tarun-menta, Sep 29 '25 14:09