[BUG: Output] Embedded hyperlinks present in PDFs are lost in the final output.

Open 2107Akanksha opened this issue 3 months ago • 2 comments

πŸ“ Describe the Output Issue

Embedded hyperlinks present in PDFs are lost in the final output.

External hyperlinks and email addresses embedded in PDFs are not preserved in the output. In both Markdown and HTML outputs, the visible link text appears, but the clickable hyperlink is missing.

When Marker parses PDF text, it reconstructs spans without consistently preserving each span's associated URL, especially for long anchors that wrap across multiple lines. As a result, the generated HTML/Markdown contains plain text instead of clickable links.

Root cause: URLs are extracted from pdftext, but downstream span creation and transformations rebuild spans without carrying over the url attribute, causing anchors to be dropped in the final HTML before Markdown conversion. This is most noticeable for long, multiline anchors where text is split into multiple spans/lines.
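To illustrate the behaviour described above, here is a minimal sketch of rebuilding spans while carrying the URL along. The `Span` dataclass, `merge_spans`, and `to_markdown` are hypothetical names made up for this example; they are not Marker's or pdftext's actual classes or functions, and the real fix may look quite different.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Optional


@dataclass
class Span:
    # Hypothetical span type for illustration only; not Marker's real class.
    text: str
    url: Optional[str] = None


def merge_spans(spans: list[Span]) -> list[Span]:
    """Join consecutive spans that share the same url, keeping that url."""
    merged = []
    for url, group in groupby(spans, key=lambda s: s.url):
        text = " ".join(s.text.strip() for s in group)
        merged.append(Span(text=text, url=url))
    return merged


def to_markdown(spans: list[Span]) -> str:
    """Render spans as Markdown, emitting one link per run of linked spans."""
    parts = [
        f"[{s.text}]({s.url})" if s.url else s.text
        for s in merge_spans(spans)
    ]
    return " ".join(parts)


# An anchor split over two lines, followed by unlinked text.
target = "https://long-anchor.example.com/path?query=1"
lines = [
    Span("This is a very long hyperlink anchor text", target),
    Span("that wraps across multiple lines.", target),
    Span("Plain text follows."),
]
print(to_markdown(lines))
```

Because the url travels with every rebuilt span, the two wrapped lines come out as a single Markdown link rather than as plain text or as one link per fragment.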

📄 Input Document

multiline_link.pdf

📤 Current Output

This is a very long hyperlink anchor text that will wrap across multiple lines in the PDF and should still be recognized as a single clickable link pointing to the target website.

Linkedin

✅ Expected Output

All available embedded links in clickable format

This is a very long hyperlink anchor text that will wrap across multiple lines in [the](https://long-anchor.example.com/path?query=1) PDF [and](https://long-anchor.example.com/path?query=1) [should](https://long-anchor.example.com/path?query=1) [still](https://long-anchor.example.com/path?query=1) be [recognized](https://long-anchor.example.com/path?query=1) as a [single](https://long-anchor.example.com/path?query=1) [clickable](https://long-anchor.example.com/path?query=1) [link](https://long-anchor.example.com/path?query=1) [pointing](https://long-anchor.example.com/path?query=1) to [the](https://long-anchor.example.com/path?query=1) [target](https://long-anchor.example.com/path?query=1) [website.](https://long-anchor.example.com/path?query=1)

[Linkedin](http://www.linkedin.com/in/jules-gerard-ai23)

βš™οΈ Environment

Please fill in all relevant details:

  • Marker version: "marker-pdf[full]==1.8.5"
  • Python version: "3.11"

2107Akanksha, Sep 01 '25 12:09

@VikParuchuri Any idea on this issue?

arunpkm, Sep 04 '25 19:09

Similar to the older issue https://github.com/datalab-to/marker/issues/283. I have also noticed that link extraction mostly works with the current version, 1.10.1.

However, I have noticed that links inside table cells are not extracted; only the plain text is returned there. Can this be fixed with a custom processor, or is it a deeper issue in the underlying table OCR, etc.?
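For what it's worth, the geometric part of such a fix could be sketched roughly as below. This assumes the link rectangles and the table cell bounding boxes are both available in the same page coordinate system, which I have not verified for Marker's table pipeline. `Box`, `overlap_area`, and `links_for_cells` are hypothetical names for illustration, not part of Marker.

```python
from typing import NamedTuple


class Box(NamedTuple):
    # Hypothetical axis-aligned bounding box in page coordinates.
    x0: float
    y0: float
    x1: float
    y1: float


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes, or 0 if they do not overlap."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return w * h if w > 0 and h > 0 else 0.0


def links_for_cells(
    cells: dict[str, Box], links: list[tuple[Box, str]]
) -> dict[str, list[str]]:
    """Attach each link URL to the cell that its rectangle overlaps most."""
    result: dict[str, list[str]] = {name: [] for name in cells}
    for rect, url in links:
        best = max(
            cells,
            key=lambda name: overlap_area(cells[name], rect),
            default=None,
        )
        # Skip links that do not touch any cell at all.
        if best is not None and overlap_area(cells[best], rect) > 0:
            result[best].append(url)
    return result


# Two side-by-side cells and one link rectangle inside the second cell.
cells = {"r0c0": Box(0, 0, 100, 20), "r0c1": Box(100, 0, 200, 20)}
links = [(Box(110, 5, 180, 15), "https://example.com")]
print(links_for_cells(cells, links))
```

Whether this can live in a custom processor depends on whether the cell boxes and link annotations are still accessible at that stage, which is exactly the open question here.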

Yelinz, Oct 14 '25 09:10