[BUG: Output] Embedded hyperlinks present in PDFs are lost in the final output.
Describe the Output Issue
External hyperlinks and email addresses embedded in PDFs are not preserved in the output. In both Markdown and HTML outputs, the visible link text appears, but the clickable hyperlink is missing.
When Marker parses PDF text, it reconstructs spans without consistently preserving each span's associated URL. As a result, the generated HTML/Markdown contains plain text instead of clickable links.
Root cause: URLs are extracted from pdftext, but downstream span creation and transformations rebuild spans without carrying over the url attribute, so anchors are dropped from the final HTML before Markdown conversion. This is most noticeable for long anchors that wrap across multiple lines, where the text is split into multiple spans.
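To illustrate the failure mode described above, here is a minimal sketch of what "carrying over the url attribute" could look like. It does not use Marker's real classes or API: `Span`, `rebuild_span`, and `render_html` are hypothetical names invented for this example, and the merging strategy (grouping consecutive spans that share a URL) is an assumption about one reasonable fix, not a description of how Marker works internally.

```python
from dataclasses import dataclass
from html import escape
from itertools import groupby
from typing import Optional


@dataclass
class Span:
    """Hypothetical stand-in for a parsed text span (not Marker's real class)."""
    text: str
    url: Optional[str] = None


def rebuild_span(span: Span, new_text: str) -> Span:
    # When a span is rebuilt during a transformation, copy the url across
    # instead of constructing the new span from the text alone.
    return Span(text=new_text, url=span.url)


def render_html(spans: list[Span]) -> str:
    # Group consecutive spans that share the same url so that an anchor
    # wrapped across several lines/spans becomes a single <a> element.
    parts = []
    for url, group in groupby(spans, key=lambda s: s.url):
        text = escape(" ".join(s.text.strip() for s in group))
        if url:
            parts.append(f'<a href="{escape(url, quote=True)}">{text}</a>')
        else:
            parts.append(text)
    return " ".join(parts)


if __name__ == "__main__":
    link = "https://long-anchor.example.com/path?query=1"
    spans = [
        Span("See"),
        Span("a very long", url=link),   # first line of the wrapped anchor
        Span("anchor text", url=link),   # continuation on the next line
        Span("here."),
    ]
    print(render_html(spans))
```

With this approach the two wrapped spans are emitted as one anchor rather than as plain text or as separate per-fragment links. Whether the real fix belongs in span creation or in a later processor is something only the maintainers can confirm.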
Input Document
Current Output
This is a very long hyperlink anchor text that will wrap across multiple lines in the PDF and should still be recognized as a single clickable link pointing to the target website.
Linkedin
Expected Output
All embedded links should be preserved as clickable links:
This is a very long hyperlink anchor text that will wrap across multiple lines in [the](https://long-anchor.example.com/path?query=1) PDF [and](https://long-anchor.example.com/path?query=1) [should](https://long-anchor.example.com/path?query=1) [still](https://long-anchor.example.com/path?query=1) be [recognized](https://long-anchor.example.com/path?query=1) as a [single](https://long-anchor.example.com/path?query=1) [clickable](https://long-anchor.example.com/path?query=1) [link](https://long-anchor.example.com/path?query=1) [pointing](https://long-anchor.example.com/path?query=1) to [the](https://long-anchor.example.com/path?query=1) [target](https://long-anchor.example.com/path?query=1) [website.](https://long-anchor.example.com/path?query=1)
[Linkedin](http://www.linkedin.com/in/jules-gerard-ai23)
Environment
Please fill in all relevant details:
- Marker version: "marker-pdf[full]==1.8.5"
- Python version: "3.11"
@VikParuchuri Any idea on this issue?
This is similar to the older issue https://github.com/datalab-to/marker/issues/283. I have also noticed that link extraction mostly works with the current version, 1.10.1.
However, I have noticed that links inside table cells are not extracted; only the plain text is returned there. Can this be fixed with a custom processor, or is it a deeper problem in the underlying table OCR?