docling Hyperlinks not identified in PDFs

Bug

When exporting PDF to markdown, hyperlinks are not extracted (only the display text of the hyperlink is shown) ...

Steps to reproduce

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # PDF path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())  # output: "### Docling Technical Report[...]"

Docling version

2.15 ...

Python version

3.10.7 ...

Jan 29 '25 03:01 kevinmt24

This is also true for HTML and DOCX conversions: same code as above, but with a local file as the source.

Jan 30 '25 20:01 that0n3guy

Hi guys, have you figured out a way to extract the hyperlinks? I am also facing the same issue. @PeterStaar-IBM, is there a workaround that we can use to extract this information?

Apr 08 '25 06:04 gformcreation

@gformcreation Yes, we are already identifying them in docling-parse; we now need to propagate them to the DoclingDocument.

@cau-git @vagenas Can we look into this at the post-processing?
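
For readers wondering what such a propagation step could look like, here is a minimal sketch. It is not docling's or docling-parse's actual code. It assumes the parser yields link annotations as a rectangle plus a URI and text cells as a rectangle plus a string, and it attaches a link to a cell when the rectangle covers most of the cell. All names (LinkAnnotation, TextCell, attach_links, to_markdown), the example URL, and the 0.5 threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical types for illustration only; these are NOT docling or
# docling-parse classes.
BBox = Tuple[float, float, float, float]  # (left, bottom, right, top)


@dataclass
class LinkAnnotation:
    bbox: BBox
    uri: str


@dataclass
class TextCell:
    bbox: BBox
    text: str
    hyperlink: Optional[str] = None


def overlap_ratio(cell: BBox, link: BBox) -> float:
    """Return the fraction of the cell's area covered by the link rectangle."""
    left, bottom = max(cell[0], link[0]), max(cell[1], link[1])
    right, top = min(cell[2], link[2]), min(cell[3], link[3])
    if right <= left or top <= bottom:
        return 0.0  # the rectangles do not intersect
    cell_area = (cell[2] - cell[0]) * (cell[3] - cell[1])
    if cell_area <= 0:
        return 0.0  # degenerate cell
    return (right - left) * (top - bottom) / cell_area


def attach_links(cells: List[TextCell], links: List[LinkAnnotation],
                 threshold: float = 0.5) -> None:
    """Give each cell the URI of the best-overlapping link, if it is good enough."""
    for cell in cells:
        best = max(links, key=lambda l: overlap_ratio(cell.bbox, l.bbox),
                   default=None)
        if best is not None and overlap_ratio(cell.bbox, best.bbox) >= threshold:
            cell.hyperlink = best.uri


def to_markdown(cells: List[TextCell]) -> str:
    """Render linked cells as [text](uri) and leave the others as plain text."""
    return " ".join(
        f"[{c.text}]({c.hyperlink})" if c.hyperlink else c.text for c in cells
    )


# Made-up coordinates and URL, for demonstration only.
cells = [TextCell((0, 0, 50, 10), "Docling"),
         TextCell((55, 0, 120, 10), "Technical Report")]
links = [LinkAnnotation((0, 0, 50, 10), "https://example.com/docling")]
attach_links(cells, links)
markdown = to_markdown(cells)
print(markdown)  # [Docling](https://example.com/docling) Technical Report
```

The same overlap test would in principle also cover links placed over images (such as a logo), if picture bounding boxes were matched in the same way as text cells.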

Apr 08 '25 07:04 PeterStaar-IBM

Thanks @PeterStaar-IBM for confirming that there's a way to have this. I'll be waiting to try it out.

Apr 10 '25 15:04 gformcreation

Experiencing the same behaviour. Are you guys aiming to extract the hyperlinks appearing in text and images, or macros? Sometimes in academic papers, the ORCID for each author is hyperlinked over a logo/macro. It would be so helpful to be able to get this data too.

A paper example: https://arxiv.org/pdf/2504.02024

Apr 15 '25 08:04 gbe3hunna

Hi @PeterStaar-IBM, thank you for your insightful comment on this issue. I appreciate the work being done here and wanted to check in to see if there have been any updates or further developments.

Apr 18 '25 07:04 gformcreation

Hi @PeterStaar-IBM, thanks for your contribution and comments above. When you have an update, please share it with us. Thanks!

Apr 28 '25 08:04 antonisprevenas

Will do!

Apr 28 '25 09:04 PeterStaar-IBM

Hi @PeterStaar-IBM Thanks for the whole project. The work being done is awesome! Is there any update on this feature?

May 12 '25 08:05 sbmyron

Hi, a similar issue arises when a .docx document contains words in automatic numbering format. Docling-SmolDocling fails to convert these words. For example, in the attached file https://www.3gpp.org/ftp/tsg_ran/WG1_RL1/TSGR1_120b/Docs//R1-2501739.zip, on page 3 (conclusions), the words Observations 1, 2, 3 and Proposals 1, 2, 3, 4 are in macros.

Docling outputs this (all correct, except that the words in macros are missing):

It would be great if this could be addressed, as there are millions of such documents (telecommunication standards) that many people use and that need to be converted to .md format.

May 17 '25 10:05 alexshmmy

Hi Team, wanted to check if we've had any luck on this?

May 23 '25 09:05 gformcreation

Hello everyone! I'm interested in this feature too. Any update on this?

Jul 18 '25 09:07 LDelPinoNT

I'd love the hyperlinks to be preserved too.

Aug 11 '25 09:08 Lodimup

Hi Team, I wanted to follow up and see if we had any luck on this?

Aug 29 '25 08:08 anjirkviiit

Hi @PeterStaar-IBM, I'm also facing the same issue with PDF to Markdown conversion. The hyperlinks are not preserved: only the display text shows up, and the actual URLs are lost. Could you please share any updates on the current status of this enhancement? Thanks!

Sep 05 '25 06:09 akshata-gawali