Hyperlinks not identified in PDFs
Bug
When exporting a PDF to Markdown, hyperlinks are not extracted; only the display text of each hyperlink is shown ...
Steps to reproduce
from docling.document_converter import DocumentConverter
source = "https://arxiv.org/pdf/2408.09869" # PDF path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown()) # output: "### Docling Technical Report[...]"
Docling version
2.15 ...
Python version
3.10.7 ...
This is also true for HTML and DOCX conversions: same code as above, but with a local file as the source.
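For the DOCX case, the link targets can at least be read directly from the file, since a .docx is a ZIP archive whose hyperlink URLs live in `word/_rels/document.xml.rels`. Below is a minimal sketch using only the Python standard library; it does not go through Docling at all. The helper name `docx_hyperlinks` is hypothetical, and it assumes the links are in the main document body (not headers, footers, or footnotes).

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by WordprocessingML and the package relationship parts.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"
PKG = "http://schemas.openxmlformats.org/package/2006/relationships"


def docx_hyperlinks(docx_file):
    """Return (display_text, url) pairs for hyperlinks in the main body.

    `docx_file` is a path or a binary file-like object. Hypothetical helper,
    not part of Docling.
    """
    with zipfile.ZipFile(docx_file) as zf:
        rels = ET.fromstring(zf.read("word/_rels/document.xml.rels"))
        body = ET.fromstring(zf.read("word/document.xml"))

    # Map relationship ids (e.g. "rId5") to their hyperlink targets.
    targets = {
        rel.get("Id"): rel.get("Target")
        for rel in rels.iter(f"{{{PKG}}}Relationship")
        if rel.get("Type", "").endswith("/hyperlink")
    }

    links = []
    for node in body.iter(f"{{{W}}}hyperlink"):
        url = targets.get(node.get(f"{{{R}}}id"))
        if url is None:
            continue  # internal bookmark link (w:anchor), no URL to report
        # The display text may be split across several runs; join them.
        text = "".join(t.text or "" for t in node.iter(f"{{{W}}}t"))
        links.append((text, url))
    return links
```

Calling `docx_hyperlinks("report.docx")` (a placeholder path) returns the pairs in document order, which can then be matched against the display text in the exported Markdown.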
Hi guys, have you figured out a way to extract the hyperlinks? I am also facing the same issue. @PeterStaar-IBM, is there a workaround we can use to extract this information?
@gformcreation Yes, we are identifying them in docling-parse; we now need to propagate them to the DoclingDocument.
@cau-git @vagenas Can we look into this at the post-processing?
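Until the links are propagated, one crude stopgap outside Docling is to scan the PDF bytes for URI actions, which is how link annotations (including links placed over images such as logos) store their targets. The sketch below uses only the Python standard library and is best-effort by design: it assumes an unencrypted PDF, literal (not hex) URI strings, and Flate-compressed object streams, and it returns only the URLs, not the display text or page position. `scan_pdf_uris` is a hypothetical helper name; a proper PDF library would be more robust.

```python
import re
import zlib

# "/URI (…)" literal strings, allowing backslash-escaped characters inside.
URI_RE = re.compile(rb"/URI\s*\(((?:\\.|[^\\()])*)\)")
# Raw stream bodies; annotation dictionaries are often packed into these.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def scan_pdf_uris(pdf_bytes):
    """Best-effort list of URI action targets found in raw PDF bytes.

    Hypothetical helper, not part of Docling or docling-parse.
    """
    chunks = [pdf_bytes]
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes before "endstream".
            chunks.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            continue  # not Flate data (image, other filter, ...): skip it

    uris = []
    for chunk in chunks:
        for raw in URI_RE.findall(chunk):
            # Drop the backslash from escapes such as "\(" and "\)".
            uri = re.sub(rb"\\(.)", rb"\1", raw).decode("latin-1")
            if uri not in uris:  # keep first-seen order, no duplicates
                uris.append(uri)
    return uris
```

For a local file this would be called as `scan_pdf_uris(open("paper.pdf", "rb").read())`, where `paper.pdf` is a placeholder path.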
Thanks @PeterStaar-IBM for confirming that there's a way to have this. I'll be waiting to try it out.
Experiencing the same behaviour. Are you guys aiming to extract the hyperlinks appearing in text and images, or macros? Sometimes in academic papers, each author's ORCID is hyperlinked over a logo/macro. It would be very helpful to be able to get this data too.
A paper example: https://arxiv.org/pdf/2504.02024
Hi @PeterStaar-IBM, thank you for your insightful comment on this issue. I appreciate the work being done here and wanted to check in to see if there have been any updates or further developments.
Hi @PeterStaar-IBM, thanks for your contributions and comments above. When you have any update, please share it with us. Thanks!
Will do!
Hi @PeterStaar-IBM, thanks for the whole project. The work being done is awesome! Is there any update on this feature?
Hi, a similar issue arises when a .docx document contains words in automatic numbering format. Docling-SmolDocling fails to convert these words. For example, in the attached file https://www.3gpp.org/ftp/tsg_ran/WG1_RL1/TSGR1_120b/Docs//R1-2501739.zip, on page 3 (Conclusions), the words "Observations 1, 2, 3" and "Proposals 1, 2, 3, 4" are in macros.
Docling outputs this (all correct, except that the words in macros are missing):
It would be great if this could be addressed, as there are millions of such documents (telecommunication standards) in wide use that need to be converted to .md format.
Hi Team, I wanted to check whether we have had any luck on this?
Hello everyone! I'm interested in this feature too. Any update on this?
I'd love the hyperlinks to be preserved too.
Hi Team, I wanted to follow up and see if we had any luck on this?
Hi @PeterStaar-IBM, I'm also facing the same issue with PDF to Markdown conversion. The hyperlinks are not preserved: only the display text shows up, and the actual URLs are lost. Could you please share any updates on the current status of this enhancement? Thanks!