Hyperlinks not identified in PDFs

Open kevinmt24 opened this issue 11 months ago • 11 comments

Bug

When exporting PDF to markdown, hyperlinks are not extracted (only the display text of the hyperlink is shown) ...

Steps to reproduce

from docling.document_converter import DocumentConverter

source = "https://arxiv.org/pdf/2408.09869"  # PDF path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())  # output: "### Docling Technical Report[...]"

Docling version

2.15 ...

Python version

3.10.7 ...

kevinmt24 avatar Jan 29 '25 03:01 kevinmt24

This is also true for HTML and DOCX conversions. Same code as above, but using a local file as the source.

that0n3guy avatar Jan 30 '25 20:01 that0n3guy

Hi guys, have you figured out a way to extract the hyperlinks? I am also facing the same issue. @PeterStaar-IBM, is there a workaround that we can use to extract the information?

gformcreation avatar Apr 08 '25 06:04 gformcreation

@gformcreation Yes, we are identifying them in docling-parse; we now need to propagate them to the DoclingDocument.

@cau-git @vagenas Can we look into this at the post-processing?

PeterStaar-IBM avatar Apr 08 '25 07:04 PeterStaar-IBM

Thanks @PeterStaar-IBM for confirming that there's a way to have this. I will be waiting to try it out.

gformcreation avatar Apr 10 '25 15:04 gformcreation

Experiencing the same behaviour. Are you guys aiming to extract the hyperlinks appearing in text and images, or macros? Sometimes in academic papers, the ORCID for each author is hyperlinked over a logo/macro. It would be so helpful to be able to get this data too.

A paper example: https://arxiv.org/pdf/2504.02024

gbe3hunna avatar Apr 15 '25 08:04 gbe3hunna

Hi @PeterStaar-IBM , thank you for your insightful comment on this issue. I appreciate the work being done here and wanted to check in to see if there have been any updates or further developments.

gformcreation avatar Apr 18 '25 07:04 gformcreation

Hi @PeterStaar-IBM, thanks for your contribution and comments above. When you have any update, please share it with us. Thanks!

antonisprevenas avatar Apr 28 '25 08:04 antonisprevenas

Will do!

PeterStaar-IBM avatar Apr 28 '25 09:04 PeterStaar-IBM

Hi @PeterStaar-IBM Thanks for the whole project. The work being done is awesome! Is there any update on this feature?

sbmyron avatar May 12 '25 08:05 sbmyron

Hi, a similar issue arises when a .docx document contains words in automatic numbering format. Docling-SmolDocling fails to convert these words. For example, in the attached file https://www.3gpp.org/ftp/tsg_ran/WG1_RL1/TSGR1_120b/Docs//R1-2501739.zip, on page 3 under Conclusions, the words "Observations 1, 2, 3" and "Proposals 1, 2, 3, 4" are in macros.

Image

Docling outputs this (all correct, except that the words in macros are missing):

Image

It would be great if this could be addressed, as there are millions of such documents (telecommunication standards) that many people use, and they need to be converted to .md format.

alexshmmy avatar May 17 '25 10:05 alexshmmy

Hi team, I wanted to check whether there has been any progress on this?

gformcreation avatar May 23 '25 09:05 gformcreation

Hello everyone! I'm interested in this feature too. Any update on this?

LDelPinoNT avatar Jul 18 '25 09:07 LDelPinoNT

I'd love the hyperlinks to be preserved too.

Lodimup avatar Aug 11 '25 09:08 Lodimup

Hi Team, I wanted to follow up and see if we had any luck on this?

anjirkviiit avatar Aug 29 '25 08:08 anjirkviiit

Hi @PeterStaar-IBM , I'm also facing the same issue with PDF to Markdown conversion. The hyperlinks are not preserved - only the display text shows up, but the actual URLs are lost. Could you please share any updates on the current status of this enhancement? Thanks!

akshata-gawali avatar Sep 05 '25 06:09 akshata-gawali
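
Until the links are propagated to the DoclingDocument, one possible stopgap for PDFs is to read the link annotations directly and keep them next to the Docling output. The snippet below is only a sketch under assumptions, not Docling's API: it assumes the third-party pypdf package is used to open the file, the helper names `_resolve` and `iter_link_uris` are made up for illustration, and `paper.pdf` is a placeholder path. It lists the URLs per page; it does not put them back into the exported Markdown, and it skips internal (non-URI) links.

```python
from typing import Any, Iterator, Tuple


def _resolve(obj: Any) -> Any:
    """Follow a pypdf indirect reference; other objects pass through unchanged."""
    getter = getattr(obj, "get_object", None)
    return getter() if callable(getter) else obj


def iter_link_uris(reader: Any) -> Iterator[Tuple[int, str]]:
    """Yield (1-based page number, URI) for every URI link annotation.

    `reader` is assumed to be a pypdf PdfReader, or any object whose
    `pages` attribute is a sequence of dict-like page objects.
    """
    for page_no, page in enumerate(reader.pages, start=1):
        annots = _resolve(page.get("/Annots")) or []
        for ref in annots:
            annot = _resolve(ref)
            if annot.get("/Subtype") != "/Link":
                continue  # not a link annotation (e.g. a comment or highlight)
            action = _resolve(annot.get("/A")) or {}
            uri = action.get("/URI")
            if uri is not None:  # internal links have no /URI entry
                yield page_no, str(uri)


# Usage with pypdf (third-party, assumed installed via `pip install pypdf`):
#   from pypdf import PdfReader
#   for page_no, uri in iter_link_uris(PdfReader("paper.pdf")):
#       print(page_no, uri)
```

Each annotation also carries a `/Rect` entry with its position on the page, which could in principle be matched against text positions to recover the display text, but that matching is not attempted here.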