docling icon indicating copy to clipboard operation
docling copied to clipboard

feat(html): Support in-line anchor tags in HTML texts

Open krrome opened this issue 7 months ago • 3 comments

Description

This PR adapts the HTML backend to read hrefs from anchor tags anywhere in the document. Hyperlinks that exist only for fractions of a given text item will be converted as in-line groups. In this PR currently only the handle_paragraph method of the backend supports the generation of in-line.

This approach can easily be extended to handle also <b> and <i> tags.

Changes

  • HTML backend: explicit implementation of recursive text extraction in order to capture all anchor tags no matter how deeply they are present in the HTML tree
  • Hyperlinks are converted into in-line groups for paragraphs in DoclingDocuments
  • Added additional tests and adapted existing tests to fit the newly parsed hyperlinks

Shortcomings:

For headings, list elements, figure captions etc. the hyperlink will be extended to the entire text, as it is unclear to me how I could create an in-line group for those cases. Also, Images currently don't support a hyperlink in DoclingDocument, hence it is not parsed / converted.

Related PRs:

https://github.com/docling-project/docling/pull/1402: Also reads anchor tags, but not inside paragraphs or any other "handled" tag, whereas the approach presented in this PR does recognise anchor tags no matter how deeply the tag is hidden inside the html tree. https://github.com/docling-project/docling/pull/1411: refactors a lot of the HTML backend to make it shorter and simpler. This PR here might have to be adapted in case https://github.com/docling-project/docling/pull/1411 gets merged first, but I don't see that as a big problem.

krrome avatar May 25 '25 16:05 krrome

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • [X] #approved-reviews-by >= 2

mergify[bot] avatar May 25 '25 16:05 mergify[bot]

@krrome Thanks for this valuable contribution.

As you pointed out, there is a PR with some refactoring and we would like to close it as soon as possible. The plan would be:

  1. PR #1411
  2. PR #1659 (this PR)
  3. New PR that parses styles like bold, italic, or underline, which are already supported by the Docling's schema.

ceberam avatar May 26 '25 07:05 ceberam

@ceberam great. Thank you for your quick reply. I will keep an eye on https://github.com/docling-project/docling/pull/1411.

krrome avatar May 26 '25 08:05 krrome

@krrome Are you still interested in pursuing this PR? We just unblocked the previous one, so it would be now a good time to rebase this PR and review it. Please, let us know if you need any support.

ceberam avatar Jul 22 '25 11:07 ceberam

@ceberam Thank you. Yes, I am happy to work on this. I'll rebase the branch etc.

krrome avatar Jul 22 '25 11:07 krrome

A small update on my progress, I have had time to look into rebasing, but since the PR was lying around for such a long time that the main branch has changed so much, that I will essentially have to rethink and re-implement everything. :/ I am still willing to push this through, but only if the PR can then be reviewed, merged/rejected in a timely manner. Please let me know if this likely possible, otherwise I will abandon this PR. Thank you

krrome avatar Jul 24 '25 09:07 krrome

@krrome we believe that the HTML backend needs to support more features, since these are already supported in other backends. So your effort would be highly appreciated and we will try to be respond as quickly as possible this time.

ceberam avatar Jul 24 '25 09:07 ceberam

DCO Check Passed

Thanks @krrome, all your commits are properly signed off. 🎉

github-actions[bot] avatar Jul 30 '25 11:07 github-actions[bot]

I re-implemented hyperlink parsing, but there are a few problems I have identified:

  • DoclingDocument.add_picture is missing a hyperlink tag
  • TableCell does not support hyperlinks
  • There is a bug converting inline groups of list elements - see test results
  • Paragraph elements used to be split by "\n" when accessing text through tag.text. There were no examples of those cases in the test-data and it gave wrong results for my hyperlink_01.html testfile, so I made sure that at least that one was converted to markdown correctly.

Also I will try to fix the git commit messages today so that these checks pass here.

krrome avatar Jul 30 '25 11:07 krrome

@ceberam FYI the PR is ready for review

krrome avatar Jul 30 '25 14:07 krrome

Codecov Report

:x: Patch coverage is 94.51220% with 9 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
docling/backend/html_backend.py 94.51% 9 Missing :warning:

:loudspeaker: Thoughts on this report? Let us know!

codecov[bot] avatar Jul 30 '25 18:07 codecov[bot]

Thanks @krrome for the re-implementation. Regarding your comments and observations:

DoclingDocument.add_picture is missing a hyperlink tag

Correct. At this point, I would suggest creating a TextItem with the hyperlink and attach it to the PictureItem through the caption argument of add_picture. Like in the current implementation with _emit_image, but adding the hyperlink.

TableCell does not support hyperlinks

Correct. We are in a process of defining a richer data model for table elements. I would suggest leaving the hyperlinks in table cells out on this PR.

There is a bug converting inline groups of list elements - see test results

Could you please indicate which test and with which test document it fails?

ceberam avatar Jul 30 '25 18:07 ceberam

@ceberam thank you for your comments.

DoclingDocument.add_picture is missing a hyperlink

I will implement a temporary solution as you suggested

There is a bug converting inline groups of list elements - see test results

It can be seen here: https://github.com/docling-project/docling/pull/1659/files#diff-7ed8d3634aaf1a2658d281d1858a55c3c24cf0747b9882bb62a9d3d3a6d812f2

But just as I was writing a bug Issue I noticed in the corresponding itxt that doc.add_list_item seems to put every new list item into its own group, so I will have to see if by avoiding that (if at all possible) I can solve the issue.

krrome avatar Jul 31 '25 05:07 krrome

@ceberam fantastic, thanks for the example with the inline group inside list item. I tackled all issues pointed out by you, except the original_url. I will respond to separately in your message.

I have also implemented the hyperlinked images as you have suggested, but I ran into following problems:

  1. the caption attribute of doc.add_picture accepts an inline group and I guess the Markdown can be rendered, but it can't be serialised or something like that. In any case the e2e tests fail. So I reverted to converting the inline group text into a single text elenent and assigning hyperlink to all of it, but
  2. then the markdown of the caption is rendered without hyperlinks. So that basically defeats the whole purpose of this hack.

I did not investigate further because this is anyways a hack until image hyperlinks are supported...

Regarding other tags like bold and italic:

This is a minor change to the code of this PR. I am happy to add that here. There is one thing to keep in mind though - in my experience (I am not a fronte-end dev) styles like bold/italic are often managed through CSS so I don't know how far reaching this addition would be, but it's definitly a low hanging fruit.

krrome avatar Aug 01 '25 09:08 krrome

I have also implemented the hyperlinked images as you have suggested, but I ran into following problems:

1. the caption attribute of doc.add_picture accepts an inline group and I guess the Markdown can be rendered, but it can't be serialised or something like that. In any case the e2e tests fail. So I reverted to converting the inline group text into a single text elenent and assigning hyperlink to all of it, but

2. then the markdown of the caption is [rendered without hyperlinks](https://github.com/docling-project/docling/pull/1659/files#diff-ea901afab539e9e995b5ad9c1c97c7d7cc4102bdd4ebf424769570c98fc4844c). So that basically defeats the whole purpose of this hack.

You can leave it this way for now until we clarify the hyperlinks on images (we may open a new issue based on your second point, since this is clearly a bug).

Regarding other tags like bold and italic:

This is a minor change to the code of this PR. I am happy to add that here. There is one thing to keep in mind though - in my experience (I am not a fronte-end dev) styles like bold/italic are often managed through CSS so I don't know how far reaching this addition would be, but it's definitly a low hanging fruit.

To keep it agile, I would rather do it in another PR. We already created an issue on this topic and you are more than welcome to take it over (#1845 ). I totally agree on your comment regarding CSS, but well, if we see tags like <b>, <i>, <s>, it is indeed a low hanging fruit to provide formatting.

ceberam avatar Aug 04 '25 06:08 ceberam

Cool, if I have understood correctly then there are no more open tasks on my side at the moment and we are just waiting for one more review, is that correct?

Also, I am happy to take https://github.com/docling-project/docling/issues/1845 as it fits nicely with the changes introduced by this PR, but I will wait until this PR has been merged.

krrome avatar Aug 04 '25 08:08 krrome

I think the only thing that's missing is a Review by @dolfim-ibm, is that correct? I think it would be nice if we could get this PR to a merge soon. Thanks :)

krrome avatar Aug 12 '25 05:08 krrome