feat(html): Support in-line anchor tags in HTML texts
Description
This PR adapts the HTML backend to read hrefs from anchor tags anywhere in the document. Hyperlinks that exist only for fractions of a given text item will be converted as in-line groups. In this PR currently only the handle_paragraph method of the backend supports the generation of in-line.
This approach can easily be extended to handle also <b> and <i> tags.
Changes
- HTML backend: explicit implementation of recursive text extraction in order to capture all anchor tags no matter how deeply they are present in the HTML tree
- Hyperlinks are converted into in-line groups for paragraphs in DoclingDocuments
- Added additional tests and adapted existing tests to fit the newly parsed hyperlinks
Shortcomings:
For headings, list elements, figure captions etc. the hyperlink will be extended to the entire text, as it is unclear to me how I could create an in-line group for those cases. Also, Images currently don't support a hyperlink in DoclingDocument, hence it is not parsed / converted.
Related PRs:
https://github.com/docling-project/docling/pull/1402: Also reads anchor tags, but not inside paragraphs or any other "handled" tag, whereas the approach presented in this PR does recognise anchor tags no matter how deeply the tag is hidden inside the html tree. https://github.com/docling-project/docling/pull/1411: refactors a lot of the HTML backend to make it shorter and simpler. This PR here might have to be adapted in case https://github.com/docling-project/docling/pull/1411 gets merged first, but I don't see that as a big problem.
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
- [X]
title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
🟢 Require two reviewer for test updates
Wonderful, this rule succeeded.
When test data is updated, we require two reviewers
- [X]
#approved-reviews-by >= 2
@krrome Thanks for this valuable contribution.
As you pointed out, there is a PR with some refactoring and we would like to close it as soon as possible. The plan would be:
- PR #1411
- PR #1659 (this PR)
- New PR that parses styles like bold, italic, or underline, which are already supported by the Docling's schema.
@ceberam great. Thank you for your quick reply. I will keep an eye on https://github.com/docling-project/docling/pull/1411.
@krrome Are you still interested in pursuing this PR? We just unblocked the previous one, so it would be now a good time to rebase this PR and review it. Please, let us know if you need any support.
@ceberam Thank you. Yes, I am happy to work on this. I'll rebase the branch etc.
A small update on my progress, I have had time to look into rebasing, but since the PR was lying around for such a long time that the main branch has changed so much, that I will essentially have to rethink and re-implement everything. :/ I am still willing to push this through, but only if the PR can then be reviewed, merged/rejected in a timely manner. Please let me know if this likely possible, otherwise I will abandon this PR. Thank you
@krrome we believe that the HTML backend needs to support more features, since these are already supported in other backends. So your effort would be highly appreciated and we will try to be respond as quickly as possible this time.
✅ DCO Check Passed
Thanks @krrome, all your commits are properly signed off. 🎉
I re-implemented hyperlink parsing, but there are a few problems I have identified:
- DoclingDocument.add_picture is missing a hyperlink tag
- TableCell does not support hyperlinks
- There is a bug converting inline groups of list elements - see test results
- Paragraph elements used to be split by "\n" when accessing text through tag.text. There were no examples of those cases in the test-data and it gave wrong results for my hyperlink_01.html testfile, so I made sure that at least that one was converted to markdown correctly.
Also I will try to fix the git commit messages today so that these checks pass here.
@ceberam FYI the PR is ready for review
Codecov Report
:x: Patch coverage is 94.51220% with 9 lines in your changes missing coverage. Please review.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| docling/backend/html_backend.py | 94.51% | 9 Missing :warning: |
:loudspeaker: Thoughts on this report? Let us know!
Thanks @krrome for the re-implementation. Regarding your comments and observations:
DoclingDocument.add_picture is missing a hyperlink tag
Correct. At this point, I would suggest creating a TextItem with the hyperlink and attach it to the PictureItem through the caption argument of add_picture. Like in the current implementation with _emit_image, but adding the hyperlink.
TableCell does not support hyperlinks
Correct. We are in a process of defining a richer data model for table elements. I would suggest leaving the hyperlinks in table cells out on this PR.
There is a bug converting inline groups of list elements - see test results
Could you please indicate which test and with which test document it fails?
@ceberam thank you for your comments.
DoclingDocument.add_picture is missing a hyperlink
I will implement a temporary solution as you suggested
There is a bug converting inline groups of list elements - see test results
It can be seen here: https://github.com/docling-project/docling/pull/1659/files#diff-7ed8d3634aaf1a2658d281d1858a55c3c24cf0747b9882bb62a9d3d3a6d812f2
But just as I was writing a bug Issue I noticed in the corresponding itxt that doc.add_list_item seems to put every new list item into its own group, so I will have to see if by avoiding that (if at all possible) I can solve the issue.
@ceberam fantastic, thanks for the example with the inline group inside list item. I tackled all issues pointed out by you, except the original_url. I will respond to separately in your message.
I have also implemented the hyperlinked images as you have suggested, but I ran into following problems:
- the caption attribute of doc.add_picture accepts an inline group and I guess the Markdown can be rendered, but it can't be serialised or something like that. In any case the e2e tests fail. So I reverted to converting the inline group text into a single text elenent and assigning hyperlink to all of it, but
- then the markdown of the caption is rendered without hyperlinks. So that basically defeats the whole purpose of this hack.
I did not investigate further because this is anyways a hack until image hyperlinks are supported...
Regarding other tags like bold and italic:
This is a minor change to the code of this PR. I am happy to add that here. There is one thing to keep in mind though - in my experience (I am not a fronte-end dev) styles like bold/italic are often managed through CSS so I don't know how far reaching this addition would be, but it's definitly a low hanging fruit.
I have also implemented the hyperlinked images as you have suggested, but I ran into following problems:
1. the caption attribute of doc.add_picture accepts an inline group and I guess the Markdown can be rendered, but it can't be serialised or something like that. In any case the e2e tests fail. So I reverted to converting the inline group text into a single text elenent and assigning hyperlink to all of it, but 2. then the markdown of the caption is [rendered without hyperlinks](https://github.com/docling-project/docling/pull/1659/files#diff-ea901afab539e9e995b5ad9c1c97c7d7cc4102bdd4ebf424769570c98fc4844c). So that basically defeats the whole purpose of this hack.
You can leave it this way for now until we clarify the hyperlinks on images (we may open a new issue based on your second point, since this is clearly a bug).
Regarding other tags like bold and italic:
This is a minor change to the code of this PR. I am happy to add that here. There is one thing to keep in mind though - in my experience (I am not a fronte-end dev) styles like bold/italic are often managed through CSS so I don't know how far reaching this addition would be, but it's definitly a low hanging fruit.
To keep it agile, I would rather do it in another PR. We already created an issue on this topic and you are more than welcome to take it over (#1845 ). I totally agree on your comment regarding CSS, but well, if we see tags like <b>, <i>, <s>, it is indeed a low hanging fruit to provide formatting.
Cool, if I have understood correctly then there are no more open tasks on my side at the moment and we are just waiting for one more review, is that correct?
Also, I am happy to take https://github.com/docling-project/docling/issues/1845 as it fits nicely with the changes introduced by this PR, but I will wait until this PR has been merged.
I think the only thing that's missing is a Review by @dolfim-ibm, is that correct? I think it would be nice if we could get this PR to a merge soon. Thanks :)