feat: new HTML backend that handles styled html as well as images
- Updated unit tests
- Added documentation (Example notebook)
Note: MyPy fails. Seems to be a known issue with BeautifulSoup: https://github.com/python/typeshed/pull/13604
Checklist:
- [x] Documentation has been updated, if necessary.
- [x] Examples have been added, if necessary.
- [x] Tests have been added, if necessary.
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviewer for test updates
This rule is failing.
When test data is updated, we require two reviewers
- [ ] #approved-reviews-by >= 2
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
- [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
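For anyone who wants to check a PR title locally before pushing, the rule above can be approximated with a few lines of Python. This is only a sketch: it assumes the bot's `~=` operator behaves like a regular-expression match anchored by the `^` in the pattern, and the helper name `is_conventional` is hypothetical.

```python
import re

# Pattern copied verbatim from the merge protection rule above.
PATTERN = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return PATTERN.match(title) is not None


# The title of this PR passes; a title without a type prefix does not.
print(is_conventional("feat: new HTML backend that handles styled html as well as images"))
print(is_conventional("New HTML backend"))
```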
Any updates on this? Anything I need to do?
@vaaale we have done some changes recently. Could you please rebase to main and ensure that there are no conflicts in your branch?
No conflicts
@vaaale The branch has conflicts that prevent merging. Reviewing the PR is also more challenging without fixing those conflicts. Please see the section "This branch has conflicts that must be resolved" on this PR page.
Dear @vaaale, gentle reminder: could you finalize this work so we can merge it in?
Hi Cesar. Sorry for the slow reaction.
I think I have resolved the conflicts now, so everything should be in order.
Let me know if you need anything else.
Kind regards
Hi, is there anything I can do to help get this PR merged? I would quite like to continue work on https://github.com/docling-project/docling/pull/1659, but it is blocked until this PR has been merged. Best, Roman
Hey Roman @krrome, thanks again for your collaboration on the Docling project, and sorry for keeping you on hold. We plan to unblock this PR early this week, as well as the other PRs that are directly related.
@vaaale, we are finally getting back to this topic.
We did an extensive review of this PR. Since some new code has been merged since our last communication, we took on the effort of rebasing the branch onto the latest main version. However, we decided to abandon this effort for these reasons:
- To include the image processing feature, the PR suggests a new constructor with an extra argument that is not compatible with the abstract class `DeclarativeDocumentBackend`. Such customization would require changes that go beyond the HTML backend.
- The refactoring of the HTML backend's `walk` method makes the code clearer and simpler to read. However, this refactoring has removed a main feature of the Docling library: the hierarchical structure of the parsed document. The proposed code attaches all the items (except list items) to a single item (the `body`), thus flattening the entire document hierarchy.
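To make the second point concrete, here is a minimal, self-contained sketch of the difference between a flat and a hierarchical parse. It does not use Docling's actual classes or the code from this PR: `Node`, `build_flat`, `build_hierarchical`, and the sample `ITEMS` are all hypothetical names, and the nesting rule (each item is attached to the most recent heading, and a heading closes any open heading of the same or deeper level) is an assumption chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    text: str = ""
    level: int = 0  # heading level; stays 0 for the body and non-heading items
    children: list["Node"] = field(default_factory=list)


# Hypothetical (tag, text) pairs in document order.
ITEMS = [
    ("h1", "Intro"),
    ("p", "First paragraph."),
    ("h2", "Details"),
    ("p", "Second paragraph."),
    ("h1", "Outro"),
    ("p", "Last paragraph."),
]


def build_flat(items):
    """Attach every item directly to the body (the flattened result)."""
    body = Node("body")
    for tag, text in items:
        body.children.append(Node(tag, text))
    return body


def build_hierarchical(items):
    """Nest each item under the most recent enclosing heading."""
    body = Node("body")
    stack = [body]  # currently open headings, outermost first
    for tag, text in items:
        if tag[0] == "h" and tag[1:].isdigit():
            level = int(tag[1:])
            # Close open headings at the same or a deeper level.
            while len(stack) > 1 and stack[-1].level >= level:
                stack.pop()
            node = Node(tag, text, level)
            stack[-1].children.append(node)
            stack.append(node)
        else:
            stack[-1].children.append(Node(tag, text))
    return body


def depth(node):
    """Number of levels in the tree rooted at node."""
    return 1 + max((depth(child) for child in node.children), default=0)


flat = build_flat(ITEMS)
tree = build_hierarchical(ITEMS)
print(len(flat.children), depth(flat))
print(len(tree.children), depth(tree))
```

In the flat version every paragraph and heading is a direct child of the body, so the information about which section a paragraph belongs to is lost; the hierarchical version keeps it.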
As an alternative we suggest the following:
- Creating a new PR #1960 that addresses all the suggested changes of this PR except the image handling. However, note that some simplifications have been rolled back to address the document hierarchy and other corner cases. The new table parser has also been discarded since it did not handle pivot tables properly. Other minor changes have been applied (e.g., unnecessary tag checks and function arguments). I have added you as co-author.
- Creating a new issue to address the image handling #1963
- Closing this PR
Please let us know if you have any further thoughts or remarks.