
feat: new HTML backend that handles styled html as well as images

Open vaaale opened this issue 8 months ago • 7 comments

  • Updated unit tests
  • Added documentation (Example notebook)

Note: MyPy fails. Seems to be a known issue with BeautifulSoup: https://github.com/python/typeshed/pull/13604

Checklist:

  • [x] Documentation has been updated, if necessary.
  • [x] Examples have been added, if necessary.
  • [x] Tests have been added, if necessary.

vaaale avatar Apr 17 '25 11:04 vaaale

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • [ ] #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
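To check a title locally before pushing, the rule's pattern can be tried in Python. This is a minimal sketch: the pattern is copied from the rule above, the helper name `follows_convention` is made up for illustration, and it assumes Python's `re` module behaves closely enough to the matcher used by the bot.

```python
import re

# Pattern copied verbatim from the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def follows_convention(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix.

    Hypothetical helper for local checks; not part of any bot or library.
    """
    return CONVENTIONAL_TITLE.match(title) is not None


# The title of this PR, a scoped breaking-change title, and a plain title.
print(follows_convention("feat: new HTML backend that handles styled html as well as images"))
print(follows_convention("feat(html)!: rework backend"))
print(follows_convention("New HTML backend"))
```

The optional `(?:\(.+\))?` group allows a scope such as `feat(html):`, and the optional `!` marks a breaking change.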

mergify[bot] avatar Apr 17 '25 11:04 mergify[bot]

Any updates on this? Anything I need to do?

vaaale avatar May 03 '25 08:05 vaaale

> Any updates on this? Anything I need to do?

@vaaale we have done some changes recently. Could you please rebase to main and ensure that there are no conflicts in your branch?

ceberam avatar May 06 '25 16:05 ceberam

No conflicts

vaaale avatar May 08 '25 18:05 vaaale

> No conflicts

@vaaale The branch has conflicts that prevent merging. Reviewing the PR is also more challenging without fixing those conflicts. Please see the section "This branch has conflicts that must be resolved" on this PR page.

ceberam avatar May 12 '25 07:05 ceberam

Dear @vaaale, gentle reminder: could you finalize this work so we can merge it in?

ceberam avatar May 19 '25 08:05 ceberam

Hi Cesar. Sorry for the slow reaction.

I think I have resolved the conflicts now, so everything should be in order.

Let me know if you need anything else.

Kind regards

vaaale avatar May 19 '25 16:05 vaaale

Hi, is there anything I can do to help get this PR merged? I would quite like to continue work on https://github.com/docling-project/docling/pull/1659, but it is blocked until this PR has been merged. Best, Roman

krrome avatar Jun 13 '25 10:06 krrome

> Hi, is there anything I can do to help get this PR merged? I would quite like to continue work on https://github.com/docling-project/docling/pull/1659, but it is blocked until this PR has been merged. Best, Roman

Hey Roman @krrome, thanks again for your collaboration on the Docling project, and sorry for keeping you on hold. We plan to unblock this PR early this week, as well as the other PRs that are directly related.

ceberam avatar Jun 23 '25 07:06 ceberam

@vaaale, we are finally getting back to this topic.

We did an extensive review of this PR. Since some new code has been merged since our last communication, we have taken ownership of the effort of rebasing the branch to the latest main version. However, we decided to abandon this effort for these reasons:

  • To include the image processing feature, the PR suggests a new constructor with an extra argument that is not compatible with the abstract class DeclarativeDocumentBackend. Such customization would require changes that go beyond the HTML backend.
  • The refactoring of the HTML backend's walk method makes the code clearer and simpler to read. However, this refactoring has removed a main feature of the Docling library: the hierarchical structure of the parsed document. The proposed code attaches all the items (except list items) to a single item (the body), thus flattening the entire document hierarchy.
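
To make the second point concrete, here is a small sketch of the difference between attaching everything to the body and keeping the hierarchy. It does not use Docling's API: `Node`, `build_flat`, `build_hierarchical`, and `depth` are hypothetical names chosen only for illustration, and the heading-stack approach is one common way to build such a tree, not necessarily what the backend does.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical document item; not a Docling class."""
    label: str
    level: int = 0  # heading level; 0 means non-heading content
    children: list["Node"] = field(default_factory=list)


def build_flat(items):
    """Attach every item directly to the body (the flattening described above)."""
    body = Node("body")
    for label, level in items:
        body.children.append(Node(label, level))
    return body


def build_hierarchical(items):
    """Attach each item to the nearest enclosing heading, tracked with a stack."""
    body = Node("body")
    stack = [body]  # stack[-1] is the current parent
    for label, level in items:
        node = Node(label, level)
        if level > 0:
            # Close headings at the same or a deeper level before attaching.
            while len(stack) > 1 and stack[-1].level >= level:
                stack.pop()
            stack[-1].children.append(node)
            stack.append(node)
        else:
            stack[-1].children.append(node)
    return body


def depth(node):
    """Number of levels in the tree rooted at node."""
    return 1 + max((depth(child) for child in node.children), default=0)


# (label, heading level) pairs in document order; level 0 is plain content.
items = [("Intro", 1), ("text a", 0), ("Details", 2), ("text b", 0), ("Outro", 1)]
flat = build_flat(items)
tree = build_hierarchical(items)
print(len(flat.children), depth(flat))
print(len(tree.children), depth(tree))
```

In the flat version all five items are direct children of the body, so the information that "text b" belongs under "Details" is lost. In the hierarchical version the body has only the two top-level headings as children, and the rest hangs below them.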

As an alternative, we suggest the following:

  • Creating a new PR #1960 that addresses all the suggested changes of this PR except the image handling. However, note that some simplifications have been rolled back to address the document hierarchy and other corner cases. The new table parser has also been discarded since it did not handle pivot tables properly. Other minor changes have been applied (e.g., unnecessary tag checks and function arguments). I have added you as co-author.
  • Creating a new issue to address the image handling #1963
  • Closing this PR

Please, let us know if you have any further thoughts or remarks.

ceberam avatar Jul 21 '25 11:07 ceberam