
fix: Improve markdown list parser

Open tobiasstrebitzer opened this issue 9 months ago • 1 comments

Description of the changes:

  • Improve MarkdownDocumentBackend to set the correct list type to ordered or unordered
  • Improve MarkdownDocumentBackend to parse all child elements of a list item
  • Improve MarkdownDocumentBackend to fix exception when a list item doesn't contain any children.
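To illustrate the first and third points, here is a minimal stdlib-only sketch of telling ordered from unordered list markers while tolerating items with no content. It is not the code in MarkdownDocumentBackend (which works on the Marko AST); the names `ListItem` and `parse_list_line` are hypothetical, and the marker rules are a simplified reading of CommonMark.

```python
import re
from dataclasses import dataclass

# Hypothetical helpers for illustration; not part of docling.
# Bullet markers are -, + or *; ordered markers are 1-9 digits plus "." or ")".
BULLET_RE = re.compile(r"^\s{0,3}([-+*])(?:\s+(.*))?$")
ORDERED_RE = re.compile(r"^\s{0,3}(\d{1,9})([.)])(?:\s+(.*))?$")


@dataclass
class ListItem:
    ordered: bool
    text: str  # empty string when the item has no content


def parse_list_line(line: str):
    """Return a ListItem for a single list line, or None if it is not one."""
    m = ORDERED_RE.match(line)
    if m:
        # group(3) is None for an empty item such as "2)", so fall back to "".
        return ListItem(ordered=True, text=m.group(3) or "")
    m = BULLET_RE.match(line)
    if m:
        return ListItem(ordered=False, text=m.group(2) or "")
    return None
```

An empty item such as `-` yields a `ListItem` with empty text instead of raising, which mirrors the "no children" case fixed by this PR.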

Known limitations:

  • The markdown parser still needs work; I believe there are a vast number of edge cases that it does not currently handle.

Issue resolved by this Pull Request:

Resolves #913 Resolves #851

Checklist:

  • [x] Documentation has not been updated (no changes).
  • [x] Examples have not been added (not necessary).
  • [x] Tests have been added and truth data updated (ordered list example).
  • [x] Code style and formatting applied via pre-commit hook.

tobiasstrebitzer avatar Feb 23 '25 05:02 tobiasstrebitzer

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewers for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • [ ] #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
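For reference, the title rule can be checked locally with a short sketch like the one below. The pattern is copied from the rule above; the function name `is_conventional` is hypothetical, and the sketch assumes the rule behaves like Python's `re` matching.

```python
import re

# Pattern copied verbatim from the merge-protection rule.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.match(title) is not None
```

The title of this PR, `fix: Improve markdown list parser`, passes this check; note that the type is case-sensitive, so `Fix: ...` would not.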

mergify[bot] avatar Feb 23 '25 05:02 mergify[bot]

Hi @tobiasstrebitzer

Thanks for your PR.

For some recent context, with docling-core v2.21.0, we introduced a couple of relevant improvements:

  1. inline groups, for capturing things like the inline code in #913, i.e. groups whose contents are to be printed on a single line
  2. Markdown export logic is more structured

Now I see you were initially targeting both #851 & #913 with this PR.

  • #913 is IMO relatively straightforward
  • #851 ~~on the other hand may still require some more minor conceptual work on our side to be properly addressed.~~

With all that in mind, and to keep PRs as local as possible, I propose focusing this PR only on #913 and solving it with the newly introduced inline groups (example) as needed, i.e. by differentiating between block and inline components (e.g. code block vs. inline code), assuming Marko makes this differentiation possible.

This would definitely be a nice feature to have!

Would you like to take it up since you already started working on this?
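To make the block-versus-inline distinction concrete, here is a rough stdlib-only sketch that separates fenced code blocks from inline code spans. It deliberately uses neither Marko nor the docling-core inline-group API; `classify_code` is a hypothetical name, and the fence handling is a simplification of CommonMark (indented code blocks and multi-backtick spans are ignored).

```python
import re

# Hypothetical illustration; not Marko or docling-core code.
FENCE_RE = re.compile(r"^\s{0,3}(`{3,}|~{3,})")
INLINE_CODE_RE = re.compile(r"(?<!`)`([^`\n]+)`(?!`)")


def classify_code(markdown: str):
    """Return ("block", text) for fenced code and ("inline", text) for code spans."""
    results = []
    fence = None  # opening fence marker while inside a fenced block
    block_lines = []
    for line in markdown.splitlines():
        m = FENCE_RE.match(line)
        if fence is None:
            if m:
                # Opening fence: start collecting block content.
                fence = m.group(1)
                block_lines = []
            else:
                # Outside a block: any backtick-delimited span is inline code.
                results.extend(("inline", s) for s in INLINE_CODE_RE.findall(line))
        elif m and m.group(1)[0] == fence[0] and len(m.group(1)) >= len(fence):
            # Closing fence: same character, at least as long as the opener.
            results.append(("block", "\n".join(block_lines)))
            fence = None
        else:
            block_lines.append(line)
    return results
```

In a real backend, the "inline" results would map to items inside an inline group and the "block" results to standalone code items.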

vagenas avatar Feb 27 '25 13:02 vagenas

@tobiasstrebitzer following up on my previous message, are you interested in taking up the #913 part as discussed above? Otherwise, we'll close this as stale by the end of the week.

vagenas avatar Mar 17 '25 09:03 vagenas

Closing this issue, since it is tracked in #913 and new Markdown serializers in docling-core address the other aspects.

cau-git avatar Mar 25 '25 11:03 cau-git