
PPTX parsing: bullet points not grouped correctly under subheadings

Open • ceberam opened this issue 1 month ago • 1 comment

Discussed in https://github.com/docling-project/docling/discussions/1323

Originally posted by harskuma on April 8, 2025

While working with PPTX files, I came across a formatting issue that could use some enhancement. Specifically, when a slide contains multiple subheadings, each with its own bullet points, the parsed output doesn’t maintain the correct grouping of bullet points under their respective subheadings.

Current version: Docling 2.28.4

For example, consider a slide like this: [image: Screenshot 2025-04-08 at 9 00 29 AM]

Currently, the extracted output looks like this: [image: Screenshot 2025-04-08 at 9 04 07 AM]

As shown in the attached screenshot, all bullet points are getting grouped under the first subheading, and the second subheading appears without its associated content.

Suggested Enhancement: It would be helpful to enhance the PPTX parsing logic to:

  • Maintain bullet point association with the correct subheading
  • Possibly use text box position, text style, or slide structure hierarchy to infer grouping
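
To make the position-based idea concrete, here is a minimal sketch of one possible heuristic: attach each bullet text box to the closest subheading that sits above it in roughly the same column. Everything in it is an assumption for illustration only. `TextBox`, `group_by_position`, and `column_tolerance` are hypothetical names, not part of Docling or python-pptx, and the coordinates are arbitrary units rather than real slide measurements.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """Hypothetical stand-in for a slide text box (not a Docling or python-pptx type)."""
    text: str
    left: int  # horizontal offset of the box, arbitrary units
    top: int   # vertical offset of the box, arbitrary units
    is_heading: bool = False


def group_by_position(boxes, column_tolerance=50):
    """Attach each non-heading box to the closest heading above it in the same column."""
    headings = [b for b in boxes if b.is_heading]
    # Keyed by heading text, so this sketch assumes heading texts are unique per slide.
    groups = {h.text: [] for h in headings}
    orphans = []
    for box in boxes:
        if box.is_heading:
            continue
        # Candidate headings start at or above the box and are horizontally aligned with it.
        candidates = [
            h for h in headings
            if h.top <= box.top and abs(h.left - box.left) <= column_tolerance
        ]
        if not candidates:
            orphans.append(box.text)
            continue
        # Among the candidates, the lowest one on the slide is the nearest heading above.
        nearest = max(candidates, key=lambda h: h.top)
        groups[nearest.text].append(box.text)
    return groups, orphans


# Two-column slide: each subheading has its own bullets underneath.
boxes = [
    TextBox("Benefits", left=100, top=100, is_heading=True),
    TextBox("Risks", left=600, top=100, is_heading=True),
    TextBox("Faster parsing", left=100, top=200),
    TextBox("Lower cost", left=100, top=300),
    TextBox("Layout drift", left=600, top=200),
]
groups, orphans = group_by_position(boxes)
print(groups)
print(orphans)
```

A real implementation would still need a way to decide which boxes are subheadings (for example from font size or weight), and a fixed tolerance will not suit every layout.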

ceberam avatar Nov 20 '25 10:11 ceberam

The current PPTX parser in Docling processes each slide by walking through its shapes in visual order, but it does not infer or preserve any hierarchy between subheadings and bullet points. All bullet points detected within a shape are grouped together under the parent slide, with no explicit association to nearby subheadings. The parser only checks for bullet markers or indentation; it does not use spatial position, text style, or slide structure to link bullets to subheadings. That is why all bullets end up under the first subheading while the other subheadings are left empty. Fixing this would require the parsing logic to analyze text box positions, styles, or grouping on the slide and infer which bullets belong to which subheading. The relevant code and logic are in the MsPowerpointDocumentBackend.
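
For the case where subheadings and bullets live in the same shape, a style-based pass over the paragraphs is one option. The sketch below is only an illustration of that idea, not the logic in MsPowerpointDocumentBackend: `Paragraph` and `group_paragraphs` are hypothetical names, and it assumes a subheading can be recognized as a bold paragraph without a bullet marker.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    """Hypothetical paragraph record; the fields are illustrative, not Docling's model."""
    text: str
    is_bullet: bool
    bold: bool = False


def group_paragraphs(paragraphs):
    """Open a new group at each bold, non-bullet paragraph and attach later bullets to it."""
    groups = []
    current = None
    for p in paragraphs:
        if not p.is_bullet and p.bold:
            # Assumed subheading: start a fresh group.
            current = {"heading": p.text, "bullets": []}
            groups.append(current)
        elif p.is_bullet:
            if current is None:
                # Bullets that appear before any subheading get a heading-less group.
                current = {"heading": None, "bullets": []}
                groups.append(current)
            current["bullets"].append(p.text)
        # Plain body text (neither bold nor bullet) is ignored in this sketch.
    return groups


paragraphs = [
    Paragraph("Goals", is_bullet=False, bold=True),
    Paragraph("Ship the parser fix", is_bullet=True),
    Paragraph("Add regression tests", is_bullet=True),
    Paragraph("Open questions", is_bullet=False, bold=True),
    Paragraph("How to detect subheadings", is_bullet=True),
]
result = group_paragraphs(paragraphs)
print(result)
```

The key difference from a flat list is the `current` pointer: each detected subheading resets it, so bullets can no longer accumulate under the first subheading.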


dosubot[bot] avatar Nov 20 '25 10:11 dosubot[bot]