PPTX parsing: bullet points not grouped correctly under subheadings
Discussed in https://github.com/docling-project/docling/discussions/1323
Originally posted by harskuma on April 8, 2025

While working with PPTX files, I came across a formatting issue that could use some enhancement. When a slide contains multiple subheadings, each with its own bullet points, the parsed output does not keep the bullet points grouped under their respective subheadings.

Current version: Docling 2.28.4
For example, consider a slide like this:
Currently, the extracted output looks like this:
As shown in the attached screenshot, all bullet points are getting grouped under the first subheading, and the second subheading appears without its associated content.
Suggested Enhancement: It would be helpful to enhance the PPTX parsing logic to:
- Maintain bullet point association with the correct subheading
- Possibly use text box position, text style, or slide structure hierarchy to infer grouping
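As a rough illustration of the position-based idea, here is a minimal sketch that attaches each bullet text box to the closest subheading above it in the same column. It is not Docling's implementation: `TextBox`, `group_by_position`, and `column_tolerance` are hypothetical names, and the sketch assumes the subheadings have already been identified (for example by text style). With python-pptx, the `left` and `top` values could be read from each shape's `left` and `top` properties.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """Hypothetical stand-in for a slide shape; not a Docling or python-pptx type."""
    text: str
    left: int  # horizontal offset of the box (e.g. in EMU)
    top: int   # vertical offset of the box (e.g. in EMU)
    is_heading: bool = False


def group_by_position(boxes, column_tolerance=100_000):
    """Attach each non-heading box to the nearest heading above it in the same column.

    Returns (groups, orphans): groups is a list of (heading_text, [bullet_texts])
    in the order the headings were given; orphans are boxes with no heading above.
    """
    headings = [b for b in boxes if b.is_heading]
    members = {id(h): [] for h in headings}
    orphans = []
    for box in boxes:
        if box.is_heading:
            continue
        # Candidate headings: roughly the same left edge, and not below the box.
        candidates = [
            h for h in headings
            if abs(h.left - box.left) <= column_tolerance and h.top <= box.top
        ]
        if not candidates:
            orphans.append(box.text)
            continue
        # The lowest heading that is still above the box is the closest one.
        nearest = max(candidates, key=lambda h: h.top)
        members[id(nearest)].append(box.text)
    groups = [(h.text, members[id(h)]) for h in headings]
    return groups, orphans


boxes = [
    TextBox("Benefits", left=0, top=0, is_heading=True),
    TextBox("Risks", left=5_000_000, top=0, is_heading=True),
    TextBox("Fast", left=0, top=500_000),
    TextBox("Cheap", left=0, top=900_000),
    TextBox("Vendor lock-in", left=5_000_000, top=500_000),
]
groups, orphans = group_by_position(boxes)
```

The tolerance value is an arbitrary choice for this example; a real heuristic would probably need to compare box widths or overlap rather than only the left edge.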
The current PPTX parser in Docling processes each slide by walking through its shapes in visual order, but it does not infer or preserve any hierarchy between subheadings and bullet points. All bullet points detected within a shape are grouped together under the parent slide, without explicit association to nearby subheadings.

This happens because the parser only checks for bullet markers or indentation and does not use spatial position, text style, or slide structure to link bullets to subheadings. That explains why all bullets end up under the first subheading and the others are left empty.

Improving this would require enhancing the parsing logic to analyze text box positions, styles, or grouping on the slide to better infer which bullets belong to which subheading. You can see the relevant code and logic in the MsPowerpointDocumentBackend.
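For the case where subheadings and bullets sit in the same shape, a simpler reading-order rule may already help: every subheading starts a new group, and each following bullet is attached to the most recent subheading. The sketch below is only an illustration under stated assumptions, not the backend's actual logic. `group_in_reading_order` is a hypothetical helper, and it assumes that a paragraph at indentation level 0 is a subheading and anything deeper is a bullet. In python-pptx that level is available as a paragraph's `level` property, but real slides may mark subheadings by font style instead.

```python
def group_in_reading_order(paragraphs):
    """Group bullets under the most recent subheading.

    paragraphs: list of (text, level) tuples in reading order.
    Assumption: level 0 marks a subheading, level > 0 marks a bullet.
    Returns a list of (heading_text_or_None, [bullet_texts]).
    """
    groups = []
    for text, level in paragraphs:
        if level == 0:
            # A subheading opens a new, initially empty group.
            groups.append((text, []))
        elif groups:
            # A bullet belongs to the group opened most recently.
            groups[-1][1].append(text)
        else:
            # Bullets that appear before any subheading get a heading of None.
            groups.append((None, [text]))
    return groups


slide_paragraphs = [
    ("Benefits", 0),
    ("Fast", 1),
    ("Cheap", 1),
    ("Risks", 0),
    ("Vendor lock-in", 1),
]
grouped = group_in_reading_order(slide_paragraphs)
```

This rule only works when the reading order is reliable; for side-by-side text boxes, position information would still be needed.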