pptx: improve list-item detection
Problem
partition_pptx() misses some bulleted-list items, detects no numbered-list items at all, and does not capture list-level metadata (metadata.category_depth) for the list items it does find.
For example, this slide (pptx file attached):
should produce:
ListItem
ListItem
ListItem
ListItem
ListItem
ListItem
Title
but produces this instead:
Title
Title
Title
Title
NarrativeText
NarrativeText
Title
Solution
Extend _PptxPartitioner._is_bulleted_paragraph() into ._is_list_item() and include numbered list-items in addition to bulleted-list items.
Numbered-list items are indicated by ./a:pPr/a:buAutoNum rather than .../a:buChar. Also, bulleted-list items can be indicated by the presence of a:lstStyle in the text-frame with no other indication on the paragraph (the default bullet-char etc. is inherited from the style hierarchy). Also account for the a:buNone exception, which explicitly marks a paragraph as not a list item.
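A minimal sketch of the paragraph-level check described above, using only the standard library so it runs without python-pptx. It relies on a:buAutoNum, a:buChar and a:buNone being children of the paragraph-properties element a:pPr in DrawingML. The names explicit_marker() and parse_p() are hypothetical helpers for illustration, not part of _PptxPartitioner.

```python
import xml.etree.ElementTree as ET
from typing import Optional

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
NS = {"a": A_NS}


def explicit_marker(p: ET.Element) -> Optional[str]:
    """Return the marker declared directly on an `a:p` element, if any.

    "numbered" -> a:pPr/a:buAutoNum
    "bulleted" -> a:pPr/a:buChar
    "none"     -> a:pPr/a:buNone (explicitly not a list item)
    None       -> nothing declared here; any marker would be inherited
    """
    pPr = p.find("a:pPr", NS)
    if pPr is None:
        return None
    # Check the opt-out first so it wins over any other marker element.
    if pPr.find("a:buNone", NS) is not None:
        return "none"
    if pPr.find("a:buAutoNum", NS) is not None:
        return "numbered"
    if pPr.find("a:buChar", NS) is not None:
        return "bulleted"
    return None


def parse_p(inner: str) -> ET.Element:
    """Hypothetical helper: wrap `inner` in an a:p element and parse it."""
    return ET.fromstring(f'<a:p xmlns:a="{A_NS}">{inner}</a:p>')


print(explicit_marker(parse_p('<a:pPr><a:buAutoNum type="arabicPeriod"/></a:pPr>')))  # numbered
print(explicit_marker(parse_p('<a:pPr><a:buChar char="-"/></a:pPr>')))                # bulleted
print(explicit_marker(parse_p('<a:pPr><a:buNone/></a:pPr>')))                         # none
print(explicit_marker(parse_p('<a:r><a:t>no pPr at all</a:t></a:r>')))                # None
```

A None result here only means the paragraph declares nothing itself; it may still be a list item through inheritance, which is why this check alone is not sufficient.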
Capture list-level to include in metadata.category_depth (ordinal, top depth = 0). Consider using ._list_level(paragraph: Paragraph) -> Optional[int] to detect list items, where None means "not a list item".
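One possible shape for the list-level idea, sketched against raw XML with the standard library rather than python-pptx objects. The function list_levels() and the sample TX_BODY are hypothetical. The rule that an a:lstStyle child on the text frame implies inherited list formatting is an assumption taken from the description above; whether an empty a:lstStyle should count is left open.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"
NS = {"a": A_NS, "p": P_NS}
_MARKERS = ("a:buChar", "a:buAutoNum")


def list_levels(tx_body: ET.Element) -> List[Optional[int]]:
    """Return one entry per paragraph: 0-based depth, or None if not a list item."""
    # Assumption from the issue text: a:lstStyle on the text frame means
    # paragraphs inherit list formatting unless they opt out with a:buNone.
    frame_is_list = tx_body.find("a:lstStyle", NS) is not None
    levels: List[Optional[int]] = []
    for p in tx_body.findall("a:p", NS):
        pPr = p.find("a:pPr", NS)
        has_marker = pPr is not None and any(
            pPr.find(m, NS) is not None for m in _MARKERS
        )
        opted_out = pPr is not None and pPr.find("a:buNone", NS) is not None
        is_item = not opted_out and (has_marker or frame_is_list)
        # `lvl` is omitted for top-level paragraphs, so default to depth 0.
        depth = int(pPr.get("lvl", "0")) if pPr is not None else 0
        levels.append(depth if is_item else None)
    return levels


TX_BODY = f"""
<p:txBody xmlns:a="{A_NS}" xmlns:p="{P_NS}">
  <a:bodyPr/>
  <a:lstStyle/>
  <a:p><a:r><a:t>Inherited bullet</a:t></a:r></a:p>
  <a:p><a:pPr lvl="1"><a:buAutoNum type="arabicPeriod"/></a:pPr><a:r><a:t>Numbered sub-item</a:t></a:r></a:p>
  <a:p><a:pPr><a:buNone/></a:pPr><a:r><a:t>Plain paragraph</a:t></a:r></a:p>
</p:txBody>
""".strip()

print(list_levels(ET.fromstring(TX_BODY)))  # [0, 1, None]
```

Returning Optional[int] lets a single call answer both questions: a non-None value says the paragraph is a list item and supplies the value for metadata.category_depth, while None routes the paragraph to the usual Title/NarrativeText classification.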
Context
- In PowerPoint, so-called "bullet-slides" are very common, surely the most common slide type.
- A bullet slide can contain both bulleted and numbered list-items as well as non-list-item paragraphs. All three can be mixed in the same list.
- Lists have multiple levels, i.e. top-level bullets, sub-bullets, etc.
- Numbered list-items are indicated differently than bulleted list-items and non-list-item paragraphs (which must be identified explicitly when they appear in a list-format text-frame).
- List-item status can be "inherited" from the style hierarchy, in particular from the text-frame a paragraph appears in, such that the list-item paragraph has no direct indication that it displays a bullet or number. Detection that relies on such attributes at the paragraph level is unreliable.