
pptx: improve list-item detection

Open scanny opened this issue 2 years ago • 0 comments

Problem

partition_pptx() does not detect all bulleted-list items or any numbered-list items and does not capture list-level metadata (metadata.category_depth) from list items.

For example, this slide (pptx file attached) should produce:

ListItem
ListItem
ListItem
ListItem
ListItem
ListItem
Title

but produces this instead:

Title
Title
Title
Title
NarrativeText
NarrativeText
Title

Solution

Extend _PptxPartitioner._is_bulleted_paragraph() into ._is_list_item() and include numbered list-items in addition to bulleted-list items.

Numbered-list items are indicated by ./a:pPr/a:buAutoNum rather than .../a:buChar. Bulleted-list items can also be indicated solely by the presence of a:lstStyle in the text-frame, with no other indication on the paragraph itself (the default bullet character etc. is inherited from the style hierarchy). Also account for the a:buNone exception, which explicitly marks a paragraph as not a list item.
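A rough sketch of the paragraph-level check, for discussion only. It uses the standard-library ElementTree on raw XML rather than the partitioner's own objects, and assumes the bullet elements sit under the paragraph's a:pPr. The function name `direct_bullet_kind` is hypothetical, not existing API.

```python
import xml.etree.ElementTree as ET
from typing import Optional

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
NS = {"a": A_NS}


def direct_bullet_kind(p: ET.Element) -> Optional[str]:
    """Classify an `a:p` element by what it declares directly (hypothetical helper).

    Returns "none" for a:buNone, "numbered" for a:buAutoNum, "bulleted" for
    a:buChar, and None when the paragraph itself declares nothing.
    """
    pPr = p.find("a:pPr", NS)
    if pPr is None:
        return None
    # a:buNone is checked first because it explicitly switches the bullet off.
    if pPr.find("a:buNone", NS) is not None:
        return "none"
    if pPr.find("a:buAutoNum", NS) is not None:
        return "numbered"
    if pPr.find("a:buChar", NS) is not None:
        return "bulleted"
    return None


numbered = ET.fromstring(
    f'<a:p xmlns:a="{A_NS}"><a:pPr><a:buAutoNum type="arabicPeriod"/></a:pPr></a:p>'
)
print(direct_bullet_kind(numbered))
```

A None result is not the same as "not a list item": the paragraph may still inherit its bullet from the style hierarchy, so this check alone is not sufficient.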

Capture the list level and report it in metadata.category_depth (zero-based, so top-level items have depth 0). Consider using ._list_level(paragraph: Paragraph) -> Optional[int] to detect list items, where None means "not a list item".
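One possible shape for the level lookup, again as a standalone sketch on raw XML rather than the real method. It assumes the level comes from the `lvl` attribute of a:pPr (default 0) and only falls back one step, to the text-frame's a:lstStyle/a:lvlNpPr. The rest of the style hierarchy (layout and master placeholders) is not handled. `list_level` and `_bullet_state` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from typing import Optional

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"
NS = {"a": A_NS}


def _bullet_state(props: Optional[ET.Element]) -> Optional[bool]:
    """True = bullet or number declared, False = a:buNone, None = nothing declared."""
    if props is None:
        return None
    if props.find("a:buNone", NS) is not None:
        return False
    if any(props.find(tag, NS) is not None for tag in ("a:buChar", "a:buAutoNum")):
        return True
    return None


def list_level(p: ET.Element, tx_body: ET.Element) -> Optional[int]:
    """Zero-based list level of paragraph `p`, or None when it is not a list item."""
    pPr = p.find("a:pPr", NS)
    level = int(pPr.get("lvl", "0")) if pPr is not None else 0
    state = _bullet_state(pPr)
    if state is None:
        # Nothing on the paragraph: consult the text-frame list style, whose
        # per-level children are named a:lvl1pPr .. a:lvl9pPr (1-based).
        state = _bullet_state(tx_body.find(f"a:lstStyle/a:lvl{level + 1}pPr", NS))
    return level if state else None


body = ET.fromstring(f"""
<p:txBody xmlns:a="{A_NS}" xmlns:p="{P_NS}">
  <a:bodyPr/>
  <a:lstStyle><a:lvl1pPr><a:buChar char="-"/></a:lvl1pPr></a:lstStyle>
  <a:p><a:r><a:t>bullet inherited from the text-frame</a:t></a:r></a:p>
  <a:p><a:pPr lvl="1"><a:buAutoNum type="arabicPeriod"/></a:pPr></a:p>
  <a:p><a:pPr><a:buNone/></a:pPr></a:p>
</p:txBody>
""")
print([list_level(p, body) for p in body.findall("a:p", NS)])  # [0, 1, None]
```

Returning Optional[int] lets one call answer both questions: whether the paragraph is a list item, and what value metadata.category_depth should take.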

Context

  • In PowerPoint, so-called "bullet-slides" are very common, surely the most common slide type.
  • A bullet slide can contain both bulleted and numbered list-items as well as non-list-item paragraphs. All three can be mixed in the same list.
  • Lists have multiple levels, i.e. top-level bullets, sub-bullets, etc.
  • Numbered list-items are indicated differently than bulleted list-items and non-list-item paragraphs (which must be identified explicitly when they appear in a list-format text-frame).
  • List-item status can be "inherited" from the style hierarchy, in particular from the text-frame a paragraph appears in, such that the list-item paragraph has no direct indication that it displays a bullet or number. Detection that relies only on such paragraph-level attributes is therefore unreliable.

numbered-list-items.pptx

scanny, Sep 20 '23 20:09