Steve Canny
### Problem `partition_pptx()` does not detect all bulleted-list items or any numbered-list items and does not capture list-level metadata (`metadata.category_depth`) from list items. For example, this slide (pptx file attached):...
### Context Perhaps 99.9% of PowerPoint slides include a dedicated title shape. The only built-in slide layout that does not provide a title shape is "Blank Slide", which would typically...
The docx format includes [Dublin Core](https://en.wikipedia.org/wiki/Dublin_Core) metadata in its `core.xml` "part". This metadata reliably includes a `modified` timestamp in ISO 8601 form, e.g. `2023-09-14T04:12:00Z`. Because this timestamp is contained in...
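As a rough illustration of reading that timestamp, here is a minimal standard-library sketch. It assumes the core-properties part sits at the conventional `docProps/core.xml` path inside the package (strictly, its location is given by the package relationships) and that the value ends in `Z`. `read_modified()` and `make_minimal_docx()` are hypothetical names used only for this example, and the second one builds a stand-in package, not a valid Word document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace of the Dublin Core "terms" elements used in core.xml.
DCTERMS_NS = "http://purl.org/dc/terms/"


def read_modified(docx_file):
    """Return the `modified` core property as a UTC datetime, or None if absent.

    Assumption: the core-properties part is stored at `docProps/core.xml`.
    """
    with zipfile.ZipFile(docx_file) as zf:
        try:
            core_xml = zf.read("docProps/core.xml")
        except KeyError:
            return None
    node = ET.fromstring(core_xml).find(f"{{{DCTERMS_NS}}}modified")
    if node is None or not node.text:
        return None
    # `datetime.fromisoformat()` only accepts a trailing "Z" on Python 3.11+,
    # so rewrite it as an explicit UTC offset first.
    text = node.text.strip().replace("Z", "+00:00")
    return datetime.fromisoformat(text).astimezone(timezone.utc)


def make_minimal_docx(modified):
    """Hypothetical helper: an in-memory zip holding only a core.xml part."""
    core_xml = (
        '<cp:coreProperties'
        ' xmlns:cp="http://schemas.openxmlformats.org/package/2006/metadata/core-properties"'
        f' xmlns:dcterms="{DCTERMS_NS}"'
        ' xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
        f'<dcterms:modified xsi:type="dcterms:W3CDTF">{modified}</dcterms:modified>'
        '</cp:coreProperties>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("docProps/core.xml", core_xml)
    buf.seek(0)
    return buf
```

Calling `read_modified(make_minimal_docx("2023-09-14T04:12:00Z"))` yields a timezone-aware `datetime` for that instant. A package with no core-properties part yields `None`.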
An author can embed one or more images in a Word document. Extract those during partitioning and include them in the element stream as an `Image` element if the partition...
Remedy disk-space leak where `partition_odt()` would leave an on-disk copy of each `.odt` file passed as a file-like object. `partition_odt()` creates a temporary file in which it writes each source-document...
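One general way to avoid this kind of leak is to scope the on-disk copy to a `tempfile.TemporaryDirectory()` context, which removes the copy when the block exits. The sketch below shows that pattern only and is not necessarily the fix made here. `with_temp_copy()` and its parameters are hypothetical names for this example.

```python
import os
import tempfile


def with_temp_copy(file, func, suffix=".odt"):
    """Write a file-like object to a temporary path, call `func(path)`, then clean up.

    Hypothetical helper: the directory and the copy inside it are deleted when
    the `with` block exits, even if `func` raises.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "source" + suffix)
        with open(path, "wb") as f:
            f.write(file.read())
        # `func` must finish using the path before returning, because the
        # file no longer exists once this function returns.
        return func(path)
```

For example, `with_temp_copy(io.BytesIO(data), os.path.getsize)` returns the size of the copy, and the temporary path is gone by the time the call returns.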
Hi @wincent, thanks so much for this package! I swear I use this a hundred times a day and over the years I've gotten to where I just think of...
**Summary** Use a more sophisticated algorithm for splitting oversized `Table` elements into `TableChunk` elements during chunking to ensure element text and HTML are "synchronized" and HTML is always parseable. **Additional Context**...
**Summary** Mechanical refactoring in preparation for adding (pre-chunk) `TableSplitter` in a PR stacked on this one.
**Summary** The contract of `partition_json()` is to "rehydrate" the JSON elements serialized to a JSON array of element objects. However, it changes the `element_id` and certain metadata fields from their...
**Describe the bug** When partitioning a JSON file using `partition()` and providing a `metadata_filename` argument that has a `.html` extension, the result is a single element with the entire JSON...