Steve Canny issues

Results: 28 issues of Steve Canny

pptx: improve list-item detection

### Problem `partition_pptx()` does not detect all bulleted-list items or any numbered-list items and does not capture list-level metadata (`metadata.category_depth`) from list items. For example, this slide (pptx file attached):...

enhancement

pptx

needs follow up

pptx: classify slide title as Title element

### Context Perhaps 99.9% of PowerPoint slides include a dedicated title shape. The only built-in slide layout that does not provide a title shape is "Blank Slide" which would typically...

pptx

needs follow up

docx: prefer document.core_properties.modified to filesystem last-modified

The docx format includes [Dublin Core](https://en.wikipedia.org/wiki/Dublin_Core) metadata in its `core.xml` "part". This metadata reliably includes a `modified` timestamp in ISO 8601 form, e.g. `2023-09-14T04:12:00Z`. Because this timestamp is contained in...

enhancement

docx
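As a rough illustration of the docx entry above, the `modified` timestamp can be read directly from the package with the standard library. This is a minimal sketch, not the project's implementation: it assumes the core-properties part sits at the conventional `docProps/core.xml` path and that the timestamp uses the `Z`-suffixed form shown in the issue. The helper name `docx_last_modified` is hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from typing import Optional

# Dublin Core "terms" namespace used by the <dcterms:modified> element.
DCTERMS_NS = "http://purl.org/dc/terms/"


def docx_last_modified(file) -> Optional[datetime]:
    """Return the document's own `modified` timestamp, or None if absent.

    `file` may be a path or a binary file-like object. Assumes the
    core-properties part lives at the conventional `docProps/core.xml`.
    """
    with zipfile.ZipFile(file) as zf:
        try:
            core_xml = zf.read("docProps/core.xml")
        except KeyError:
            return None  # package has no core-properties part at that path

    root = ET.fromstring(core_xml)
    node = root.find(f"{{{DCTERMS_NS}}}modified")
    if node is None or not node.text:
        return None

    # Assumes the "2023-09-14T04:12:00Z" form; other ISO 8601 variants
    # would need a more tolerant parser.
    parsed = datetime.strptime(node.text.strip(), "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)
```

A caller could fall back to the filesystem's last-modified time only when this returns `None`.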

docx: include embedded images as Image elements

An author can embed one or more images in a Word document. Extract those during partitioning and include them in the element stream as an `Image` element if the partition...

enhancement

docx

needs follow up

fix(odt): fix disk-space leak in partition_odt()

Remedy disk-space leak where `partition_odt()` would leave an on-disk copy of each `.odt` file passed as a file-like object. `partition_odt()` creates a temporary file in which it writes each source-document...

Seems to have stopped working in Neovim 0.10

Hi @wincent, thanks so much for this package! I swear I use this a hundred times a day and over the years I've gotten to where I just think of...

feat(chunk): split tables on even row boundaries

**Summary** Use a more sophisticated algorithm for splitting oversized `Table` elements into `TableChunk` elements during chunking, so that element text and HTML stay "synchronized" and the HTML is always parseable. **Additional Context**...
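One way to picture splitting on row boundaries is shown below. This is a simplified sketch under stated assumptions, not the algorithm from the PR: rows are plain lists of cell strings, size is measured on the text form only, and `split_table_rows` and `_render` are hypothetical names. Because each chunk's text and HTML are rendered from the same batch of whole rows, the two stay in step and every HTML fragment is a complete `<table>`.

```python
from html import escape
from typing import Iterator, List, Tuple

Row = List[str]


def _render(rows: List[Row]) -> Tuple[str, str]:
    """Render one batch of whole rows as (text, html)."""
    text = "\n".join(" ".join(row) for row in rows)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return text, f"<table>{body}</table>"


def split_table_rows(rows: List[Row], max_chars: int) -> Iterator[Tuple[str, str]]:
    """Yield (text, html) chunks, breaking only between rows.

    A single row longer than `max_chars` is emitted alone rather than
    split mid-row; handling that case is out of scope for this sketch.
    """
    batch: List[Row] = []
    batch_len = 0
    for row in rows:
        row_len = len(" ".join(row))
        added = row_len + (1 if batch else 0)  # +1 for the joining newline
        if batch and batch_len + added > max_chars:
            yield _render(batch)
            batch, batch_len = [], 0
            added = row_len
        batch.append(row)
        batch_len += added
    if batch:
        yield _render(batch)
```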

rfctr(chunk): prep for adding TableSplitter

**Summary** Mechanical refactoring in preparation for adding (pre-chunk) `TableSplitter` in a PR stacked on this one.

bug(json): partition_json() does not preserve original element_id or metadata

**Summary** The contract of `partition_json()` is to "rehydrate" the JSON elements serialized to a JSON array of element objects. However, it changes the `element_id` and certain metadata fields from their...

bug

json

bug(json): partition() places entire JSON file into text of single element when `metadata_filename` has .html extension

**Describe the bug** When partitioning a JSON file using `partition()` and providing a `metadata_filename` argument that has a `.html` extension, the result is a single element with the entire JSON...

bug

json