Steve Canny
**Problem** Chunk text begins mid-word when `overlap` is specified.  **Desired solution** Compute the overlap prefix as the next even-word boundary greater than or equal to `overlap` characters from the...
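A minimal sketch of the word-boundary idea, assuming the overlap is taken from the tail of the preceding chunk. The function name `overlap_tail` and its parameters are hypothetical and are not taken from the library; the sketch only illustrates widening the tail so it never starts mid-word.

```python
def overlap_tail(prev_chunk: str, overlap: int) -> str:
    """Return a tail of `prev_chunk` that starts on a word boundary.

    Hypothetical helper: the tail is at least `overlap` characters long
    (when the chunk allows it) and never begins in the middle of a word.
    """
    if overlap <= 0:
        return ""
    if overlap >= len(prev_chunk):
        return prev_chunk

    # Naive start position: exactly `overlap` characters from the end.
    start = len(prev_chunk) - overlap

    # Walk backward until the character before `start` is whitespace,
    # i.e. until `start` sits on the first character of a word.
    while start > 0 and not prev_chunk[start - 1].isspace():
        start -= 1

    return prev_chunk[start:]


print(overlap_tail("the quick brown fox", 5))
```

With an `overlap` of 5 the naive tail would be `"n fox"`; the sketch widens it to `"brown fox"` instead.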
We want to include images in the `docx` partitioner element stream. * The stream should include an element for each (qualified) image embedded in the document. * The image should...
**Summary** The indent-level for a bullet in DOCX is stored in the XML as an `int`. However, Word is tolerant of a floating-point value in that field and does not...
**Summary** A few additional small, mechanical odds and ends required for PPTX image extraction. The big one is removing the leading underscore from `PptxPartitionerOptions` because now client code that implements...
The ability to dynamically add to a table, especially to add a row, is very handy when pulling records from a data source where the length of the list may vary.
**Summary** The DOCX format allows a table row to start late and/or end early, meaning cells at the beginning or end of a row can be omitted. While there are...
`element: Self` is used in multiple places in the stub for `etree._Element`. Here's a typical example: https://github.com/abelcheung/types-lxml/blob/main/lxml-stubs/etree/_element.pyi#L97 This produces over-narrowing of the type when used with custom element classes, like...
**Summary** Avoid `SyntaxWarning` and/or `SyntaxError` messages when importing `unstructured.nlp.patterns` by using raw strings (`"r"` prefix) for regex patterns which may contain `\x` character sequences not recognized by the Python parser...
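As an illustration of the raw-string approach, here is a small sketch. The pattern name `NUMBERED_ITEM` is hypothetical and is not one of the patterns in `unstructured.nlp.patterns`; it only shows that a raw string and a doubled-backslash string produce the same regex source without relying on unrecognized escape sequences.

```python
import re

# Raw string: the backslashes reach the regex engine untouched, so the
# Python parser never has to interpret `\d` or `\s` as string escapes.
NUMBERED_ITEM = re.compile(r"^\d+\.\s+")

# The same pattern without the `r` prefix needs every backslash doubled.
NUMBERED_ITEM_ESCAPED = re.compile("^\\d+\\.\\s+")

# Both spellings yield an identical pattern string.
same_source = NUMBERED_ITEM.pattern == NUMBERED_ITEM_ESCAPED.pattern
print(same_source)
```

Writing `"^\d+\.\s+"` with neither the prefix nor the doubled backslashes is the form that triggers the parser warning.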
### Problem `partition_pptx()` excludes "off-slide" shapes from partitioning. However, it only detects off-slide shapes that are to the left or above the slide. Shapes can also be off-slide to the...
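One way to check all four sides is sketched below. It assumes "off-slide" means the shape's bounding box lies entirely outside the slide rectangle, and it works on plain integer coordinates (for example EMU) rather than on library objects. The function `is_off_slide` and its parameters are hypothetical, not the partitioner's actual code.

```python
def is_off_slide(
    left: int,
    top: int,
    width: int,
    height: int,
    slide_width: int,
    slide_height: int,
) -> bool:
    """Return True when the shape's box does not intersect the slide.

    Assumption: a shape counts as off-slide only when it is entirely
    outside the slide rectangle on at least one side.
    """
    right = left + width
    bottom = top + height
    return (
        right <= 0  # entirely to the left of the slide
        or bottom <= 0  # entirely above the slide
        or left >= slide_width  # entirely to the right of the slide
        or top >= slide_height  # entirely below the slide
    )


# A 16:9-agnostic example using a 9144000 x 6858000 EMU slide.
SLIDE_W, SLIDE_H = 9_144_000, 6_858_000
print(is_off_slide(10_000_000, 100_000, 500_000, 500_000, SLIDE_W, SLIDE_H))
```

A shape that only partly overhangs an edge still intersects the slide, so this sketch keeps it.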