unstructured
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
**Summary** A few additional small, mechanical odds and ends required for PPTX image extraction. The big one is removing the leading underscore from `PptxPartitionerOptions` because now client code that implements...
**Describe the bug** User gets a `TesseractError` when processing a particular document. **To Reproduce** Code was an API call with a certain image-based document. **Expected behavior** Document processed successfully. **Environment...
The current docs do not specify that you should not dump the elements as JSON objects into the JSON file. It would be clearer if you gave an example of the...
This pull request adds support for Astra DB as a source connector, pulling in data from Astra DB for use in Unstructured and applying appropriate metadata.
This pull request adds support for specifying the indexing options for various columns in Astra DB, allowing users to avoid a situation where long text columns are indexed by default.
This PR attempts to fix a memory issue that resulted in errors like the one reported in https://github.com/Unstructured-IO/unstructured/issues/2931. The root cause seems to be in how ListItems are being combined, not in how...
`partition_docx()` does not seem to render field codes embedded in the document as text in the output. For example, a document I am working with has the 'subject' property inserted...
**Summary** The DOCX format allows a table row to start late and/or end early, meaning cells at the beginning or end of a row can be omitted. While there are...
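To illustrate the late-start/early-end row shape described in that summary, here is a minimal pure-Python sketch. It is not the library's implementation: `pad_row`, `grid_before`, and `grid_after` are hypothetical names introduced for this example, and the two counts are assumed to correspond to the `w:gridBefore` and `w:gridAfter` row properties that WordprocessingML uses to record omitted grid columns. The sketch ignores cells that span several columns.

```python
from typing import List


def pad_row(cells: List[str], grid_before: int = 0, grid_after: int = 0) -> List[str]:
    """Return the row with empty strings standing in for omitted cells.

    grid_before: number of grid columns skipped before the first cell (hypothetical name).
    grid_after: number of grid columns skipped after the last cell (hypothetical name).
    """
    if grid_before < 0 or grid_after < 0:
        raise ValueError("omitted-cell counts must be non-negative")
    return [""] * grid_before + list(cells) + [""] * grid_after


# Three rows of a 3-column table: a full row, one that ends early,
# and one that starts late.
rows = [
    (["Name", "Qty", "Price"], 0, 0),
    (["Widget", "2"], 0, 1),
    (["9.99"], 2, 0),
]

# After padding, every row has the same number of columns.
grid = [pad_row(cells, before, after) for cells, before, after in rows]
```

Padding each row to the full grid width keeps column positions aligned, so a cell in the third column stays in the third column even when the cells before it were omitted.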