Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

Results 188 unstructured issues

**Summary** A few additional small, mechanical odds and ends required for PPTX image extraction. The big one is removing the leading underscore from `PptxPartitionerOptions` because now client code that implements...

**Describe the bug** User gets a `TesseractError` when processing a particular document. **To Reproduce** Code was an API call with a certain image-based document. **Expected behavior** Document processed successfully. **Environment...

bug
ocr

The current docs do not specify that you don't dump the elements as JSON objects into the JSON file. It would be clearer if you gave an example of the...

documentation
enhancement

This pull request adds support for Astra DB as a source connector, pulling data from Astra DB for use in Unstructured and applying appropriate metadata.

This pull request adds support for specifying the indexing options for various columns in Astra DB, allowing users to avoid a situation where long text columns are indexed by default.

This PR attempts to fix a memory issue, which resulted in errors like this: https://github.com/Unstructured-IO/unstructured/issues/2931 The root cause seems to be in how ListItems are being combined, not in how...

`partition_docx()` does not seem to render field codes embedded in the document as text in the output. For example, a document I am working with has the 'subject' property inserted...

enhancement
docx

**Summary** The DOCX format allows a table row to start late and/or end early, meaning cells at the beginning or end of a row can be omitted. While there are...