unstructured issues

rfctr(pptx): make PptxPartitionerOptions public

**Summary** A few additional small, mechanical odds and ends required for PPTX image extraction. The big one is removing the leading underscore from `PptxPartitionerOptions` because now client code that implements...

scanny

bug: TesseractError: Estimating resolution as X

1

**Describe the bug** User gets a `TesseractError` when processing a particular document. **To Reproduce** Code was an API call with a certain image-based document. **Expected behavior** Document processed successfully. **Environment...

qued

bug

ocr

Clarify `orig_elements` documentation

4

The current docs do not make clear that the elements are not dumped as JSON objects into the JSON file. It would be clearer if you gave an example of the...

Marcell-Balint

documentation

enhancement

feat: Astra DB Source Connector Support

4

This pull request adds support for Astra DB as a source connector, pulling in data from Astra DB for use in Unstructured and applying appropriate metadata.

erichare

feat: Support Indexing options for Astra DB columns

10

This pull request adds support for specifying indexing options for individual columns in Astra DB, so users can prevent long text columns from being indexed by default.

erichare

Quickfix for elements sharing the same memory address

2

This PR attempts to fix a memory issue, which resulted in errors like this: https://github.com/Unstructured-IO/unstructured/issues/2931 The root cause seems to be in how ListItems are being combined, not in how...

micmarty-deepsense

feat/docx-field-codes

`partition_docx()` does not seem to render field codes embedded in the document as text in the output. For example, a document I am working with has the 'subject' property inserted...

erik-squared

enhancement

docx

fix(docx): fix short-row DOCX table

1

**Summary** The DOCX format allows a table row to start late and/or end early, meaning cells at the beginning or end of a row can be omitted. While there are...

scanny

update unstructured-client and lxml requirements

1

Coniferish

chore CORE-4775: remove html page number metadata field

yuming-long
