unstructured issues

Results 188 unstructured issues

bug/KeyError with PDF partition fast strategy element ID — in old_to_new_mapping[parent_id]

**Describe the bug** When partitioning [this](https://github.com/Unstructured-IO/unstructured/files/15109052/a1977-backus.pdf) PDF document with the `fast` strategy, the following `KeyError` occurs: ``` { "name": "KeyError", "message": "'782eec119b3409ea1a0bc8abf8f059ac'", "stack": "--------------------------------------------------------------------------- KeyError Traceback (most recent call last)...

adieuadieu

bug
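The failure mode named in the title, a direct `old_to_new_mapping[parent_id]` lookup for a parent ID that was never added to the mapping, can be illustrated with a small sketch. This is not the library's code or its fix: `remap_parent_ids` and the element dictionaries are hypothetical, and only the names `old_to_new_mapping` and `parent_id` come from the issue title.

```python
def remap_parent_ids(elements, old_to_new_mapping):
    """Return copies of elements with parent_id rewritten via the mapping.

    Hypothetical helper for illustration only.
    """
    remapped = []
    for element in elements:
        parent_id = element.get("parent_id")
        # dict.get returns None for an unknown parent instead of raising
        # KeyError, which is what old_to_new_mapping[parent_id] would do.
        new_parent_id = old_to_new_mapping.get(parent_id)
        remapped.append({**element, "parent_id": new_parent_id})
    return remapped


old_to_new_mapping = {"old-1": "new-1"}
elements = [
    {"text": "child of known parent", "parent_id": "old-1"},
    {"text": "child of unknown parent", "parent_id": "not-in-mapping"},
    {"text": "no parent", "parent_id": None},
]
result = remap_parent_ids(elements, old_to_new_mapping)
print([e["parent_id"] for e in result])
```

Whether dropping an unresolvable parent is the right behaviour, as opposed to keeping the old ID or raising a clearer error, is a design question the issue itself would have to settle.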

Text Extraction Issue: Greek Language PDFs Rendered with Incorrect Alphabet

**Describe the bug** I am evaluating the UnstructuredClient for processing PDF documents and am encountering an issue with the Greek language text extraction. When I attempt to extract text from...

DarioBernardo

bug

ocr

chore: Update unstructured-client

The pinned version of unstructured-client was changed from `>=0.15.1` to `

Coniferish

bug

Favor upsert over add + use element_id to prevent duplicates

Make chroma ingest pipeline idempotent :) @potter-potter

0xjgv
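A minimal sketch of why upserting by `element_id` makes ingestion idempotent. The `InMemoryCollection` class below is a hypothetical stand-in for a vector-store collection, not the Chroma API, and the element dictionaries are assumed to carry a stable `element_id`.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryCollection:
    """Hypothetical stand-in for a vector-store collection."""

    rows: dict = field(default_factory=dict)

    def upsert(self, ids, documents):
        # Insert new ids and overwrite existing ones, so re-runs do not
        # create duplicate rows.
        for id_, doc in zip(ids, documents):
            self.rows[id_] = doc


def ingest(collection, elements):
    # Each element is assumed to carry a stable "element_id" and its "text".
    ids = [e["element_id"] for e in elements]
    docs = [e["text"] for e in elements]
    collection.upsert(ids=ids, documents=docs)


elements = [
    {"element_id": "a1", "text": "Title"},
    {"element_id": "b2", "text": "Body paragraph"},
]

collection = InMemoryCollection()
ingest(collection, elements)
ingest(collection, elements)  # second run overwrites rather than appends
print(len(collection.rows))
```

Running the pipeline twice leaves the same number of rows as running it once, which is the property the title asks for.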

Extend write_config with additional metadata chroma.py

Allow users to set additional metadata values to expand on metadata filtering capabilities. Useful to narrow down the search scope with metadata filters. cc @potter-potter https://cookbook.chromadb.dev/core/filters/#metadata-filters

0xjgv

GPU isn't used

GPU is not utilized during the process!

abrahimzaman360

bug

feat/partition_metadata

**Is your feature request related to a problem? Please describe.** I need to be able to extract additional metadata from HTML documents. Specifically I would like to extract favicons and...

Falven

enhancement

html

bug/_convert_table_to_text index out of range

**Describe the bug** A list index out of range occurs in _convert_table_to_text during docx parsing. **To Reproduce** I was operating on 1360 docx files from this source: https://www.3gpp.org/ftp/Specs/latest/Rel-17 In the...

igoforth

bug

docx

feat/retain md image links

Unstructured doesn't currently retain markdown image links (like [this format](https://www.codecademy.com/resources/docs/markdown/images)). A user wants to load documents through LangChain with Unstructured and keep the markdown image links.

shreyanid

enhancement

ValueError: Detected a JSON file that does not conform to the Unstructured schema. partition_json currently only processes serialized Unstructured output.

When trying to load a JSON file using S3FileLoader, which uses Unstructured to load files, the following error is raised: ValueError: Detected a JSON file that does not conform to the...

NicoleNL
