unstructured
Open-source libraries and APIs for building custom preprocessing pipelines for labeling, training, or production machine learning workflows.
**Describe the bug** When partitioning [this](https://github.com/Unstructured-IO/unstructured/files/15109052/a1977-backus.pdf) PDF document with the `fast` strategy, the following `KeyError` occurs:
```
{ "name": "KeyError", "message": "'782eec119b3409ea1a0bc8abf8f059ac'", "stack": "--------------------------------------------------------------------------- KeyError Traceback (most recent call last)...
```
**Describe the bug** I am evaluating the UnstructuredClient for processing PDF documents and am encountering an issue with Greek-language text extraction. When I attempt to extract text from...
The pinned version of unstructured-client was changed from `>=0.15.1` to `
Make chroma ingest pipeline idempotent :) @potter-potter
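One common way to make an ingest step idempotent is to derive each record's ID from its content and write with upsert semantics, so a re-run overwrites existing records instead of duplicating them. The sketch below illustrates only that idea: the `store` dict stands in for a vector-store collection, and `element_id` and `ingest` are hypothetical helpers, not part of unstructured or Chroma.

```python
import hashlib


def element_id(text: str, source: str) -> str:
    """Derive a stable ID from an element's source path and text (hypothetical helper)."""
    return hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()


def ingest(store: dict, elements: list) -> None:
    """Upsert elements keyed by deterministic ID; the dict is a stand-in for a collection."""
    for el in elements:
        store[element_id(el["text"], el["source"])] = el


store = {}
batch = [
    {"text": "Hello", "source": "a.pdf"},
    {"text": "World", "source": "a.pdf"},
]
ingest(store, batch)
ingest(store, batch)  # the second run rewrites the same keys and adds nothing new
print(len(store))
```

Because the ID depends only on the source and text, running the same batch any number of times leaves the store in the same state.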
Allow users to set additional metadata values to expand on metadata filtering capabilities. Useful to narrow down the search scope with metadata filters. cc @potter-potter https://cookbook.chromadb.dev/core/filters/#metadata-filters
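As a rough illustration of the request, the sketch below merges user-supplied key/value pairs into each record's metadata and then narrows a result set with an equality filter. It is plain Python under assumed record shapes; `with_extra_metadata` and `filter_by_metadata` are hypothetical names, not functions from unstructured or Chroma.

```python
def with_extra_metadata(records: list, extra: dict) -> list:
    """Return copies of the records with user-supplied metadata merged in (hypothetical helper)."""
    return [
        {**r, "metadata": {**r.get("metadata", {}), **extra}}
        for r in records
    ]


def filter_by_metadata(records: list, where: dict) -> list:
    """Keep records whose metadata equals every key/value pair in `where`."""
    return [
        r for r in records
        if all(r.get("metadata", {}).get(k) == v for k, v in where.items())
    ]


docs = [
    {"text": "Quarterly report", "metadata": {"filename": "q1.pdf"}},
    {"text": "Meeting notes", "metadata": {"filename": "notes.docx"}},
]
tagged = with_extra_metadata(docs[:1], {"team": "finance"}) + docs[1:]
matches = filter_by_metadata(tagged, {"team": "finance"})
print([m["text"] for m in matches])
```

Only the record tagged with the extra `team` value passes the filter, which is the narrowing effect the request describes.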
**Is your feature request related to a problem? Please describe.** I need to be able to extract additional metadata from HTML documents. Specifically I would like to extract favicons and...
**Describe the bug** A `list index out of range` error occurs in `_convert_table_to_text` during docx parsing. **To Reproduce** I was operating on 1360 docx files from this source: https://www.3gpp.org/ftp/Specs/latest/Rel-17 In the...
Unstructured doesn't currently retain markdown image links (like [this format](https://www.codecademy.com/resources/docs/markdown/images)). A user wants to load documents through LangChain with Unstructured and keep the markdown image links.
When trying to load a JSON file using `S3FileLoader`, which uses Unstructured to load files, the following error is raised: ValueError: Detected a JSON file that does not conform to the...