Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

Results 188 unstructured issues

**Describe the bug** I'm trying to import UncategorizedText from unstructured.documents.elements so I can use it to later filter out what I need from the fully partitioned pdf document. All other...

bug

Hi, when using auto partitioning to partition PDFs, is it possible to get OCR metadata (quality, whether OCR was used, etc.) when the PDF parser falls back to the OCR strategy?

```
from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    filename="some_pdf.pdf",
    extract_images_in_pdf=False,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=".",
)
```

Running this function with...

bug
pdf

**Is your feature request related to a problem? Please describe.** Up until unstructured `0.10.27` it was possible to use the `fast` and `ocr_only` strategy without having `unstructured_inference` installed (which pulls...

enhancement

**Describe the bug** Given a single-column CSV file (see the attached example), partitioning fails because the delimiter cannot be determined. See https://github.com/Unstructured-IO/unstructured/blob/4096a38371bae062832b976dc7ebff4184b7991f/unstructured/partition/csv.py#L109 for...

bug
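The failure mode described above can be reproduced and worked around with the standard library alone. The sketch below is an illustration, not the code at the linked line: it assumes the delimiter is sniffed the way Python's `csv.Sniffer` does it, and the names `detect_delimiter`, `read_rows`, and `CANDIDATE_DELIMITERS` are hypothetical.

```python
import csv

# Assumption: these are the only delimiters worth considering.
CANDIDATE_DELIMITERS = ",;\t|"


def detect_delimiter(sample: str, default: str = ",") -> str:
    """Return the sniffed delimiter, or `default` when none can be determined."""
    try:
        dialect = csv.Sniffer().sniff(sample, delimiters=CANDIDATE_DELIMITERS)
        return dialect.delimiter
    except csv.Error:
        # A single-column file contains no candidate delimiter at all,
        # so sniffing raises "Could not determine delimiter".
        return default


def read_rows(sample: str) -> list[list[str]]:
    """Parse CSV text into rows, tolerating single-column input."""
    delimiter = detect_delimiter(sample)
    return list(csv.reader(sample.splitlines(), delimiter=delimiter))
```

With this fallback, a single-column input such as `"name\nalice\nbob\n"` parses into one-element rows instead of raising.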

**Describe the bug** Only element returned from partition is (unstructured.documents.html.HTMLTitle, 'Please enable JS and disable any ad blocker')

**To Reproduce**

```
!pip install "unstructured[all-docs]"
url = 'https://www.nytimes.com/2024/02/19/world/europe/navalny-letters-russia.html'
from unstructured.partition.auto import...
```

bug
html

**Describe the bug** pip compile is used to make sure all the dependencies are pinned to versions that are guaranteed to work together. Currently, the dependencies are dynamically generated via the...

bug

**Describe the bug** When using `partition_via_api`, the file extension for `file_filename` supersedes the `content_type` that the user passes in. **To Reproduce** The following results in a `400` from the API...

bug
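The precedence the report appears to expect (an explicit `content_type` wins, and the file extension is only a fallback) can be sketched as follows. This is a standalone illustration using the standard `mimetypes` module, not the library's or the API's implementation; `resolve_content_type` is a hypothetical helper.

```python
import mimetypes
from typing import Optional


def resolve_content_type(file_filename: str, content_type: Optional[str] = None) -> str:
    """Pick a MIME type, preferring the caller's explicit value."""
    if content_type:
        # An explicit content type always takes precedence.
        return content_type
    # Otherwise fall back to guessing from the file extension.
    guessed, _encoding = mimetypes.guess_type(file_filename)
    return guessed or "application/octet-stream"
```

Under this rule, a PDF uploaded as `notes.txt` with `content_type="application/pdf"` would still be treated as a PDF.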

Bumps [dorny/paths-filter](https://github.com/dorny/paths-filter) from 2 to 3. Release notes Sourced from dorny/paths-filter's releases. v3.0.0 What's Changed Update README.md: added real world usage example by @​iamtodor in dorny/paths-filter#178 Update Node.js to version...

dependencies
github_actions

Bumps [peter-evans/create-pull-request](https://github.com/peter-evans/create-pull-request) from 5 to 6. Release notes Sourced from peter-evans/create-pull-request's releases. Create Pull Request v6.0.0 Behaviour changes The default values for author and committer have changed. See "What's new"...

dependencies
github_actions