unstructured
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
**Describe the bug** I'm trying to import UncategorizedText from unstructured.documents.elements so I can use it to later filter out what I need from the fully partitioned pdf document. All other...
Hi, when using auto partitioning to partition PDFs, is it possible to get OCR metadata (quality, whether OCR was used, etc.) when the PDF parser falls back to the OCR strategy?
```
from typing import Any

from pydantic import BaseModel
from unstructured.partition.pdf import partition_pdf

raw_pdf_elements = partition_pdf(
    filename="some_pdf.pdf",
    extract_images_in_pdf=False,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=".",
)
```
Running this function with...
**Is your feature request related to a problem? Please describe.** Up until unstructured `0.10.27` it was possible to use the `fast` and `ocr_only` strategy without having `unstructured_inference` installed (which pulls...
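One way to keep an optional dependency optional is to check for it only when a strategy actually needs it. The sketch below is illustrative and is not the library's implementation: `module_available`, `check_strategy`, and the assumption that only `hi_res` requires `unstructured_inference` are hypothetical.

```python
import importlib.util

# Assumption for illustration: only "hi_res" needs the optional package.
STRATEGIES_NEEDING_INFERENCE = {"hi_res"}


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def check_strategy(strategy: str, inference_installed: bool) -> None:
    """Raise ImportError only when the chosen strategy needs the missing package."""
    if strategy in STRATEGIES_NEEDING_INFERENCE and not inference_installed:
        raise ImportError(
            f"strategy {strategy!r} requires the optional "
            "'unstructured_inference' package"
        )


if __name__ == "__main__":
    installed = module_available("unstructured_inference")
    # "fast" and "ocr_only" pass this check whether or not the package is installed.
    check_strategy("fast", installed)
    check_strategy("ocr_only", installed)
```

With this shape, the heavy import is deferred until a caller asks for a strategy that depends on it.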
**Describe the bug** Given a single-column CSV file (see the attached example), parsing fails because the delimiter cannot be determined. See https://github.com/Unstructured-IO/unstructured/blob/4096a38371bae062832b976dc7ebff4184b7991f/unstructured/partition/csv.py#L109 for...
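For context, Python's standard `csv.Sniffer` raises `csv.Error` when none of the candidate delimiters appears in the sample, which is the case for a single-column file. A minimal sketch of a fallback follows; the `detect_delimiter` helper, the candidate set, and the comma default are assumptions for illustration, not the code at the linked line.

```python
import csv


def detect_delimiter(sample: str, default: str = ",") -> str:
    """Sniff the delimiter, falling back to `default` when none can be determined."""
    try:
        # Restrict candidates so ordinary letters are never picked as delimiters.
        return csv.Sniffer().sniff(sample, delimiters=",;\t|").delimiter
    except csv.Error:
        # Single-column data contains no candidate delimiter at all.
        return default


if __name__ == "__main__":
    single_column = "name\nalice\nbob\n"
    delimiter = detect_delimiter(single_column)
    rows = list(csv.reader(single_column.splitlines(), delimiter=delimiter))
    print(rows)
```

Falling back to a comma is harmless for a single-column file, since each row still parses as one field.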
**Describe the bug** Only element returned from partition is (unstructured.documents.html.HTMLTitle, 'Please enable JS and disable any ad blocker') **To Reproduce** ``` !pip install "unstructured[all-docs]" url = 'https://www.nytimes.com/2024/02/19/world/europe/navalny-letters-russia.html' from unstructured.partition.auto import...
**Describe the bug** pip compile is used to make sure all the dependencies are pinned to versions that are guaranteed to work together. Currently, the dependencies are dynamically generated via the...
**Describe the bug** When using `partition_via_api`, the file extension for `file_filename` supersedes the `content_type` that the user passes in. **To Reproduce** The following results in a `400` from the API...
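The precedence the report asks for can be sketched with the standard library: an explicitly supplied content type wins, and the file extension is consulted only as a fallback. The `resolve_content_type` helper is hypothetical and does not reflect how `partition_via_api` or the API is implemented.

```python
import mimetypes
from typing import Optional


def resolve_content_type(filename: str, content_type: Optional[str] = None) -> str:
    """Prefer the caller's content type; guess from the extension only if absent."""
    if content_type:
        return content_type
    guessed, _encoding = mimetypes.guess_type(filename)
    # Generic binary type when the extension is unknown.
    return guessed or "application/octet-stream"


if __name__ == "__main__":
    # The explicit value is kept even though the extension suggests a PDF.
    print(resolve_content_type("notes.pdf", content_type="text/plain"))
    print(resolve_content_type("notes.pdf"))
```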
Bumps [dorny/paths-filter](https://github.com/dorny/paths-filter) from 2 to 3. Release notes Sourced from dorny/paths-filter's releases. v3.0.0 What's Changed Update README.md: added real world usage example by @iamtodor in dorny/paths-filter#178 Update Node.js to version...
Bumps [peter-evans/create-pull-request](https://github.com/peter-evans/create-pull-request) from 5 to 6. Release notes Sourced from peter-evans/create-pull-request's releases. Create Pull Request v6.0.0 Behaviour changes The default values for author and committer have changed. See "What's new"...