unstructured
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
This pull request allows the table transformer to return predictions in a raw cell representation. This will later be used to save predictions in a cell format for simpler metrics calculation....
Hi, please find the URL below: https://unstructured-io.github.io/unstructured/ingest/source_connectors/wikipedia.html The explanation says "Connect **Airtable** to your preprocessing pipeline, and batch process all your documents using unstructured-ingest to store structured outputs locally on...
This pull request adds metrics that are calculated based on table_as_cells instead of text_as_html. This change is required for comprehensive metrics calculation, as previously every colspan or rowspan predicted was...
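To illustrate why a cell-based representation can simplify metrics, here is a minimal sketch of comparing two tables cell by cell. It is not the code from the pull request: the cell record shape (`row`, `col`, `content`) and the function names `cells_to_grid` and `cell_content_accuracy` are hypothetical, chosen only for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical cell record for illustration: {"row": int, "col": int, "content": str}
Cell = Dict[str, object]


def cells_to_grid(cells: List[Cell]) -> Dict[Tuple[int, int], str]:
    """Index cells by (row, col) so two tables can be compared position by position."""
    return {(int(c["row"]), int(c["col"])): str(c["content"]).strip() for c in cells}


def cell_content_accuracy(predicted: List[Cell], expected: List[Cell]) -> float:
    """Fraction of grid positions (union of both tables) whose content matches exactly."""
    pred_grid = cells_to_grid(predicted)
    exp_grid = cells_to_grid(expected)
    positions = set(pred_grid) | set(exp_grid)
    if not positions:
        return 1.0  # two empty tables are treated as identical
    matches = sum(1 for pos in positions if pred_grid.get(pos) == exp_grid.get(pos))
    return matches / len(positions)
```

Because each cell carries its own position, a missing or misplaced cell affects only that position in the score, and no HTML needs to be parsed first.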
**Describe the bug** `pip install unstructured[pdf]` results in `ERROR: Could not build wheels for onnx, which is required to install pyproject.toml-based projects` **To Reproduce** `% pip install unstructured[pdf]` **Expected behavior**...
**Describe the bug** Some spaces are removed from the text when partitioning a PDF document. **To Reproduce** PDF: [rok_20230930_1-1.pdf](https://github.com/Unstructured-IO/unstructured/files/15001636/rok_20230930_1-1.pdf)
```
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="rok_20230930_1-1.pdf",
    strategy="hi_res",
    infer_table_structure=True,
)
print(str(elements[20]))
```
**Current...
When attempting to execute `partition_doc` to pre-process multiple documents at the same time, it fails with the following error: `PackageNotFoundError: Package not found at '/var/folders/p5/dljg1qv95y97dyq1c38xgb6r0000gn/T/tmp3nwg1qob/test.docx'` Here is a...
**Describe the bug** When I import the modules as below, I get the following error:
```
from unstructured.partition.html import partition_html
from unstructured.partition.pptx import partition_pptx
```
`TypeError: add_chunking_strategy() missing...
**Describe the bug** Cannot use unstructured on MacOS M2 Pro because `from unstructured.partition.html import partition_html` throws ``` Traceback (most recent call last): File "", line 1, in File "/Users/liamvdv/src/github.com/REDACT/.venv/lib/python3.9/site-packages/unstructured/partition/html.py", line...
I am trying to use the Unstructured library locally with Python 3.10.2. Every time I try to import unstructured.partition.something, for example `from unstructured.partition.pdf import partition_pdf`, the kernel dies. I...