Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

Results 188 unstructured issues

**Describe the bug** I am using the partition_pdf function to extract tables from a PDF file publicly available (https://www.highmarkbcbswv.com/PDFFiles/ANSI-reason-codes.pdf). After running the OCR, the elements contain every row of the...

bug

In trying to load a JSON file (structured as below) with a call to `elements = partition(filename=f)`, I get the error message in the title. ``` [ { "key1": "val1",...

json

**Is your feature request related to a problem? Please describe.** Currently, element_id's are simply a hash of the element's text. This is not great, since id's may then be duplicated...

enhancement
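
The duplication concern in the request above can be illustrated with a minimal sketch. It assumes a plain SHA-256 over the text; the helper names `text_only_id` and `position_aware_id` are hypothetical, and neither is the library's actual implementation. Mixing the element's position into the hash is just one possible way to disambiguate repeated text.

```python
import hashlib


def text_only_id(text: str) -> str:
    """Hypothetical ID scheme: a hash of the element text alone."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:32]


def position_aware_id(text: str, index: int) -> str:
    """Hypothetical alternative: include the element's position in the hash input."""
    return hashlib.sha256(f"{index}:{text}".encode("utf-8")).hexdigest()[:32]


# Two elements share the text "Total", as can happen with repeated table headers.
texts = ["Total", "Revenue", "Total"]

text_ids = [text_only_id(t) for t in texts]
position_ids = [position_aware_id(t, i) for i, t in enumerate(texts)]

# Count how many IDs collide under each scheme.
duplicate_text_ids = len(text_ids) - len(set(text_ids))
duplicate_position_ids = len(position_ids) - len(set(position_ids))

print(duplicate_text_ids, duplicate_position_ids)
```

A position-based scheme trades one problem for another: IDs change whenever elements are inserted or reordered, so it is a sketch of the trade-off rather than a recommendation.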

![image](https://github.com/Unstructured-IO/unstructured/assets/74747729/e6d6c4c1-3dc9-42e3-9974-6dd1b65024f9) When saving a file using Outlook and selecting the (*.msg) format instead of Unicode format, the text loaded by UnstructuredFileLoader will appear as garbled characters. Like this: ``` [Document(page_content='\u0a0d\u0a0d톰宥\ueab8\ue6ae\u0a0dꆬ쪰\uedb7\ue9a4ㄨ㌱约ꐲ㋫\ue9a4꘩쉢ꖾꙂ쎳꾺뫇Ꟗ꩑ꓷꖧ외끗ꗏ엾ꛩꑐꆯൃഊ\u200a\u0a0d\u0a0d\ue2a9謁\ue8a4ꆦ\u0a0d\u0a0d떤约亱쒱캥疡욼\ueca6人涱皡䆡䎨䢤튬뎦䶱疡\ue2a9謁잧릸皡䆡人涱ꆬ쪰亱\uf3a9箲\uf5b3\ue2a9墥튩뎦꒤謁잧릸䎡\u0a0d\u0a0d\u0a0d\u0a0d\ue2a9謁잧릸撬\u0a0d펭䢤䶱疡\ue2a9謁잧릸皡亱\uf3a9ꐱ㋫ꐹ⣩䂤ꤩ곳낡뫊꿴뚸\ua97d곱롤ꇟൃഊ\u200a\u0a0d\u0a0d꒤謁\ue2bb直\ueab8\ue6ae\u0a0d꒤謁傦꾤랶\uf3a9炤约嶩ꆬ쪰늵\uf4a7斫䆡첾뮥䢤퇃侧틃\uf3a9箲\uf5b3疡涱纫䦧ꮴ䊳皡뇃\ue2bb䆡侹즮\uf8b5傦\uf1a9\uf3b1䎡\u0a0d䂡낡붤墥傦꾤亱侫撯謁떶䆡璥澵熳뺪\ue2bb直䎡\u0a0d\u0a0d䂡낡瞩솴疸떹䢤ﮭꎤ\ue3a8ꆬ쪰\uedb7톤疡厯侧宥墽謁떶皡Ꞥ\ue2a9謁\ueab8\ue6ae䎡\u0a0d\u0a0dഠഊ갊낡곊뢢ෟ먊꧖띥ꩼ끁뇈뵍ㅵ㠸㠸\ue0c2ഴഊ',...

1. Large files (like 'covid19treatmentguidelines2.pdf', attached below) cannot be processed in a reasonable time; processing takes around 20 minutes. ``` from unstructured.partition.pdf import partition_pdf elements = partition_pdf(file_path,...

**Is your feature request related to a problem? Please describe.** Following up with the document hierarchy implementation, it'll be helpful to have a built-in function to group elements with the...

enhancement
good first issue

The package does not relay telemetry to packages.unstructured.io if `DO_NOT_TRACK` is set to `true` or if `SCARF_NO_ANALYTICS` is set to `true`. Given that this telemetry is enabled by default and...
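
As a minimal sketch of the opt-out described above, the two environment variables can be set from Python. The helper name `opt_out_of_telemetry` is hypothetical, and setting the variables before the package is imported is an assumption, since the excerpt does not say when they are read. Per the excerpt, either variable on its own is sufficient; the sketch sets both.

```python
import os


def opt_out_of_telemetry() -> None:
    """Set both opt-out variables named in the issue (either one should suffice)."""
    os.environ["DO_NOT_TRACK"] = "true"
    os.environ["SCARF_NO_ANALYTICS"] = "true"


# Assumption: call this before importing the package so the setting is visible
# whenever the telemetry check runs.
opt_out_of_telemetry()
```

Setting the variables in the shell or in the deployment environment has the same effect and avoids any question of import order.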

**Describe the bug** passing `unstructured.cleaners.core.group_bullet_paragraph` to `UnstructuredBaseLoader`'s `post_processors` will cause the code to break, because `group_bullet_paragraph` returns a `List[str]`, and `unstructured.documents.elements.Text.apply()` method checks the output of `group_bullet_paragraph`, and throws an...

bug
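
One possible workaround for the type mismatch described above is to wrap a list-returning cleaner so that it yields a single string. This is a sketch under assumptions: the helper `join_cleaner_output` is hypothetical, `split_bullets` is a stand-in that only mimics the `List[str]` return type of `group_bullet_paragraph`, and it is assumed (the excerpt is truncated) that the check expects a `str`.

```python
from typing import Callable, List


def join_cleaner_output(
    cleaner: Callable[[str], List[str]], separator: str = "\n"
) -> Callable[[str], str]:
    """Wrap a cleaner that returns a list of strings so it returns one string."""

    def wrapped(text: str) -> str:
        return separator.join(cleaner(text))

    return wrapped


def split_bullets(text: str) -> List[str]:
    """Stand-in cleaner (not the library function): split on bullets, drop empties."""
    return [part.strip() for part in text.split("•") if part.strip()]


# The wrapped cleaner now returns a str instead of a List[str].
string_cleaner = join_cleaner_output(split_bullets)
result = string_cleaner("• one • two")
print(repr(result))
```

Joining with a newline keeps each bullet on its own line; a different separator can be passed if the downstream consumer expects one.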

I am using the hi_res model locally and tried it both with and without chunking as well. I also tried the chipper model via api, but faced similar issues as...

bug

I'm getting the following exception. Could someone please help? office365.runtime.client_request_exception.ClientRequestException: ('AccessDenied', 'Either scp or roles claim need to be present in the token.', '403 Client Error: Forbidden for url: https://graph.microsoft.com/v1.0/users/[email protected]/drive') ![image](https://github.com/Unstructured-IO/unstructured/assets/60917433/fc7eb3ac-b137-4654-a9d0-c67dd6b148c6)

bug