unstructured
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
**Describe the bug** I am using the `partition_pdf` function to extract tables from a publicly available PDF file (https://www.highmarkbcbswv.com/PDFFiles/ANSI-reason-codes.pdf). After running the OCR, the elements contain every row of the...
In trying to load a JSON file (structured as below) with a call to `elements = partition(filename=f)`, I get the error message in the title. ``` [ { "key1": "val1",...
**Is your feature request related to a problem? Please describe.** Currently, `element_id` values are simply a hash of the element's text. This is a problem, since IDs may then be duplicated...
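The collision described in this request is easy to reproduce in isolation. The sketch below is not the library's implementation: the SHA-256 choice, the 32-character truncation, and both function names are assumptions made for illustration. It shows a text-only hash giving two identical elements the same ID, next to one hypothetical alternative that mixes the element's position into the hash.

```python
import hashlib


def text_only_id(text: str) -> str:
    """Hypothetical ID derived from the element text alone."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:32]


def positional_id(text: str, index: int) -> str:
    """Hypothetical ID that also hashes the element's position in the document."""
    payload = f"{index}:{text}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:32]


# Two elements share the same text, e.g. a repeated page footer.
texts = ["Page footer", "Body paragraph", "Page footer"]

plain_ids = [text_only_id(t) for t in texts]
unique_ids = [positional_id(t, i) for i, t in enumerate(texts)]

print(len(set(plain_ids)))   # fewer distinct IDs than elements
print(len(set(unique_ids)))  # one distinct ID per element
```

Hashing the position is only one option; it trades stability under reordering for uniqueness, so whether it suits a given pipeline is a design decision the request leaves open.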
![image](https://github.com/Unstructured-IO/unstructured/assets/74747729/e6d6c4c1-3dc9-42e3-9974-6dd1b65024f9) When saving a file using Outlook and selecting the (*.msg) format instead of Unicode format, the text loaded by UnstructuredFileLoader will appear as garbled characters. Like this: ``` [Document(page_content='\u0a0d\u0a0d톰宥\ueab8\ue6ae\u0a0dꆬ쪰\uedb7\ue9a4ㄨ㌱约ꐲ㋫\ue9a4꘩쉢ꖾꙂ쎳꾺뫇Ꟗ꩑ꓷꖧ외끗ꗏ엾ꛩꑐꆯൃഊ\u200a\u0a0d\u0a0d\ue2a9謁\ue8a4ꆦ\u0a0d\u0a0d떤约亱쒱캥疡욼\ueca6人涱皡䆡䎨䢤튬뎦䶱疡\ue2a9謁잧릸皡䆡人涱ꆬ쪰亱\uf3a9箲\uf5b3\ue2a9墥튩뎦꒤謁잧릸䎡\u0a0d\u0a0d\u0a0d\u0a0d\ue2a9謁잧릸撬\u0a0d펭䢤䶱疡\ue2a9謁잧릸皡亱\uf3a9ꐱ㋫ꐹ⣩䂤ꤩ곳낡뫊꿴뚸\ua97d곱롤ꇟൃഊ\u200a\u0a0d\u0a0d꒤謁\ue2bb直\ueab8\ue6ae\u0a0d꒤謁傦꾤랶\uf3a9炤约嶩ꆬ쪰늵\uf4a7斫䆡첾뮥䢤퇃侧틃\uf3a9箲\uf5b3疡涱纫䦧ꮴ䊳皡뇃\ue2bb䆡侹즮\uf8b5傦\uf1a9\uf3b1䎡\u0a0d䂡낡붤墥傦꾤亱侫撯謁떶䆡璥澵熳뺪\ue2bb直䎡\u0a0d\u0a0d䂡낡瞩솴疸떹䢤ﮭꎤ\ue3a8ꆬ쪰\uedb7톤疡厯侧宥墽謁떶皡Ꞥ\ue2a9謁\ueab8\ue6ae䎡\u0a0d\u0a0dഠഊ갊낡곊뢢ෟ먊꧖띥ꩼ끁뇈뵍ㅵ㠸㠸\ue0c2ഴഊ',...
1. Large files (like 'covid19treatmentguidelines2.pdf', attached below) cannot be processed in a reasonable time; this one takes around 20 minutes. ``` from unstructured.partition.pdf import partition_pdf elements = partition_pdf(file_path,...
**Is your feature request related to a problem? Please describe.** Following up with the document hierarchy implementation, it'll be helpful to have a built-in function to group elements with the...
The package does not relay telemetry to packages.unstructured.io if `DO_NOT_TRACK` is set to `true` or if `SCARF_NO_ANALYTICS` is set to `true`. Given that this telemetry is enabled by default and...
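As a rough illustration of the opt-out rule stated above, the following sketch sets one of the two variables from Python and checks the rule with a small helper. The helper `telemetry_opted_out` is hypothetical and only mirrors the wording of the text (an exact match on `true`); it is not the package's own check. Setting the variable before the package is imported is also an assumption about when the telemetry call happens.

```python
import os
from typing import Mapping, Optional

# Variable names taken from the text above.
OPT_OUT_VARS = ("DO_NOT_TRACK", "SCARF_NO_ANALYTICS")


def telemetry_opted_out(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if either opt-out variable is set to exactly "true".

    Hypothetical helper that restates the documented rule.
    """
    if env is None:
        env = os.environ
    return any(env.get(name) == "true" for name in OPT_OUT_VARS)


# Opt out for the current process (assumed to be needed before the import).
os.environ["DO_NOT_TRACK"] = "true"
print(telemetry_opted_out())
```

The same effect can be had by exporting the variable in the shell or container environment, which avoids any dependence on import order.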
**Describe the bug** Passing `unstructured.cleaners.core.group_bullet_paragraph` to the `post_processors` argument of `UnstructuredBaseLoader` breaks the code: `group_bullet_paragraph` returns a `List[str]`, and the `unstructured.documents.elements.Text.apply()` method checks the output of `group_bullet_paragraph` and throws an...
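One possible workaround for the type mismatch reported here is to wrap the list-returning cleaner so that it yields a single string. This is a minimal sketch under assumptions, not a fix from the library: `join_list_output` and `fake_bullet_grouper` are hypothetical names, and the stand-in cleaner only imitates a function that returns `List[str]`.

```python
from typing import Callable, List


def join_list_output(
    cleaner: Callable[[str], List[str]], sep: str = "\n\n"
) -> Callable[[str], str]:
    """Wrap a cleaner that returns a list of strings so it returns one string."""

    def wrapped(text: str) -> str:
        return sep.join(cleaner(text))

    return wrapped


def fake_bullet_grouper(text: str) -> List[str]:
    """Stand-in for a list-returning cleaner; NOT the library function."""
    return [part.strip() for part in text.split("•") if part.strip()]


to_single_string = join_list_output(fake_bullet_grouper)
result = to_single_string("• first item • second item")
print(repr(result))
```

In a real pipeline the wrapped callable would be passed in place of the original cleaner; the separator is a free choice and determines how the grouped bullets read in the final text.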
I am using the `hi_res` model locally and tried it both with and without chunking. I also tried the chipper model via the API, but faced similar issues as...
I'm getting the following exception. Could someone please help? `office365.runtime.client_request_exception.ClientRequestException: ('AccessDenied', 'Either scp or roles claim need to be present in the token.', '403 Client Error: Forbidden for url: https://graph.microsoft.com/v1.0/users/[email protected]/drive')` ![image](https://github.com/Unstructured-IO/unstructured/assets/60917433/fc7eb3ac-b137-4654-a9d0-c67dd6b148c6)