llm-rag-vectordb-python
Bump unstructured from 0.10.16 to 0.14.3 in /data-analysis-tool
Bumps unstructured from 0.10.16 to 0.14.3.
Release notes
Sourced from unstructured's releases.
0.14.3
Enhancements
- Move category field from Text class to Element class.
- partition_docx() now supports pluggable picture sub-partitioners. A sub-partitioner that accepts a DOCX Paragraph and generates elements is now supported. This allows adding a custom sub-partitioner that extracts images and applies OCR or summarization for the image.
- Add VoyageAI embedder. Adds VoyageAI embeddings to support embedding via Voyage AI.
Features
Fixes
- Fix partition_pdf() to keep spaces in the text. The control character \t is now replaced with a space instead of being removed when merging inferred elements with embedded elements.
- Turn off XML resolve entities. Sets resolve_entities=False for XML parsing with lxml to avoid text being dynamically injected into the XML document.
- Add backward compatibility for the deprecated pdf_infer_table_structure parameter.
- Add the missing form_extraction_skip_tables argument to the partition_pdf_or_image call.
- Chromadb: change from Add to Upsert using element_id to make writes idempotent.
- Disable table_as_cells output by default to reduce overhead in partition; table_as_cells is now only produced when the env EXTACT_TABLE_AS_CELLS is true.
- Reduce excessive logging. Change per-page OCR info-level logging into detail-level trace logging.
- Replace try block in document_to_element_list for handling HTMLDocument. Use getattr(element, "type", "") to get the type attribute of an element when it exists. This is a more explicit way to handle the special case for HTML documents and prevents other types of attribute error from being silenced by the try block.
0.14.2
Enhancements
- Bump unstructured-inference==0.7.33.
Features
- Add attribution to the pinecone connector.
0.14.1
Enhancements
- Refactor code related to embedded text extraction. The embedded text extraction code is moved from unstructured-inference to unstructured.
Features
- Large improvements to the ingest process:
  - Support for multiprocessing and async, with limits for both.
  - Streamlined the process of mapping CLI invocations to the underlying code.
  - More granular steps introduced to give better control over the process (e.g. a dedicated step to uncompress files already in the local filesystem, a new optional staging step before upload).
  - Use the Python client when calling the unstructured API for partitioning or chunking.
  - Saving the final content is now a dedicated destination connector (local), set as the default if none is provided. Avoids adding new files locally if uploading elsewhere.
  - Leverage the last-modified date when deciding if new files should be downloaded and reprocessed.
- Add attribution to the pinecone connector.
- Add support for Python 3.12. unstructured now works with Python 3.12!
0.14.0
... (truncated)
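For reviewers: the last 0.14.3 fix in the release notes above swaps a try block for getattr(element, "type", ""). A minimal sketch of why that matters follows. The classes and function names are hypothetical stand-ins, not code from unstructured; the point is only that a broad except AttributeError can hide an unrelated bug, while getattr with a default tolerates just the one missing attribute.

```python
class PlainElement:
    """Hypothetical element with no `type` attribute."""


class ImageElement:
    """Hypothetical element with a `type` attribute but a missing method,
    standing in for an unrelated bug elsewhere in the code path."""
    type = "Image"


def label_with_try(element):
    # Broad try block: every AttributeError raised inside is treated
    # as "this element has no type", including unrelated ones.
    try:
        if element.type == "Image":
            return element.to_record()  # hypothetical method, not defined above
    except AttributeError:
        pass
    return "untyped"


def label_with_getattr(element):
    # Only the `type` lookup is tolerant; any other error propagates.
    if getattr(element, "type", "") == "Image":
        return element.to_record()  # hypothetical method, not defined above
    return "untyped"
```

With these stand-ins, both functions return "untyped" for PlainElement, but only label_with_getattr surfaces the AttributeError caused by the missing to_record on ImageElement; label_with_try silently returns "untyped".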
Changelog
Sourced from unstructured's changelog.
0.14.3
Enhancements
- Move category field from Text class to Element class.
- partition_docx() now supports pluggable picture sub-partitioners. A sub-partitioner that accepts a DOCX Paragraph and generates elements is now supported. This allows adding a custom sub-partitioner that extracts images and applies OCR or summarization for the image.
- Add VoyageAI embedder. Adds VoyageAI embeddings to support embedding via Voyage AI.
Features
Fixes
- Fix partition_pdf() to keep spaces in the text. The control character \t is now replaced with a space instead of being removed when merging inferred elements with embedded elements.
- Turn off XML resolve entities. Sets resolve_entities=False for XML parsing with lxml to avoid text being dynamically injected into the XML document.
- Add backward compatibility for the deprecated pdf_infer_table_structure parameter.
- Add the missing form_extraction_skip_tables argument to the partition_pdf_or_image call.
- Chromadb: change from Add to Upsert using element_id to make writes idempotent.
- Disable table_as_cells output by default to reduce overhead in partition; table_as_cells is now only produced when the env EXTACT_TABLE_AS_CELLS is true.
- Reduce excessive logging. Change per-page OCR info-level logging into detail-level trace logging.
- Replace try block in document_to_element_list for handling HTMLDocument. Use getattr(element, "type", "") to get the type attribute of an element when it exists. This is a more explicit way to handle the special case for HTML documents and prevents other types of attribute error from being silenced by the try block.
0.14.2
Enhancements
- Bump unstructured-inference==0.7.33.
Features
- Add attribution to the pinecone connector.
Fixes
0.14.1
Enhancements
- Refactor code related to embedded text extraction. The embedded text extraction code is moved from unstructured-inference to unstructured.
Features
- Large improvements to the ingest process:
  - Support for multiprocessing and async, with limits for both.
  - Streamlined the process of mapping CLI invocations to the underlying code.
  - More granular steps introduced to give better control over the process (e.g. a dedicated step to uncompress files already in the local filesystem, a new optional staging step before upload).
  - Use the Python client when calling the unstructured API for partitioning or chunking.
  - Saving the final content is now a dedicated destination connector (local), set as the default if none is provided. Avoids adding new files locally if uploading elsewhere.
  - Leverage the last-modified date when deciding if new files should be downloaded and reprocessed.
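The last-modified check in the ingest list above can be pictured roughly as follows. This is a sketch under assumptions, not the ingest code itself: the function name and the idea of comparing two timezone-aware timestamps are illustrative only.

```python
from datetime import datetime, timezone
from typing import Optional


def needs_download(remote_modified: datetime,
                   local_modified: Optional[datetime]) -> bool:
    """Return True when no local copy exists or the remote copy is newer.

    Hypothetical helper: both timestamps are assumed to be timezone-aware.
    """
    if local_modified is None:
        return True  # nothing downloaded yet
    return remote_modified > local_modified


# Example: the remote file changed after the local copy was fetched.
fetched = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
changed = datetime(2024, 5, 2, 9, 30, tzinfo=timezone.utc)
should_refetch = needs_download(changed, fetched)
```

Skipping files whose remote timestamp has not advanced avoids both the download and the reprocessing step for unchanged inputs.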
... (truncated)
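The 0.14.3 Chroma fix listed in the changelog above moves from Add to Upsert keyed on element_id so that repeated runs are idempotent. The sketch below illustrates the idea with a toy in-memory store. It is not the chromadb API, and how chromadb itself treats duplicate ids on add is not modelled here.

```python
class InMemoryCollection:
    """Toy stand-in for a vector store collection (hypothetical)."""

    def __init__(self):
        self.rows = []   # written by add(): append-only
        self.by_id = {}  # written by upsert(): keyed on element_id

    def add(self, element_id, text):
        # Appends blindly, so re-running an ingest duplicates rows.
        self.rows.append((element_id, text))

    def upsert(self, element_id, text):
        # Insert-or-replace by id, so re-running leaves one row per id.
        self.by_id[element_id] = text


collection = InMemoryCollection()
for _ in range(2):  # simulate running the same ingest twice
    collection.add("el-1", "hello")
    collection.upsert("el-1", "hello")
```

After the two simulated runs, the append-only list holds two copies of el-1 while the upsert-backed mapping still holds one.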
Commits
- f445724 fix: partition_pdf() removes spaces from the text (#3106)
- 3158169 fix: uninstall bson for mongo connector (#3104)
- 6b400b4 feat: add VoyageAI embeddings (#3069) (#3099)
- 32df4ee fix: disable table_as_cells output by default (#3093)
- 809c7e5 chore: reduce excessive logging (#3095)
- 26d403d fix: add missing params to ElementMetadata (#3092)
- 35ec21e fix: decide table extraction (#3090)
- 31a53c8 Fix: Chroma Upsert instead of Add (#3086)
- 47d2861 feat(docx): add pluggable picture sub-partitioner (#3081)
- 171b5df fix: set resolve_entities=False in partition_xml (#3088)
- Additional commits viewable in compare view
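As a quick illustration of the first commit in the list (#3106): the release notes say the control character \t is now replaced with a space instead of being removed. The two helpers below are hypothetical and only contrast the two behaviours on a plain string; they are not the partition_pdf() code.

```python
def drop_tabs(text: str) -> str:
    # Old behaviour as described in the notes: the tab is removed,
    # which fuses the neighbouring words.
    return text.replace("\t", "")


def tabs_to_spaces(text: str) -> str:
    # New behaviour as described in the notes: the tab becomes a space,
    # so word boundaries survive.
    return text.replace("\t", " ")


sample = "Total\tamount"
fused = drop_tabs(sample)
spaced = tabs_to_spaces(sample)
```

Here fused is "Totalamount" and spaced is "Total amount", which is the difference downstream tokenizers and chunkers would see.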
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.
Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:
- @dependabot rebase will rebase this PR
- @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
- @dependabot merge will merge this PR after your CI passes on it
- @dependabot squash and merge will squash and merge this PR after your CI passes on it
- @dependabot cancel merge will cancel a previously requested merge and block automerging
- @dependabot reopen will reopen this PR if it is closed
- @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
- @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the Security Alerts page.