unstructured
Open-source libraries and APIs for building custom preprocessing pipelines for labeling, training, or production machine learning.
Related to issue #2664. Not at all confident about the second commit. I ran the make command in a new Python env, but somehow, a lot of things seem to...
First of all, really cool software 💯 While doing a license check, I noticed that the `pillow-heif` dependency is actually GPLv2 with the binary wheels. Source: https://github.com/bigcat88/pillow_heif/issues/111 I think we...
This minor change updates the URL of the [Weaviate Docker image](https://weaviate.io/developers/weaviate/installation/docker-compose). Instead of the standard Docker registry, Weaviate now makes use of a custom registry running at `cr.weaviate.io`. Thanks in...
Add support for detecting table caption tags within tables and by themselves. More on caption tags: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/caption
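As a rough illustration of what detecting caption tags "within tables and by themselves" could involve, here is a minimal sketch using only the Python standard library. It is not the code from this PR: the `CaptionCollector` class and its attributes are hypothetical names introduced for the example, and it assumes the input is an HTML string.

```python
from html.parser import HTMLParser


class CaptionCollector(HTMLParser):
    """Hypothetical helper: collects <caption> text and records whether
    each caption appeared inside a <table> or on its own."""

    def __init__(self):
        super().__init__()
        self.captions = []  # list of (caption_text, inside_table) tuples
        self._table_depth = 0
        self._in_caption = False
        self._caption_in_table = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table_depth += 1
        elif tag == "caption":
            self._in_caption = True
            self._caption_in_table = self._table_depth > 0
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "table" and self._table_depth > 0:
            self._table_depth -= 1
        elif tag == "caption" and self._in_caption:
            text = "".join(self._buf).strip()
            self.captions.append((text, self._caption_in_table))
            self._in_caption = False

    def handle_data(self, data):
        # Only text between <caption> and </caption> is kept.
        if self._in_caption:
            self._buf.append(data)


def find_captions(html: str):
    """Return (text, inside_table) for every caption tag in the input."""
    collector = CaptionCollector()
    collector.feed(html)
    collector.close()
    return collector.captions
```

Calling `find_captions` on a document that has one caption inside a table and one outside it yields two tuples, with the boolean telling the two cases apart. `html.parser` tokenizes without enforcing HTML nesting rules, which is why a caption outside any table is still reported.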
**Summary** The indent-level for a bullet in DOCX is stored in the XML as an `int`. However, Word is tolerant of a floating-point value in that field and does not...
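The tolerance described above can be sketched as a small parsing helper. This is an illustration only, not the fix from this PR: `parse_indent_level` is a hypothetical name, and truncating a floating-point value toward zero is an assumption, since the summary is cut off before it says how Word treats such values.

```python
def parse_indent_level(raw: str, default: int = 0) -> int:
    """Hypothetical helper: parse an indent-level attribute value.

    Accepts plain integers ("2") and float-like strings ("1.0").
    Assumption: a float value is truncated toward zero.
    Unparseable input falls back to `default`.
    """
    try:
        return int(raw)
    except ValueError:
        pass  # not a plain integer; try the float form next
    try:
        return int(float(raw))
    except (ValueError, OverflowError):
        # ValueError: not numeric at all, or NaN
        # OverflowError: infinity cannot be converted to int
        return default
```

Trying `int` first keeps the common case exact and only falls back to `float` when the strict parse fails.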
**Describe the bug** I came across a webpage that is being detected as a CSV file. It should be detected as HTML. The page, unfortunately, returns its content type as:...
The current Platform Documentation listed below does not mention required permissions for the Google Cloud service account keys. _Requested changes:_ On the Google Cloud Service source connector documentation https://unstructured-io.github.io/unstructured/platforms/platform_sources/google_cloud_source.html, can...
**Describe the bug** Unable to run [unstructured chunking](https://unstructured-io.github.io/unstructured/core/chunking.html#calling-a-chunking-function). I'm getting PDFPageCountError. **To Reproduce** Same as above **Expected behavior** Run smoothly **Screenshots** If applicable, add screenshots to help explain your problem....
Adds source and destination connectors for Kafka.
**Describe the bug** Getting an error when using unstructured + langchain. Only happens in 0.12.6. Cannot repro in 0.12.5. The error: ``` 55 IS_PYSTON = hasattr(sys, "pyston_version_info") 56 HAS_REFCOUNT =...