
[BUG] Word docx failing embedding

Open vap0rtranz opened this issue 1 year ago • 2 comments

Description

Embeddings are failing for Word docx format.

The unstructured loader/reader gives an error.

This is using nomic-embed-text as the embedding model.

Reproduction steps

1. In UI, select "Click to Upload" and attach local Word docx 
2. Select "Upload and Index"
3. See the error (logs below)

Screenshots

![DESCRIPTION](LINK.png)

Logs

Using reader <kotaemon.loaders.unstructured_loader.UnstructuredReader object at 0x7f984bfba020>
No module named 'unstructured'
Traceback (most recent call last):
  File "/media/justin/external/CodeReady/venv-external/lib/python3.10/site-packages/ktem/index/file/pipelines.py", line 795, in stream
    file_id, docs = yield from pipeline.stream(
  File "/media/justin/external/CodeReady/venv-external/lib/python3.10/site-packages/ktem/index/file/pipelines.py", line 642, in stream
    docs = self.loader.load_data(file_path, extra_info=extra_info)
  File "/media/justin/external/CodeReady/venv-external/lib/python3.10/site-packages/kotaemon/loaders/unstructured_loader.py", line 70, in load_data
    from unstructured.partition.auto import partition
ModuleNotFoundError: No module named 'unstructured'

Browsers

No response

OS

Linux

Additional information

No response

vap0rtranz avatar Oct 28 '24 00:10 vap0rtranz

The 'unstructured' module might not be installed. You can install it with pip: pip install unstructured
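To confirm the diagnosis before uploading again, here is a minimal sketch that checks whether the module can be found in the active environment. It has to run with the same interpreter as kotaemon (the venv shown in the traceback). The helper name `missing_modules` is hypothetical, and checking for `docx` (the python-docx package) is only an assumption about what .docx parsing may need; the traceback itself names only `unstructured`.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "unstructured" comes from the traceback above.
    # "docx" (python-docx) is an assumption, not something the logs mention.
    missing = missing_modules(["unstructured", "docx"])
    print("Interpreter:", sys.executable)
    print("Missing modules:", missing if missing else "none")
```

If `unstructured` shows up as missing, installing it into that same venv should clear the ModuleNotFoundError. The unstructured project also publishes per-format extras, so `pip install "unstructured[docx]"` may pull in the Word-specific dependencies in one step, though that has not been verified against this setup.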

KKenny0 avatar Oct 31 '24 03:10 KKenny0

Hmm, OK I installed unstructured. It was indeed not installed. Now there's a different error that blocks the indexing.

It may be faster to reinstall but I've had installation issues: https://github.com/Cinnamon/kotaemon/issues/425

vap0rtranz avatar Nov 01 '24 23:11 vap0rtranz