
'NoneType' object has no attribute 'tokenize'

Open micuentadecasa opened this issue 2 years ago • 12 comments

I'm using Cohere and Unstructured, and I'm receiving that error when trying to load a PDF. It works fine with the simple reader, but not with the PDF-specific options.

This is the log:

```
ℹ Received Data to Import: READER(PDFReader, Documents 1, Type Documentation) CHUNKER (TokenChunker, UNITS 250, OVERLAP 50), EMBEDDER (MiniLMEmbedder)
✔ Loaded ai-03-00057.pdf
✔ Loaded 1 documents
Chunking documents: 100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 37.20it/s]
✔ Chunking completed
Vectorizing document chunks:   0%| | 0/1 [00:00<?, ?it/s]
✘ Loading data failed
'NoneType' object has no attribute 'tokenize'
```
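For what it's worth, the error itself just means that an attribute which should hold a tokenizer was left as `None`, presumably because the model load failed silently. A minimal, hypothetical reproduction (`MiniLMEmbedderSketch` is illustrative, not Verba's actual class):

```python
# Hypothetical sketch reproducing the reported error: if the tokenizer
# never loads, the attribute stays None and the first use raises
# AttributeError with exactly this message.
class MiniLMEmbedderSketch:
    def __init__(self):
        # In the real embedder this would be e.g. AutoTokenizer.from_pretrained(...);
        # here we simulate a silent load failure.
        self.tokenizer = None

    def vectorize(self, text):
        return self.tokenizer.tokenize(text)

try:
    MiniLMEmbedderSketch().vectorize("hello")
except AttributeError as e:
    print(e)  # 'NoneType' object has no attribute 'tokenize'
```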

Regards.

micuentadecasa avatar Nov 23 '23 22:11 micuentadecasa

Thanks for the issue! It looks like you're using the SentenceTransformer MiniLM model to embed the chunks. Is that intended? There might be some missing dependencies; are you running Verba in a fresh Python environment?

thomashacker avatar Nov 24 '23 13:11 thomashacker

I tried all the possibilities; using MiniLM was just one attempt.

This is the log I got in another try:

```
ℹ Received Data to Import: READER(UnstructuredPDF, Documents 1, Type Documentation) CHUNKER (SentenceChunker, UNITS 3, OVERLAP 2), EMBEDDER (CohereEmbedder)
✔ Loaded xxx.pdf
✔ Loaded 1 documents
Chunking documents: 100%|██████████| 1/1 [00:00<00:00, 28.90it/s]
✔ Chunking completed
ℹ (1/1) Importing document xxxx.pdf with 2 batches
✘ {'errors': {'error': [{'message': 'update vector: API Key: no api key found neither in request header: X-Openai-Api-Key nor in environment variable under OPENAI_APIKEY'}]}, 'status': 'FAILED'}
Importing batches: 100%|██████████| 2/2 [00:03<00:00, 1.80s/it]
✘ Loading data failed
Document 09a44f39-fb85-4182-b853-b0990925f7fc not found
None
```
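The failure mode is visible in the message itself: the underlying Weaviate vectorizer module looks for an OpenAI key in the request header or the `OPENAI_APIKEY` environment variable, regardless of which embedder was selected in the client. A rough, hypothetical sketch of that lookup (`resolve_openai_key` is illustrative, not Weaviate's actual code):

```python
import os

# Hypothetical sketch of the key lookup the error message describes:
# first the X-Openai-Api-Key request header, then the OPENAI_APIKEY
# environment variable. If the schema is configured for text2vec-openai,
# this lookup runs even when a Cohere embedder was chosen client-side.
def resolve_openai_key(headers, env=os.environ):
    return headers.get("X-Openai-Api-Key") or env.get("OPENAI_APIKEY")

print(resolve_openai_key({}, env={}))  # None -> "no api key found" error
```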

It seems it is trying to use the OpenAI embedder even though the Cohere one is set.

Regards.

micuentadecasa avatar Nov 26 '23 22:11 micuentadecasa

Thanks for the insights! I'll look into fixing this 👍

thomashacker avatar Nov 28 '23 17:11 thomashacker

We merged some fixes, are you still getting these errors?

thomashacker avatar Dec 05 '23 10:12 thomashacker

I was getting the same error and found out that it was due to:

```
⚠ Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate:
`pip install accelerate`
```

Perhaps adding accelerate as a direct dependency of Verba would be desirable?

f0rmiga avatar Jan 14 '24 00:01 f0rmiga

https://github.com/weaviate/Verba/blob/1c9d4b49385315883ba0027ac1772a8b448f6204/goldenverba/components/embedding/MiniLMEmbedder.py#L26-L42

`device_map` should be of type `str` or `dict`, not `torch.device`:

https://github.com/huggingface/transformers/blob/edb170238febf7fc3e3278ed5b9ca0b2c40c70e3/src/transformers/tools/base.py#L460-L461
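A minimal sketch of the kind of type check that trips here, assuming the linked code rejects anything that isn't a `str` or `dict` (`validate_device_map` is a hypothetical stand-in, not the actual transformers function):

```python
# Hypothetical sketch: transformers/accelerate expect device_map to be a
# str (e.g. "auto", "cpu") or a dict mapping module names to devices.
# Passing a torch.device object, as Verba's MiniLMEmbedder did, fails
# this kind of check.
def validate_device_map(device_map):
    if device_map is not None and not isinstance(device_map, (str, dict)):
        raise ValueError(
            f"device_map should be a str or dict, got {type(device_map).__name__}"
        )
    return device_map

print(validate_device_map("auto"))  # accepted
# validate_device_map(torch.device("cpu"))  # would raise ValueError
```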

rayliuca avatar Jan 16 '24 00:01 rayliuca

I was getting the same error when using MiniLMEmbedder on my Mac, which doesn't have a CUDA GPU. So I tried @f0rmiga's solution and updated my code like this:

```python
from accelerate import Accelerator

accelerator = Accelerator()
```

After `self.device = get_device()` I added `self.device = accelerator.device`.

Now MiniLMEmbedder works fine and the document's chunks are being vectorized.

moncefarajdal avatar Mar 15 '24 09:03 moncefarajdal

This should be fixed with the newest v1.0.0 version!

thomashacker avatar May 15 '24 17:05 thomashacker

With AdaEmbedder on Azure OpenAI the issue still persists:

```
✘ {'errors': {'error': [{'message': "update vector: unmarshal response body: invalid character '<' looking for beginning of value"}]}, 'status': 'FAILED'}
```

I am using goldenverba version 1.0.1. Also, inside `schema_generation.py`, the `deploymentId`, `resourceName`, and `baseURL` fields under `"text2vec-openai"` are defined and correct.
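For what it's worth, "invalid character '<' looking for beginning of value" is the error Go's JSON decoder produces when the body it tries to unmarshal starts with `<`, i.e. the Azure endpoint most likely returned an HTML error page instead of JSON (often a sign of a wrong `baseURL`, resource name, or deployment ID). Python's `json` module fails the same way on the same input:

```python
import json

# Illustration of the failure mode: parsing an HTML error page as JSON
# fails on the very first character, '<'. The HTML body below is made up.
try:
    json.loads("<html><body>404 Not Found</body></html>")
except json.JSONDecodeError as e:
    print("not JSON:", e)
```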

sbhadana avatar May 18 '24 17:05 sbhadana

Which `openai` version do you have installed?

thomashacker avatar May 19 '24 17:05 thomashacker

I have installed version 0.27.9; however, I also tried 1.30.1 with the same error.

sbhadana avatar May 20 '24 11:05 sbhadana

Make sure to use version 0.27.9. I'll take a closer look at the Azure implementation.

thomashacker avatar May 20 '24 18:05 thomashacker