'NoneType' object has no attribute 'tokenize'
I'm using Cohere and unstructured, and I'm receiving that error when trying to load a PDF. It works fine with the simple reader, but not with the PDF options.
This is the log:
ℹ Received Data to Import: READER(PDFReader, Documents 1, Type Documentation) CHUNKER (TokenChunker, UNITS 250, OVERLAP 50), EMBEDDER (MiniLMEmbedder)
✔ Loaded ai-03-00057.pdf
✔ Loaded 1 documents
Chunking documents: 100%|████████████████████████████████████████████| 1/1 [00:00<00:00, 37.20it/s]
✔ Chunking completed
Vectorizing document chunks: 0%| | 0/1 [00:00<?, ?it/s]
✘ Loading data failed 'NoneType' object has no attribute 'tokenize'
Regards.
Thanks for the issue! It looks like you're using the SentenceTransformer MiniLM model to embed the chunks; is that intended? It might be that some dependencies are missing. Are you running Verba in a fresh Python environment?
I tried all the possibilities; using MiniLM was just one attempt.
This is the log I got on another try:
ℹ Received Data to Import: READER(UnstructuredPDF, Documents 1, Type Documentation) CHUNKER (SentenceChunker, UNITS 3, OVERLAP 2), EMBEDDER (CohereEmbedder)
✔ Loaded xxx.pdf
✔ Loaded 1 documents
Chunking documents: 100%|██████████| 1/1 [00:00<00:00, 28.90it/s]
✔ Chunking completed
ℹ (1/1) Importing document xxxx.pdf with 2 batches
✘ {'errors': {'error': [{'message': 'update vector: API Key: no api key found neither in request header: X-Openai-Api-Key nor in environment variable under OPENAI_APIKEY'}]}, 'status': 'FAILED'}
Importing batches: 100%|██████████| 2/2 [00:03<00:00, 1.80s/it]
✘ Loading data failed Document 09a44f39-fb85-4182-b853-b0990925f7fc not found None
It seems it is trying to use OpenAI even though the Cohere embedder is selected.
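For what it's worth, that exact error string comes from Weaviate's text2vec-openai module, so my guess is that the class schema itself was created with the OpenAI vectorizer. A minimal sketch of a Cohere-backed class using the v3 Python client (placeholder names, not Verba's actual schema):

```python
import weaviate

client = weaviate.Client("http://localhost:8080")  # placeholder URL

# A class whose vectors are produced by Cohere rather than OpenAI.
cohere_class = {
    "class": "DocumentChunk",          # placeholder class name
    "vectorizer": "text2vec-cohere",   # instead of text2vec-openai
    "moduleConfig": {
        "text2vec-cohere": {
            "model": "embed-multilingual-v2.0",  # placeholder model choice
        }
    },
}
client.schema.create_class(cohere_class)
```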
Regards.
Thanks for the insights! I'll look into fixing this 👍
We merged some fixes, are you still getting these errors?
I was getting the same error and found out that it was due to:
⚠ Using `low_cpu_mem_usage=True` or a `device_map` requires Accelerate:
`pip install accelerate`
Perhaps adding accelerate as a direct dependency of Verba would be desirable?
https://github.com/weaviate/Verba/blob/1c9d4b49385315883ba0027ac1772a8b448f6204/goldenverba/components/embedding/MiniLMEmbedder.py#L26-L42
`device_map` should be of type str or dict, not torch.device:
https://github.com/huggingface/transformers/blob/edb170238febf7fc3e3278ed5b9ca0b2c40c70e3/src/transformers/tools/base.py#L460-L461
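For illustration, here is a minimal sketch of a load call with a `device_map` of an accepted type; the checkpoint name and the `get_device` helper are assumptions modelled on the linked file, not verbatim Verba code:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # assumed checkpoint

def get_device() -> str:
    # Return a plain string so it can go into a device_map;
    # from_pretrained rejects a raw torch.device here.
    return "cuda" if torch.cuda.is_available() else "cpu"

# Passing a device_map (here a dict mapping the whole model to one device)
# is what requires accelerate, hence the warning quoted above.
model = AutoModel.from_pretrained(MODEL_NAME, device_map={"": get_device()})
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
```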
I was getting the same error when using MiniLMEmbedder on my Mac, which doesn't have a CUDA GPU. So I tried @f0rmiga's solution and updated my code like this:
```python
from accelerate import Accelerator

accelerator = Accelerator()
```

Then, after `self.device = get_device()`, I added `self.device = accelerator.device`.
Now MiniLMEmbedder works fine and the document's chunks are being vectorized.
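Put together, a condensed sketch of the patched `__init__` (assuming the sentence-transformers MiniLM checkpoint; this is not Verba's exact code):

```python
from accelerate import Accelerator
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # assumed checkpoint

class MiniLMEmbedder:
    def __init__(self):
        # Accelerator().device resolves to cuda, mps (Apple Silicon), or cpu,
        # so this also works on a Mac without a CUDA GPU.
        self.device = Accelerator().device
        self.tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
        # Move the model explicitly rather than passing a torch.device as device_map.
        self.model = AutoModel.from_pretrained(MODEL_NAME).to(self.device)
        self.model.eval()
```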
This should be fixed with the newest v1.0.0 version!
With AdaEmbedder on Azure OpenAI the issue still persists:
✘ {'errors': {'error': [{'message': "update vector: unmarshal response body: invalid character '<' looking for beginning of value"}]}, 'status': 'FAILED'}
I am using goldenverba version 1.0.1. Also, inside schema_generation.py, the `deploymentId`, `resourceName`, and `baseURL` fields under `"text2vec-openai"` are defined and correct.
Which `openai` version have you installed?
I have installed version 0.27.9; however, I also tried 1.30.1 and got the same error.
Make sure to use the 0.27.9 version (`pip install openai==0.27.9`). I'll take a closer look at the Azure implementation.