onyx
Fail to index
Traceback (most recent call last):
File "/app/danswer/background/indexing/run_indexing.py", line 208, in _run_indexing
new_docs, total_batch_chunks = indexing_pipeline(
^^^^^^^^^^^^^^^^^^
File "/app/danswer/utils/timing.py", line 31, in wrapped_func
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/app/danswer/indexing/indexing_pipeline.py", line 201, in index_doc_batch
insertion_records = document_index.index(chunks=access_aware_chunks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/danswer/document_index/vespa/index.py", line 717, in index
return _clear_and_index_vespa_chunks(chunks=chunks, index_name=self.index_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/danswer/document_index/vespa/index.py", line 377, in _clear_and_index_vespa_chunks
_batch_index_vespa_chunks(
File "/app/danswer/document_index/vespa/index.py", line 332, in _batch_index_vespa_chunks
future.result()
File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/retry/api.py", line 73, in retry_decorator
return __retry_internal(partial(f, *args, **kwargs), exceptions, tries, delay, max_delay, backoff, jitter,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/retry/api.py", line 33, in __retry_internal
return f()
^^^
File "/app/danswer/document_index/vespa/index.py", line 310, in _index_vespa_chunk
raise e
File "/app/danswer/document_index/vespa/index.py", line 305, in _index_vespa_chunk
res.raise_for_status()
File "/usr/local/lib/python3.11/site-packages/httpx/_models.py", line 749, in raise_for_status
raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '400 Bad Request' for url 'http://index:8081/document/v1/default/danswer_chunk/docid/cbf9da65-dba1-56a6-8f9a-342a7002329b'
For more information check: https://httpstatuses.com/400
I can reach the index (danswer-stack-index-1) with curl:
curl http://index:8081/document/v1/default/danswer_chunk/docid/cbf9da65-dba1-56a6-8f9a-342a7002329b
{"pathId":"/document/v1/default/danswer_chunk/docid/cbf9da65-dba1-56a6-8f9a-342a7002329b","id":"id:default:danswer_chunk::cbf9da65-dba1-56a6-8f9a-342a7002329b"}
Hello, have you resolved this issue? I hit the same error.
The same problem here
same here
I am having the same issue here as well. I installed curl in the background container and can access the link without errors, but my documents still fail to index.
I discovered that one of my sites was failing because of a blocked user agent. The default scraper uses the basic 'python-requests' user agent, and some websites block it. Hacking an alternate user-agent string into the various requests.get calls in the web connector (danswer/connectors/web/connector.py) allowed the scraper to work.
I had the same issue, and specifying a user agent didn't solve it for me. I tried various sites. At the start it scrapes a few links, but after about 10 links I get the 400 error (updating the index or re-indexing always results in error 400).
The solution for me was to clone the repo again and rebuild the containers. I had moved files around, and something was messed up with the executable file permissions on a script.
I have the same error. No document can be indexed, regardless of the source (file or web).
Traceback (most recent call last):
File "/app/danswer/background/indexing/run_indexing.py", line 190, in _run_indexing
new_docs, total_batch_chunks = indexing_pipeline(
^^^^^^^^^^^^^^^^^^
File "/app/danswer/utils/timing.py", line 31, in wrapped_func
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/app/danswer/indexing/indexing_pipeline.py", line 199, in index_doc_batch
insertion_records = document_index.index(chunks=access_aware_chunks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/danswer/document_index/vespa/index.py", line 824, in index
return _clear_and_index_vespa_chunks(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/app/danswer/document_index/vespa/index.py", line 449, in _clear_and_index_vespa_chunks
_batch_index_vespa_chunks(
File "/app/danswer/document_index/vespa/index.py", line 404, in _batch_index_vespa_chunks
future.result()
File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/retry/api.py", line 73, in retry_decorator
return __retry_internal(partial(f, *args, **kwargs), exceptions, tries, delay, max_delay, backoff, jitter,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/retry/api.py", line 33, in __retry_internal
return f()
^^^
File "/app/danswer/document_index/vespa/index.py", line 382, in _index_vespa_chunk
raise e
File "/app/danswer/document_index/vespa/index.py", line 377, in _index_vespa_chunk
res.raise_for_status()
File "/usr/local/lib/python3.11/site-packages/httpx/_models.py", line 749, in raise_for_status
raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '400 Bad Request' for url 'http://index:8081/document/v1/default/danswer_chunk/docid/36478b3c-8d00-5601-860b-40b63255d3e4'
For more information check: https://httpstatuses.com/400
Another error from the background service log (the URL of the document has been anonymized):
07/11/2024 09:05:17 AM index.py 379 : [Attempt ID: 1] Failed to index document: 'https://....'. Got response: '{"pathId":"/document/v1/default/danswer_chunk/docid/dd3fdfbb-fee6-57e3-a93d-4858cc0c18bc","message":"Error in document 'id:default:danswer_chunk::dd3fdfbb-fee6-57e3-a93d-4858cc0c18bc' - could not parse field 'embeddings' of type 'tensor<float>(t{},x[768])': At {t:full_chunk}: Expected 768 values, but got 384: At {t:full_chunk}: Expected 768 values, but got 384"}'
Traceback (most recent call last):
File "/app/danswer/document_index/vespa/index.py", line 377, in _index_vespa_chunk
res.raise_for_status()
File "/usr/local/lib/python3.11/site-packages/httpx/_models.py", line 749, in raise_for_status
raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '400 Bad Request' for url 'http://index:8081/document/v1/default/danswer_chunk/docid/dd3fdfbb-fee6-57e3-a93d-4858cc0c18bc'
For more information check: https://httpstatuses.com/400
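The traceback only shows the 400 status, while the useful detail sits in the response body logged just before it. A minimal sketch of pulling the expected and actual tensor sizes out of such a body follows. The helper name `parse_dim_mismatch` is hypothetical and not part of Danswer, and the sample body is a shortened copy of the log line above.

```python
import json
import re

# Hypothetical helper, not part of Danswer. It matches the wording seen in the
# Vespa error message quoted in the log above.
_DIM_RE = re.compile(r"Expected (\d+) values, but got (\d+)")


def parse_dim_mismatch(response_body):
    """Return (expected, got) if the body reports a tensor size mismatch, else None."""
    try:
        payload = json.loads(response_body)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    message = payload.get("message", "")
    if not isinstance(message, str):
        return None
    match = _DIM_RE.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Shortened version of the logged response body, used only as sample input.
sample_body = json.dumps({
    "pathId": "/document/v1/default/danswer_chunk/docid/dd3fdfbb-fee6-57e3-a93d-4858cc0c18bc",
    "message": "could not parse field 'embeddings': Expected 768 values, but got 384",
})
mismatch = parse_dim_mismatch(sample_body)
print(mismatch)
```

If this returns a pair of numbers, the 400 is a dimension mismatch between the embedding model and the index schema rather than a connectivity problem.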
I had the same problem, including the background service logging "Expected 768 values, but got 384" and I figured it out.
I have DOCUMENT_ENCODER_MODEL="intfloat/multilingual-e5-small" in my .env file. That embedding model uses 384 dimensions, but the index expects 768 dimensions by default. Adding the setting DOC_EMBEDDING_DIM=384 to .env makes everything work for me.
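The mismatch described here can be caught before indexing with a small sanity check. The sketch below is illustrative only: `check_embedding_config` and the `KNOWN_MODEL_DIMS` table are hypothetical names, not Danswer code. The 384 value for intfloat/multilingual-e5-small and the 768 default are taken from the comment above.

```python
from typing import Optional

# Values taken from this thread; extend the table for other models you use.
DEFAULT_DIM = 768
KNOWN_MODEL_DIMS = {"intfloat/multilingual-e5-small": 384}


def check_embedding_config(env: dict) -> Optional[str]:
    """Return a warning string if the configured dimension does not match the model."""
    model = env.get("DOCUMENT_ENCODER_MODEL")
    configured = int(env.get("DOC_EMBEDDING_DIM", DEFAULT_DIM))
    actual = KNOWN_MODEL_DIMS.get(model)
    if actual is not None and actual != configured:
        return (
            f"{model} produces {actual} values but the index expects "
            f"{configured}; set DOC_EMBEDDING_DIM={actual}"
        )
    return None


# Settings as in the failing setup: model set, dimension left at the default.
broken = {"DOCUMENT_ENCODER_MODEL": "intfloat/multilingual-e5-small"}
# Settings after the fix described above.
fixed = dict(broken, DOC_EMBEDDING_DIM="384")

print(check_embedding_config(broken))
print(check_embedding_config(fixed))
```

In practice you would pass `os.environ` or the parsed contents of your .env file instead of the sample dictionaries.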
@bartschuller Many thanks for the tip. Setting the parameter DOC_EMBEDDING_DIM=384 has solved my problem.
Did you guys have to start over with an empty vespa instance? I've applied the .env parameter but I'm still getting the same error. Curiously I have an almost identical setup running on another server that doesn't manifest this problem.
@jcerar I guess so, the setting ultimately has to translate into a different index configuration for Vespa, so that means you need to start fresh.
Thanks. The vespa clean slate alone didn't do it for me but your comment put me on the right track. I wanted to retain the contents of my relational db so I didn't want to start over from scratch. The following sql statement on the psql db is the secret sauce that eventually worked for me:
update embedding_model set model_dim=384 where model_name='intfloat/multilingual-e5-small';
Tore everything down, nuked vespa volume, brought back up & indexing now works.
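For anyone scripting that update, here is a rough sketch of the same statement issued with bound parameters and a row-count check. It runs against an in-memory SQLite table as a stand-in, so it is not the real Danswer schema: only the table name and the two columns from the statement above are reproduced, and `set_model_dim` is a hypothetical helper. Against the actual Postgres database you would use a Postgres driver, whose placeholder syntax typically differs from SQLite's `?`.

```python
import sqlite3


def set_model_dim(conn, model_name, model_dim):
    """Update the stored dimension for one model and return the number of rows changed."""
    cur = conn.execute(
        "UPDATE embedding_model SET model_dim = ? WHERE model_name = ?",
        (model_dim, model_name),
    )
    conn.commit()
    return cur.rowcount


# Stand-in table holding only the two columns named in the statement above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE embedding_model (model_name TEXT, model_dim INTEGER)")
conn.execute(
    "INSERT INTO embedding_model VALUES (?, ?)",
    ("intfloat/multilingual-e5-small", 768),
)

updated = set_model_dim(conn, "intfloat/multilingual-e5-small", 384)
stored = conn.execute(
    "SELECT model_dim FROM embedding_model WHERE model_name = ?",
    ("intfloat/multilingual-e5-small",),
).fetchone()[0]
print(updated, stored)
```

A return value of 0 means the model name did not match any row, which is worth checking before tearing down the Vespa volume.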
I was getting a 403 for one of my sites. There's an issue with the default user agent. I modified connector.py and added the following lines in the check_internet_connection method:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
response = requests.get(url, headers=headers, timeout=3)
Now it works
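The same idea can be tried outside the connector before patching it. This is a minimal sketch using only the standard library, so it runs without the requests package. The connector itself uses requests.get as described above, and `build_request` is a hypothetical helper name. The user-agent string is the one from the comment above.

```python
import urllib.request

# Browser-like user-agent string quoted in the comment above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
)


def build_request(url, user_agent=BROWSER_UA):
    """Build a GET request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("https://example.com/")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))

# To actually send it (requires network access):
# with urllib.request.urlopen(req, timeout=3) as response:
#     print(response.status)
```

If the site returns 200 with this header but 403 without it, the block is user-agent based and the connector change above should help.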