
Fail to index

Open dtthanh1971 opened this issue 11 months ago • 12 comments

Traceback (most recent call last):
  File "/app/danswer/background/indexing/run_indexing.py", line 208, in _run_indexing
    new_docs, total_batch_chunks = indexing_pipeline(
                                   ^^^^^^^^^^^^^^^^^^
  File "/app/danswer/utils/timing.py", line 31, in wrapped_func
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/indexing/indexing_pipeline.py", line 201, in index_doc_batch
    insertion_records = document_index.index(chunks=access_aware_chunks)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/document_index/vespa/index.py", line 717, in index
    return _clear_and_index_vespa_chunks(chunks=chunks, index_name=self.index_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/document_index/vespa/index.py", line 377, in _clear_and_index_vespa_chunks
    _batch_index_vespa_chunks(
  File "/app/danswer/document_index/vespa/index.py", line 332, in _batch_index_vespa_chunks
    future.result()
  File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/retry/api.py", line 73, in retry_decorator
    return __retry_internal(partial(f, *args, **kwargs), exceptions, tries, delay, max_delay, backoff, jitter,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/retry/api.py", line 33, in __retry_internal
    return f()
           ^^^
  File "/app/danswer/document_index/vespa/index.py", line 310, in _index_vespa_chunk
    raise e
  File "/app/danswer/document_index/vespa/index.py", line 305, in _index_vespa_chunk
    res.raise_for_status()
  File "/usr/local/lib/python3.11/site-packages/httpx/_models.py", line 749, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '400 Bad Request' for url 'http://index:8081/document/v1/default/danswer_chunk/docid/cbf9da65-dba1-56a6-8f9a-342a7002329b'
For more information check: https://httpstatuses.com/400

I can access the index container (danswer-stack-index-1) using curl:

curl http://index:8081/document/v1/default/danswer_chunk/docid/cbf9da65-dba1-56a6-8f9a-342a7002329b
{"pathId":"/document/v1/default/danswer_chunk/docid/cbf9da65-dba1-56a6-8f9a-342a7002329b","id":"id:default:danswer_chunk::cbf9da65-dba1-56a6-8f9a-342a7002329b"}

dtthanh1971 avatar Mar 10 '24 00:03 dtthanh1971

Hello, have you resolved this issue? I'm hitting the same error.

ccslience avatar Apr 29 '24 08:04 ccslience

The same problem here

mforbak avatar May 09 '24 16:05 mforbak

same here

tuggeluk avatar May 12 '24 16:05 tuggeluk

I am having the same issue here as well. I installed curl in the background container and can access the link without errors, but my documents still fail to index.

EmmanuelGirin avatar May 22 '24 14:05 EmmanuelGirin

I discovered one of my sites was failing due to a blocked user-agent. The default scraper uses the basic 'python-requests' user-agent, and some websites block it. Hacking an alternate user-agent string into the various requests.get calls in the web connector (danswer/connectors/web/connector.py) allowed the scraper to work.

carlchan avatar Jun 04 '24 15:06 carlchan
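
A minimal sketch of the workaround carlchan describes, as a standalone script rather than a patch to the connector (the user-agent string and URL are illustrative placeholders):

import requests

# Some sites block the default 'python-requests' user-agent, so send a
# browser-like one on every request via a shared session.
session = requests.Session()
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36"
    )
})

resp = session.get("https://example.com", timeout=10)  # placeholder URL
resp.raise_for_status()
print(resp.status_code)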

I had the same issue; specifying a user agent didn't solve it for me. I tried various sites: at the start it scrapes a few links, but after about 10 links I get the 400 error (updating the index or re-indexing always results in error 400).

The solution for me was to clone the repo again and rebuild the containers. I had moved files around, and something was messed up with executable file permissions on a script.

crNewton avatar Jun 10 '24 07:06 crNewton

I have the same error. No document can be indexed, regardless of the source (file or web).

Traceback (most recent call last):
  File "/app/danswer/background/indexing/run_indexing.py", line 190, in _run_indexing
    new_docs, total_batch_chunks = indexing_pipeline(
                                   ^^^^^^^^^^^^^^^^^^
  File "/app/danswer/utils/timing.py", line 31, in wrapped_func
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/indexing/indexing_pipeline.py", line 199, in index_doc_batch
    insertion_records = document_index.index(chunks=access_aware_chunks)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/document_index/vespa/index.py", line 824, in index
    return _clear_and_index_vespa_chunks(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/danswer/document_index/vespa/index.py", line 449, in _clear_and_index_vespa_chunks
    _batch_index_vespa_chunks(
  File "/app/danswer/document_index/vespa/index.py", line 404, in _batch_index_vespa_chunks
    future.result()
  File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/usr/local/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/retry/api.py", line 73, in retry_decorator
    return __retry_internal(partial(f, *args, **kwargs), exceptions, tries, delay, max_delay, backoff, jitter,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/retry/api.py", line 33, in __retry_internal
    return f()
           ^^^
  File "/app/danswer/document_index/vespa/index.py", line 382, in _index_vespa_chunk
    raise e
  File "/app/danswer/document_index/vespa/index.py", line 377, in _index_vespa_chunk
    res.raise_for_status()
  File "/usr/local/lib/python3.11/site-packages/httpx/_models.py", line 749, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '400 Bad Request' for url 'http://index:8081/document/v1/default/danswer_chunk/docid/36478b3c-8d00-5601-860b-40b63255d3e4'
For more information check: https://httpstatuses.com/400

Another error appears in the background service log (the URL of the document has been anonymized):

07/11/2024 09:05:17 AM             index.py 379 : [Attempt ID: 1] Failed to index document: 'https://....'. Got response: '{"pathId":"/document/v1/default/danswer_chunk/docid/dd3fdfbb-fee6-57e3-a93d-4858cc0c18bc","message":"Error in document 'id:default:danswer_chunk::dd3fdfbb-fee6-57e3-a93d-4858cc0c18bc' - could not parse field 'embeddings' of type 'tensor<float>(t{},x[768])': At {t:full_chunk}: Expected 768 values, but got 384: At {t:full_chunk}: Expected 768 values, but got 384"}'
Traceback (most recent call last):
  File "/app/danswer/document_index/vespa/index.py", line 377, in _index_vespa_chunk
    res.raise_for_status()
  File "/usr/local/lib/python3.11/site-packages/httpx/_models.py", line 749, in raise_for_status
    raise HTTPStatusError(message, request=request, response=self)
httpx.HTTPStatusError: Client error '400 Bad Request' for url 'http://index:8081/document/v1/default/danswer_chunk/docid/dd3fdfbb-fee6-57e3-a93d-4858cc0c18bc'
For more information check: https://httpstatuses.com/400

argauerc avatar Jul 11 '24 09:07 argauerc
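
For context, the "Expected 768 values, but got 384" message indicates that the Vespa schema declares the embeddings field with a fixed width of 768, so a model producing 384-dimensional vectors is rejected at write time. A rough sketch of such a schema field (only the field name and tensor type come from the error message; the surrounding structure is illustrative):

schema danswer_chunk {
    document danswer_chunk {
        # Mapped dimension 't' holds one cell per chunk variant; indexed
        # dimension 'x' is fixed at 768 floats, so 384-wide vectors fail to parse.
        field embeddings type tensor<float>(t{},x[768]) {
            indexing: attribute
        }
    }
}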

I had the same problem, including the background service logging "Expected 768 values, but got 384", and I figured it out. I have DOCUMENT_ENCODER_MODEL="intfloat/multilingual-e5-small" in my .env file, and that embedding model produces 384-dimensional vectors, but the index expects 768 dimensions by default. Adding the setting DOC_EMBEDDING_DIM=384 to .env makes everything work for me.

bartschuller avatar Jul 26 '24 09:07 bartschuller
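
For reference, the two settings bartschuller mentions, as they would appear together in the .env file:

DOCUMENT_ENCODER_MODEL="intfloat/multilingual-e5-small"
DOC_EMBEDDING_DIM=384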

@bartschuller Many thanks for the tip. Setting DOC_EMBEDDING_DIM=384 solved my problem.

argauerc avatar Jul 29 '24 13:07 argauerc

Did you guys have to start over with an empty Vespa instance? I've applied the .env parameter but I'm still getting the same error. Curiously, I have an almost identical setup running on another server that doesn't manifest this problem.

jcerar avatar Aug 08 '24 13:08 jcerar

@jcerar I guess so, the setting ultimately has to translate into a different index configuration for Vespa, so that means you need to start fresh.

bartschuller avatar Aug 08 '24 13:08 bartschuller

Thanks. The Vespa clean slate alone didn't do it for me, but your comment put me on the right track. I wanted to retain the contents of my relational DB, so I didn't want to start over from scratch. The following SQL statement on the psql DB is the secret sauce that eventually worked for me:

update embedding_model set model_dim=384 where model_name='intfloat/multilingual-e5-small';

Tore everything down, nuked the Vespa volume, brought it back up, and indexing now works.

jcerar avatar Aug 08 '24 14:08 jcerar
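
A sketch of the rebuild sequence jcerar describes, assuming the stock docker compose deployment (the compose file, project, and volume names are assumptions; check docker volume ls for the exact volume name on your host):

# Stop the stack, destroy only the Vespa data volume, then rebuild.
docker compose -f docker-compose.dev.yml -p danswer-stack down
docker volume rm danswer-stack_vespa_volume   # assumed volume name
docker compose -f docker-compose.dev.yml -p danswer-stack up -d --build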

I was getting a 403 for one of my sites. There's an issue with the default user agent. I modified connector.py, adding the following lines in the check_internet_connection method:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get(url, headers=headers, timeout=3)

Now it works.

vasylmukhar avatar Aug 29 '24 11:08 vasylmukhar