
Issue with Setup Semantic Search Dataset & Indexing

Open · atbdeveloper opened this issue 1 year ago · 1 comment

I ran into an issue while indexing the ETH Denver 2024 dataset for semantic search.

```
19129 files in /home/tao/Development/openkaito/datasets/eth_denver_dataset
Index already exists: eth_denver
Number of docs in eth_denver: 19129 == total files 19129, no need to reindex docs
Indexing embeddings:   5%|███▌      | 1000/19129 [09:44<2:56:33, 1.71it/s]

Traceback (most recent call last):
  File "/home/tao/Development/openkaito/scripts/vector_index_eth_denver_dataset.py", line 186, in <module>
    indexing_embeddings(search_client)
  File "/home/tao/Development/openkaito/scripts/vector_index_eth_denver_dataset.py", line 115, in indexing_embeddings
    for doc in tqdm(
        helpers.scan(search_client, index=index_name),
        desc="Indexing embeddings",
        total=search_client.count(index=index_name)["count"],
  File "/home/tao/anaconda3/envs/openkaito/lib/python3.10/site-packages/tqdm/std.py", line 1181, in __iter__
    for obj in iterable:
  File "/home/tao/anaconda3/envs/openkaito/lib/python3.10/site-packages/elasticsearch/helpers/actions.py", line 755, in scan
    resp = scroll_client.scroll(
        scroll_id=scroll_id, scroll=scroll, **scroll_kwargs
    )
  File "/home/tao/anaconda3/envs/openkaito/lib/python3.10/site-packages/elasticsearch/_sync/client/utils.py", line 446, in wrapped
    return api(*args, **kwargs)
  File "/home/tao/anaconda3/envs/openkaito/lib/python3.10/site-packages/elasticsearch/_sync/client/__init__.py", line 3609, in scroll
    return self.perform_request(
        "POST",
        __path,
        params=__query,
  File "/home/tao/anaconda3/envs/openkaito/lib/python3.10/site-packages/elasticsearch/_sync/client/_base.py", line 271, in perform_request
    response = self._perform_request(
        method,
        path,
        params=params,
  File "/home/tao/anaconda3/envs/openkaito/lib/python3.10/site-packages/elasticsearch/_sync/client/_base.py", line 352, in _perform_request
    raise HTTP_EXCEPTIONS.get(meta.status, ApiError)(
        message=message, meta=meta, body=resp_body
    )

NotFoundError: NotFoundError(404, 'search_phase_execution_exception', 'No search context found for id [12]')
```

atbdeveloper avatar May 22 '24 00:05 atbdeveloper

This usually happens because one iteration of the loop takes longer than the Elasticsearch scroll keep-alive, so the scroll context expires before the next page is fetched. You can increase the `scroll` time in the `scan()` call (see https://elasticsearch-py.readthedocs.io/en/latest/helpers.html#scan:~:text=the%20search()%20api-,scroll,-(str)%20%E2%80%93%20Specify), or speed up the embedding step itself, e.g. via batch execution.
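A minimal sketch of both suggestions, assuming the same `search_client` and `index_name` as in the script, a hypothetical `get_embedding` function, and an `embedding`/`text` field layout (all assumptions, not the actual OpenKaito code): pass a longer `scroll` keep-alive to `helpers.scan()`, and write embeddings back with `helpers.bulk()` in batches instead of one update per document.

```python
from itertools import islice


def batched(iterable, n):
    """Yield successive lists of up to n items from iterable."""
    it = iter(iterable)
    while batch := list(islice(it, n)):
        yield batch


def indexing_embeddings(search_client, index_name, get_embedding, batch_size=100):
    """Sketch: scan with a longer scroll keep-alive, bulk-update embeddings.

    get_embedding and the "embedding"/"text" field names are hypothetical
    placeholders for the project's actual embedding helper and mapping.
    """
    from elasticsearch import helpers  # elasticsearch-py

    docs = helpers.scan(
        search_client,
        index=index_name,
        scroll="60m",  # keep the scroll context alive well past one batch's runtime
    )
    for batch in batched(docs, batch_size):
        actions = [
            {
                "_op_type": "update",
                "_index": index_name,
                "_id": doc["_id"],
                "doc": {"embedding": get_embedding(doc["_source"]["text"])},
            }
            for doc in batch
        ]
        helpers.bulk(search_client, actions)
```

Batching also shortens the gap between consecutive `scroll` calls, which is what causes the 404 in the first place.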

yang-han avatar May 23 '24 04:05 yang-han

Closing this issue, feel free to reopen it if you still have questions :)

yang-han avatar May 31 '24 04:05 yang-han