k-NN
[BUG] Knn Search Fails When Repeatedly Deleting and Inserting Vectors.
Describe the bug
When performing k-NN search queries on an index multiple times, with documents being deleted and inserted, the search occasionally returns no hits.
To Reproduce
1. Create a k-NN index.
2. Generate a vector to be used during the tests.
3. Add a document with the vector and refresh the index.
4. Search for that vector and retrieve the document ID.
5. Delete the document with the retrieved ID.
6. Repeat steps 3-5 until the search returns no hits.
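As a compact reference for the search step, the nested k-NN query body used in the reproduction can be built with a small helper. This is a sketch using only the standard library; the `__chunks` path and `__vector_marqo_knn_field` field name come from the reproduction script attached to this issue:

```python
def build_nested_knn_query(vector, k=100, size=100):
    """Build the nested k-NN search body used in the reproduction steps.

    Field names (__chunks, __vector_marqo_knn_field, __field_content,
    __field_name) follow the script attached to this issue.
    """
    knn_query = {
        "knn": {
            "__chunks.__vector_marqo_knn_field": {"vector": vector, "k": k}
        }
    }
    return {
        "size": size,
        "from": 0,
        # Exclude the raw vector from the returned _source
        "_source": {"excludes": ["__chunks.__vector_marqo_knn_field"]},
        "query": {
            "nested": {
                "path": "__chunks",
                "inner_hits": {
                    "_source": {
                        "includes": [
                            "__chunks.__field_content",
                            "__chunks.__field_name",
                        ]
                    }
                },
                "query": knn_query,
            }
        },
    }

# Usage: POST this dict as the body of a _search request
body = build_nested_knn_query([0.1, 0.2, 0.3])
```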
Expected behavior
The search query should consistently return hits as long as there are documents in the index.
Host/Environment (please complete the following information):
- Operating System: M1 Mac (Also occurred on Linux/ARM64)
- OpenSearch Version: 2.7.0/2.9.0
Additional context
Python script that reproduces the issue:
```python
import random

import requests

# Disable insecure request warning
requests.packages.urllib3.disable_warnings(requests.packages.urllib3.exceptions.InsecureRequestWarning)

# OpenSearch cluster configuration
opensearch_url = 'https://admin:admin@localhost:9200'
opensearch_index = ''.join(random.choice('abcdefghijklmnopqrstuvwxyz') for _ in range(10))
vec = [random.random() for _ in range(384)]

# Helper to send requests to OpenSearch
def opensearch_request(method, endpoint, data=None):
    url = f'{opensearch_url}/{opensearch_index}/{endpoint}'
    headers = {'Content-Type': 'application/json'}
    response = requests.request(method, url, json=data, headers=headers, verify=False)
    return response

# Create the OpenSearch index
index_mapping = {...}  # Your index mapping here
print(opensearch_request('PUT', '', index_mapping).text)

# Main loop
for i in range(1000000):
    doc = {
        '__chunks': {
            '__field_name': f'field_{random.randint(1, 100)}',
            '__field_content': f'content_{random.randint(1, 100)}',
            '__vector_marqo_knn_field': vec
        }
    }
    opensearch_request('POST', '_doc', doc)
    opensearch_request('POST', '_refresh')

    knn_query = {
        "knn": {
            "__chunks.__vector_marqo_knn_field": {
                "vector": vec,
                "k": 100
            }
        }
    }
    full_knn_query = {
        "size": 100,
        "from": 0,
        "_source": {  # Exclude the vector field from the response
            "excludes": ["__chunks.__vector_marqo_knn_field"]
        },
        "query": {
            "nested": {
                "path": "__chunks",
                "inner_hits": {
                    "_source": {
                        "includes": ["__chunks.__field_content", "__chunks.__field_name"]
                    }
                },
                "query": knn_query
            }
        }
    }
    search_results = opensearch_request('POST', '_search', full_knn_query).json()
    doc_ids = [hit['_id'] for hit in search_results.get('hits', {}).get('hits', [])]
    if doc_ids:
        print(f'Iteration {i}: {len(doc_ids)} results found. Deleting them.')
        # The Document API deletes one ID per request, so delete each hit individually
        for doc_id in doc_ids:
            opensearch_request('DELETE', f'_doc/{doc_id}')
    else:
        print(f'Iteration {i}: No results found.')
        break

to_delete_index = input("Delete the index? (y/n): ")
if to_delete_index.lower() == "y":
    opensearch_request('DELETE', '')
print("Script completed.")
```
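A note on the deletion step: a `DELETE _doc/{id1,id2}` call with comma-joined IDs addresses only a single (nonexistent) document ID. If the intent was to delete every hit in one round trip, a Bulk API payload could be used instead. The sketch below only builds the NDJSON body (standard library only); `opensearch_url` and `opensearch_index` are assumed to come from the script above:

```python
import json

def build_bulk_delete_body(doc_ids):
    """Build an NDJSON Bulk API body that deletes each document ID.

    Each action is a single-line JSON object, and the body must end
    with a trailing newline per the Bulk API format.
    """
    lines = [json.dumps({"delete": {"_id": doc_id}}) for doc_id in doc_ids]
    return "\n".join(lines) + "\n"

# The body would then be POSTed to f"{opensearch_url}/{opensearch_index}/_bulk"
# with the Content-Type: application/x-ndjson header (e.g. via requests.post,
# as in the script above).
body = build_bulk_delete_body(["id-1", "id-2"])
```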
The iteration at which the search fails is also consistent regardless of which vector is used, but the vector's dimensionality affects when it fails:
- 384 dimensions: 145th iteration
- 512 dimensions: 109th iteration
- 121 dimensions: 69th iteration
- 728 dimensions: 109th iteration
- 1024 dimensions: 82nd iteration
The same results are produced with ViT-L/14 and hf/all_datasets_v4_MiniLM-L6.
Does force-merging after each deletion help?
Unfortunately, force-merging didn't help.
Moved this to the k-nn repo.
Index mapping that was used (as a Python dict, which the script passes via `json=`):
```python
{
    "settings": {
        "index": {
            "knn": True,
            "knn.algo_param.ef_search": 100,
            "refresh_interval": "1s",
            "store.hybrid.mmap.extensions": [
                "nvd", "dvd", "tim", "tip", "dim", "kdd", "kdi", "cfs", "doc", "vec", "vex"
            ]
        },
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "_meta": {
            "media_type": "text",
            "index_settings": {
                "index_defaults": {
                    "treat_urls_and_pointers_as_images": False,
                    "model": "hf/all_datasets_v4_MiniLM-L6",
                    "normalize_embeddings": True,
                    "text_preprocessing": {
                        "split_length": 2,
                        "split_overlap": 0,
                        "split_method": "sentence"
                    },
                    "image_preprocessing": {
                        "patch_method": None
                    },
                    "ann_parameters": {
                        "name": "hnsw",
                        "space_type": "cosinesimil",
                        "engine": "lucene",
                        "parameters": {
                            "ef_construction": 128,
                            "m": 16
                        }
                    }
                },
                "number_of_shards": 1,
                "number_of_replicas": 0
            },
            "model": "hf/all_datasets_v4_MiniLM-L6"
        },
        "dynamic_templates": [
            {
                "strings": {
                    "match_mapping_type": "string",
                    "mapping": {
                        "type": "text"
                    }
                }
            }
        ],
        "properties": {
            "__chunks": {
                "type": "nested",
                "properties": {
                    "__field_name": {
                        "type": "keyword"
                    },
                    "__field_content": {
                        "type": "text"
                    },
                    "__vector_marqo_knn_field": {
                        "type": "knn_vector",
                        "dimension": 384,
                        "method": {
                            "name": "hnsw",
                            "space_type": "cosinesimil",
                            "engine": "lucene",
                            "parameters": {
                                "ef_construction": 128,
                                "m": 16
                            }
                        }
                    }
                }
            }
        }
    }
}
```
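One footnote on the mapping above: it is written with Python literals (`True`/`False`/`None`), which works because the script sends it through requests' `json=` parameter. To use it with curl or the REST API directly, it must first be serialized to JSON, where those literals become `true`/`false`/`null`. A minimal illustration with a fragment of the mapping:

```python
import json

# A fragment of the mapping above, using Python literals
mapping_fragment = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "_meta": {
            "index_settings": {
                "index_defaults": {"image_preprocessing": {"patch_method": None}}
            }
        }
    },
}

# json.dumps converts the Python literals to their JSON equivalents
as_json = json.dumps(mapping_fragment)
assert '"knn": true' in as_json           # True -> true
assert '"patch_method": null' in as_json  # None -> null
```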
@danyilq can you add details on the number of nodes and the RAM of each node too, to help us better understand the issue?