[BUG] Knn Search Fails When Repeatedly Deleting and Inserting Vectors.

Open danyilq opened this issue 2 years ago • 6 comments

Describe the bug When k-NN search queries are run repeatedly against an index while documents are being deleted and inserted, the search occasionally returns no hits.

To Reproduce

  1. Create a Knn index.
  2. Generate a vector to be used during tests.
  3. Add a document with the vector and refresh the index.
  4. Search for that vector and retrieve the document ID.
  5. Delete the document with the retrieved ID.
  6. Repeat steps 3-5 until the search returns no hits.

Expected behavior The search query should consistently return hits as long as there are documents in the index.

Host/Environment (please complete the following information):

  • Operating System: M1 Mac (Also occurred on Linux/ARM64)
  • OpenSearch Version: 2.7.0/2.9.0

Additional context Python script that reproduces issue:

import random

import requests

# Disable insecure request warning
requests.packages.urllib3.disable_warnings(requests.packages.urllib3.exceptions.InsecureRequestWarning)

# OpenSearch cluster configuration
opensearch_url = 'https://admin:admin@localhost:9200'
opensearch_index = ''.join(random.choice('abcdefghijklmnopqrstuvwxyz') for _ in range(10))
vec = [random.random() for _ in range(384)]


# Function to interact with OpenSearch
def opensearch_request(method, endpoint, data=None):
    url = f'{opensearch_url}/{opensearch_index}/{endpoint}'
    headers = {'Content-Type': 'application/json'}
    verify = False
    response = requests.request(method, url, json=data, headers=headers, verify=verify)
    return response

# Create the OpenSearch index


index_mapping = {...}  # Your index mapping here

print(opensearch_request('PUT', '', index_mapping).text)

# Main loop
for i in range(1000000):
    doc = {
        '__chunks': {
            '__field_name': f'field_{random.randint(1, 100)}',
            '__field_content': f'content_{random.randint(1, 100)}',
            '__vector_marqo_knn_field': vec
        }
    }

    opensearch_request('POST', '_doc', doc)
    opensearch_request('POST', '_refresh')
    knn_query = {
        "knn": {
            "__chunks.__vector_marqo_knn_field": {
                "vector": vec,
                "k": 100
            }
        }
    }

    full_knn_query = {
        "size": 100,
        "from": 0,
        "_source": {  # Exclude the vector field from the snippet
            "exclude": ["__chunks.__vector_marqo_knn_field"]
        },
        "query": {
            "nested": {
                "path": "__chunks",
                "inner_hits": {
                    "_source": {
                        "include": ["__chunks.__field_content", "__chunks.__field_name"]
                    }
                },
                "query": knn_query
            }
        }
    }

    search_results = opensearch_request('POST', '_search', full_knn_query).json()
    doc_ids = [hit['_id'] for hit in search_results.get('hits', {}).get('hits', [])]

    if doc_ids:
        print(f'Iteration {i}: {len(doc_ids)} results found. Deleting them.')
        # The _doc endpoint deletes one document per request, so delete each ID separately
        for doc_id in doc_ids:
            opensearch_request('DELETE', f'_doc/{doc_id}')
    else:
        print(f'Iteration {i}: No results found.')
        break

to_delete_index = input("Delete the index? (y/n): ")
if to_delete_index.lower() == "y":
    opensearch_request('DELETE', '')

print("Script completed.")

danyilq, Aug 30 '23 06:08

The iteration on which the search fails is consistent regardless of which vector is used, but the dimensionality of the vector changes which iteration that is:

  • 384 dimensions: 145th iteration
  • 512 dimensions: 109th iteration
  • 121 dimensions: 69th iteration
  • 728 dimensions: 109th iteration
  • 1024 dimensions: 82nd iteration

Same results are produced with ViT-L/14 and hf/all_datasets_v4_MiniLM-L6

danyilq, Aug 30 '23 06:08

Does force merging after each deletion help?

pandu-k, Aug 31 '23 01:08

Unfortunately, force merging didn't help.

danyilq, Aug 31 '23 08:08
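
For anyone retrying the force-merge suggestion, the request could be built roughly as follows. This is only a sketch: `force_merge_url` is a hypothetical helper, the base URL and index name are placeholders, and `max_num_segments=1` is an assumption about how the merge was configured, since the thread does not say which parameters were used.

```python
from urllib.parse import urlencode


def force_merge_url(base_url, index, max_num_segments=1):
    """Build the URL for a force-merge request against a single index."""
    query = urlencode({"max_num_segments": max_num_segments})
    return f"{base_url.rstrip('/')}/{index}/_forcemerge?{query}"


# Placeholder host and index name; the request itself would be sent with POST.
url = force_merge_url("https://localhost:9200", "my-knn-index")
print(url)

# With the reproduction script's helper, the equivalent call after each delete
# would be: opensearch_request('POST', '_forcemerge?max_num_segments=1')
```
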

Moved this to the k-nn repo.

dblock, Aug 31 '23 13:08

Index mapping that was used:

{
    "settings": {
        "index": {
            "knn": True,
            "knn.algo_param.ef_search": 100,
            "refresh_interval": "1s",
            "store.hybrid.mmap.extensions": [
                "nvd", "dvd", "tim", "tip", "dim", "kdd", "kdi", "cfs", "doc", "vec", "vex"
            ]
        },
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "_meta": {
            "media_type": "text",
            "index_settings": {
                "index_defaults": {
                    "treat_urls_and_pointers_as_images": False,
                    "model": "hf/all_datasets_v4_MiniLM-L6",
                    "normalize_embeddings": True,
                    "text_preprocessing": {
                        "split_length": 2,
                        "split_overlap": 0,
                        "split_method": "sentence"
                    },
                    "image_preprocessing": {
                        "patch_method": None
                    },
                    "ann_parameters": {
                        "name": "hnsw",
                        "space_type": "cosinesimil",
                        "engine": "lucene",
                        "parameters": {
                            "ef_construction": 128,
                            "m": 16
                        }
                    }
                },
                "number_of_shards": 1,
                "number_of_replicas": 0
            },
            "model": "hf/all_datasets_v4_MiniLM-L6"
        },
        "dynamic_templates": [
            {
                "strings": {
                    "match_mapping_type": "string",
                    "mapping": {
                        "type": "text"
                    }
                }
            }
        ],
        "properties": {
            "__chunks": {
                "type": "nested",
                "properties": {
                    "__field_name": {
                        "type": "keyword"
                    },
                    "__field_content": {
                        "type": "text"
                    },
                    "__vector_marqo_knn_field": {
                        "type": "knn_vector",
                        "dimension": 384,
                        "method": {
                            "name": "hnsw",
                            "space_type": "cosinesimil",
                            "engine": "lucene",
                            "parameters": {
                                "ef_construction": 128,
                                "m": 16
                            }
                        }
                    }
                }
            }
        }
    }
}

danyilq, Sep 19 '23 05:09
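
To rerun the reproduction at other dimensionalities, the `dimension` in the mapping has to match the length of the test vector. A minimal sketch, assuming a reduced version of the mapping above with the `_meta` block, dynamic templates, `refresh_interval` and `store.hybrid.mmap.extensions` left out (it is not confirmed that the reduced mapping still triggers the bug); `build_mapping` is a hypothetical helper, not part of the original script:

```python
import json


def build_mapping(dimension):
    """Return a reduced index mapping whose knn_vector field has `dimension` dims."""
    return {
        "settings": {
            "index": {"knn": True, "knn.algo_param.ef_search": 100},
            "number_of_shards": 1,
            "number_of_replicas": 0,
        },
        "mappings": {
            "properties": {
                "__chunks": {
                    "type": "nested",
                    "properties": {
                        "__field_name": {"type": "keyword"},
                        "__field_content": {"type": "text"},
                        "__vector_marqo_knn_field": {
                            "type": "knn_vector",
                            "dimension": dimension,
                            "method": {
                                "name": "hnsw",
                                "space_type": "cosinesimil",
                                "engine": "lucene",
                                "parameters": {"ef_construction": 128, "m": 16},
                            },
                        },
                    },
                }
            }
        },
    }


# json.dumps converts the Python literal True into the JSON literal true.
mapping_json = json.dumps(build_mapping(512))
print(mapping_json)
```

In the reproduction script, `index_mapping = build_mapping(len(vec))` would keep the mapping and the vector in sync.
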

@danyilq, can you also add details on the number of nodes and the RAM of each node, to help us better understand the issue?

navneet1v, Oct 31 '23 08:10
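
One way to collect the requested node details is the `_cat/nodes` API, which returns one row per node. The sketch below only builds the request URL; `cat_nodes_url` is a hypothetical helper, the host is a placeholder, and the column selection (`name`, `heap.max`, `ram.max`) is an assumption about which details are wanted.

```python
from urllib.parse import urlencode


def cat_nodes_url(base_url, columns=("name", "heap.max", "ram.max")):
    """Build a _cat/nodes URL that returns the chosen columns as JSON."""
    query = urlencode({"h": ",".join(columns), "format": "json"})
    return f"{base_url.rstrip('/')}/_cat/nodes?{query}"


# Placeholder host; send the URL with GET. The number of rows in the
# response is the number of nodes in the cluster.
print(cat_nodes_url("https://localhost:9200"))
```

The `opensearch_request` helper in the reproduction script always prefixes the index name, so this cluster-level URL would need to be requested directly, for example with `requests.get(url, verify=False)`.
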