[Bug]: [Nightly]Hybrid search results using RRFRanker are far from the theoretical value
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: 2586c2f
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): kafka
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
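Hybrid search with RRFRanker returns scores that are far from the theoretical RRF values. In the nightly run, test_hybrid_search_RRFRanker_different_k fails: the recomputed baseline score is 0.5, but hybrid search returned 0.3333333432674408, a gap well above the 0.01 tolerance.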
[2024-04-25T20:51:41.207Z]     @pytest.mark.tags(CaseLabel.L2)
[2024-04-25T20:51:41.207Z]     @pytest.mark.parametrize("k", [1, 60, 1000, 16383])
[2024-04-25T20:51:41.207Z]     @pytest.mark.parametrize("offset", [0, 1, 5])
[2024-04-25T20:51:41.207Z]     def test_hybrid_search_RRFRanker_different_k(self, dim, auto_id, is_flush, enable_dynamic_field, k, offset):
[2024-04-25T20:51:41.207Z]         """
[2024-04-25T20:51:41.207Z]         target: test hybrid search normal case
[2024-04-25T20:51:41.207Z]         method: create connection, collection, insert and search.
[2024-04-25T20:51:41.207Z]                 Note: here the result check is through comparing the score, the ids could not be compared
[2024-04-25T20:51:41.207Z]                 because the high probability of the same score, then the id is not fixed in the range of
[2024-04-25T20:51:41.207Z]                 the same score
[2024-04-25T20:51:41.207Z]         expected: hybrid search successfully with limit(topK)
[2024-04-25T20:51:41.207Z]         """
[2024-04-25T20:51:41.207Z]         # 1. initialize collection with data
[2024-04-25T20:51:41.207Z]         collection_w, _, _, insert_ids, time_stamp = \
[2024-04-25T20:51:41.207Z]             self.init_collection_general(prefix, True, auto_id=auto_id, dim=dim, is_flush=is_flush,
[2024-04-25T20:51:41.207Z]                                          enable_dynamic_field=False, multiple_dim_array=[dim, dim])[0:5]
[2024-04-25T20:51:41.207Z]         # 2. extract vector field name
[2024-04-25T20:51:41.207Z]         vector_name_list = cf.extract_vector_field_name_list(collection_w)
[2024-04-25T20:51:41.207Z]         vector_name_list.append(ct.default_float_vec_field_name)
[2024-04-25T20:51:41.207Z]         # 3. prepare search params for each vector field
[2024-04-25T20:51:41.207Z]         req_list = []
[2024-04-25T20:51:41.207Z]         search_res_dict_array = []
[2024-04-25T20:51:41.207Z]         for i in range(len(vector_name_list)):
[2024-04-25T20:51:41.207Z]             vectors = [[random.random() for _ in range(dim)] for _ in range(1)]
[2024-04-25T20:51:41.207Z]             search_res_dict = {}
[2024-04-25T20:51:41.207Z]             search_param = {
[2024-04-25T20:51:41.207Z]                 "data": vectors,
[2024-04-25T20:51:41.207Z]                 "anns_field": vector_name_list[i],
[2024-04-25T20:51:41.207Z]                 "param": {"metric_type": "COSINE"},
[2024-04-25T20:51:41.207Z]                 "limit": default_limit,
[2024-04-25T20:51:41.207Z]                 "expr": "int64 > 0"}
[2024-04-25T20:51:41.207Z]             req = AnnSearchRequest(**search_param)
[2024-04-25T20:51:41.207Z]             req_list.append(req)
[2024-04-25T20:51:41.207Z]             # search for get the base line of hybrid_search
[2024-04-25T20:51:41.207Z]             search_res = collection_w.search(vectors[:1], vector_name_list[i],
[2024-04-25T20:51:41.207Z]                                              default_search_params, default_limit,
[2024-04-25T20:51:41.207Z]                                              default_search_exp, offset=0,
[2024-04-25T20:51:41.207Z]                                              check_task=CheckTasks.check_search_results,
[2024-04-25T20:51:41.207Z]                                              check_items={"nq": 1,
[2024-04-25T20:51:41.207Z]                                                           "ids": insert_ids,
[2024-04-25T20:51:41.207Z]                                                           "limit": default_limit})[0]
[2024-04-25T20:51:41.207Z]             ids = search_res[0].ids
[2024-04-25T20:51:41.207Z]             for j in range(len(ids)):
[2024-04-25T20:51:41.207Z]                 search_res_dict[ids[j]] = 1/(j + k +1)
[2024-04-25T20:51:41.207Z]             search_res_dict_array.append(search_res_dict)
[2024-04-25T20:51:41.207Z]         # 4. calculate hybrid search base line for RRFRanker
[2024-04-25T20:51:41.207Z]         ids_answer, score_answer = cf.get_hybrid_search_base_results_rrf(search_res_dict_array)
[2024-04-25T20:51:41.207Z]         # 5. hybrid search
[2024-04-25T20:51:41.207Z]         hybrid_res = collection_w.hybrid_search(req_list, RRFRanker(k), default_limit,
[2024-04-25T20:51:41.207Z]                                                 offset=offset,
[2024-04-25T20:51:41.207Z]                                                 check_task=CheckTasks.check_search_results,
[2024-04-25T20:51:41.207Z]                                                 check_items={"nq": 1,
[2024-04-25T20:51:41.207Z]                                                              "ids": insert_ids,
[2024-04-25T20:51:41.207Z]                                                              "limit": default_limit})[0]
[2024-04-25T20:51:41.207Z]         # 6. compare results through the re-calculated distances
[2024-04-25T20:51:41.207Z]         for i in range(len(score_answer[:default_limit])):
[2024-04-25T20:51:41.207Z] >           assert score_answer[i] - hybrid_res[0].distances[i] < hybrid_search_epsilon
[2024-04-25T20:51:41.207Z] E           assert (0.5 - 0.3333333432674408) < 0.01
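For reference, the baseline arithmetic in the test can be reproduced without a Milvus instance. The sketch below is not the project's `cf.get_hybrid_search_base_results_rrf` helper. `rrf_baseline` is a hypothetical stand-in that applies the same `1/(j + k + 1)` weighting, and the ID lists are made-up toy data that assume three vector fields whose hits do not overlap. With k=1 (the failing case ID ends in `1-1`, which suggests k=1 and offset=1), a rank-0 hit scores 0.5 and a rank-1 hit scores 1/3, the two values in the failed assertion. The last part illustrates one possible reading, not a confirmed diagnosis: if the first `offset` fused hits are skipped on the server side while the baseline is not shifted, the same 0.5 versus 1/3 gap appears.

```python
from collections import defaultdict


def rrf_baseline(ranked_id_lists, k):
    """Hypothetical helper: fuse ranked ID lists with 1 / (rank + k + 1), rank 0-based."""
    fused = defaultdict(float)
    for ids in ranked_id_lists:
        for rank, doc_id in enumerate(ids):
            fused[doc_id] += 1 / (rank + k + 1)
    # Highest fused score first; sorted() is stable, so ties keep insertion order.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Toy data (made up): one ranked ID list per vector field, no overlapping IDs.
toy_lists = [[1, 2], [3, 4], [5, 6]]
baseline = rrf_baseline(toy_lists, k=1)
scores = [score for _, score in baseline]
print(scores)  # three rank-0 scores of 0.5, then three rank-1 scores of 1/3

# Assumption for illustration only: the returned hits start at baseline[offset].
offset = 1
shifted = scores[offset:]
epsilon = 0.01  # same tolerance as the failed assertion
mismatches = [i for i, (expected, got) in enumerate(zip(scores, shifted))
              if expected - got >= epsilon]
print(mismatches)  # positions where an unshifted baseline disagrees
```

If this reading holds, comparing `hybrid_res[0].distances[i]` with `score_answer[offset + i]` would be the consistent check. Whether the shift comes from the test or from the server is not established by this log.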
Expected Behavior
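The test passes: each hybrid search score is within hybrid_search_epsilon (0.01) of the recomputed RRF baseline score.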
Steps To Reproduce
No response
Milvus Log
- link: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/723/pipeline/179
- log: artifacts-milvus-distributed-kafka-nightly-723-pymilvus-e2e-logs.tar.gz
- failed time: [2024-04-25T20:44:06.750Z] [gw4] [ 92%] FAILED testcases/test_search.py::TestCollectionHybridSearchValid::test_hybrid_search_RRFRanker_different_k[32-False-False-True-1-1]
- collection: search_collection_a8TXXaUa
Anything else?
No response