milvus [Bug]: [Nightly]Hybrid search results using RRFRanker are far from the theoretical value

Open NicoYuan1986 opened this issue 9 months ago • 0 comments

Is there an existing issue for this?

[X] I have searched the existing issues

Environment

- Milvus version: 2586c2f
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): kafka
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

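Hybrid search results using RRFRanker are far from the theoretical value. In the nightly run, test_hybrid_search_RRFRanker_different_k failed: the recomputed baseline score was 0.5, but hybrid search returned 0.3333333432674408, a gap of about 0.167 against a tolerance (hybrid_search_epsilon) of 0.01.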

[2024-04-25T20:51:41.207Z]     @pytest.mark.tags(CaseLabel.L2)
[2024-04-25T20:51:41.207Z]     @pytest.mark.parametrize("k", [1, 60, 1000, 16383])
[2024-04-25T20:51:41.207Z]     @pytest.mark.parametrize("offset", [0, 1, 5])
[2024-04-25T20:51:41.207Z]     def test_hybrid_search_RRFRanker_different_k(self, dim, auto_id, is_flush, enable_dynamic_field, k, offset):
[2024-04-25T20:51:41.207Z]         """
[2024-04-25T20:51:41.207Z]         target: test hybrid search normal case
[2024-04-25T20:51:41.207Z]         method: create connection, collection, insert and search.
[2024-04-25T20:51:41.207Z]                 Note: here the result check is through comparing the score, the ids could not be compared
[2024-04-25T20:51:41.207Z]                 because the high probability of the same score, then the id is not fixed in the range of
[2024-04-25T20:51:41.207Z]                 the same score
[2024-04-25T20:51:41.207Z]         expected: hybrid search successfully with limit(topK)
[2024-04-25T20:51:41.207Z]         """
[2024-04-25T20:51:41.207Z]         # 1. initialize collection with data
[2024-04-25T20:51:41.207Z]         collection_w, _, _, insert_ids, time_stamp = \
[2024-04-25T20:51:41.207Z]             self.init_collection_general(prefix, True, auto_id=auto_id, dim=dim, is_flush=is_flush,
[2024-04-25T20:51:41.207Z]                                          enable_dynamic_field=False, multiple_dim_array=[dim, dim])[0:5]
[2024-04-25T20:51:41.207Z]         # 2. extract vector field name
[2024-04-25T20:51:41.207Z]         vector_name_list = cf.extract_vector_field_name_list(collection_w)
[2024-04-25T20:51:41.207Z]         vector_name_list.append(ct.default_float_vec_field_name)
[2024-04-25T20:51:41.207Z]         # 3. prepare search params for each vector field
[2024-04-25T20:51:41.207Z]         req_list = []
[2024-04-25T20:51:41.207Z]         search_res_dict_array = []
[2024-04-25T20:51:41.207Z]         for i in range(len(vector_name_list)):
[2024-04-25T20:51:41.207Z]             vectors = [[random.random() for _ in range(dim)] for _ in range(1)]
[2024-04-25T20:51:41.207Z]             search_res_dict = {}
[2024-04-25T20:51:41.207Z]             search_param = {
[2024-04-25T20:51:41.207Z]                 "data": vectors,
[2024-04-25T20:51:41.207Z]                 "anns_field": vector_name_list[i],
[2024-04-25T20:51:41.207Z]                 "param": {"metric_type": "COSINE"},
[2024-04-25T20:51:41.207Z]                 "limit": default_limit,
[2024-04-25T20:51:41.207Z]                 "expr": "int64 > 0"}
[2024-04-25T20:51:41.207Z]             req = AnnSearchRequest(**search_param)
[2024-04-25T20:51:41.207Z]             req_list.append(req)
[2024-04-25T20:51:41.207Z]             # search for get the base line of hybrid_search
[2024-04-25T20:51:41.207Z]             search_res = collection_w.search(vectors[:1], vector_name_list[i],
[2024-04-25T20:51:41.207Z]                                              default_search_params, default_limit,
[2024-04-25T20:51:41.207Z]                                              default_search_exp, offset=0,
[2024-04-25T20:51:41.207Z]                                              check_task=CheckTasks.check_search_results,
[2024-04-25T20:51:41.207Z]                                              check_items={"nq": 1,
[2024-04-25T20:51:41.207Z]                                                           "ids": insert_ids,
[2024-04-25T20:51:41.207Z]                                                           "limit": default_limit})[0]
[2024-04-25T20:51:41.207Z]             ids = search_res[0].ids
[2024-04-25T20:51:41.207Z]             for j in range(len(ids)):
[2024-04-25T20:51:41.207Z]                 search_res_dict[ids[j]] = 1/(j + k +1)
[2024-04-25T20:51:41.207Z]             search_res_dict_array.append(search_res_dict)
[2024-04-25T20:51:41.207Z]         # 4. calculate hybrid search base line for RRFRanker
[2024-04-25T20:51:41.207Z]         ids_answer, score_answer = cf.get_hybrid_search_base_results_rrf(search_res_dict_array)
[2024-04-25T20:51:41.207Z]         # 5. hybrid search
[2024-04-25T20:51:41.207Z]         hybrid_res = collection_w.hybrid_search(req_list, RRFRanker(k), default_limit,
[2024-04-25T20:51:41.207Z]                                                 offset=offset,
[2024-04-25T20:51:41.207Z]                                                 check_task=CheckTasks.check_search_results,
[2024-04-25T20:51:41.207Z]                                                 check_items={"nq": 1,
[2024-04-25T20:51:41.207Z]                                                              "ids": insert_ids,
[2024-04-25T20:51:41.207Z]                                                              "limit": default_limit})[0]
[2024-04-25T20:51:41.207Z]         # 6. compare results through the re-calculated distances
[2024-04-25T20:51:41.207Z]         for i in range(len(score_answer[:default_limit])):
[2024-04-25T20:51:41.207Z] >           assert score_answer[i] - hybrid_res[0].distances[i] < hybrid_search_epsilon
[2024-04-25T20:51:41.207Z] E           assert (0.5 - 0.3333333432674408) < 0.01
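
For reference, the baseline the test builds can be sketched as below. This is a minimal illustration, not the actual `cf.get_hybrid_search_base_results_rrf` helper (its body is not shown in the log). The names `rrf_baseline` and `as_float32` and the sample ids are hypothetical. The only part taken from the test is the per-hit term `1/(j + k + 1)` with a 0-based rank `j`; summing that term across result lists is the usual reciprocal rank fusion rule and is assumed here.

```python
import struct
from collections import defaultdict


def rrf_baseline(ranked_id_lists, k):
    """Fuse ranked id lists with reciprocal rank fusion.

    Each hit contributes 1 / (rank + k + 1), rank being 0-based,
    which is the same term the test stores in search_res_dict.
    """
    scores = defaultdict(float)
    for ids in ranked_id_lists:
        for rank, doc_id in enumerate(ids):
            scores[doc_id] += 1.0 / (rank + k + 1)
    # Highest fused score first; ties keep their insertion order.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def as_float32(x):
    """Round a Python float to float32 precision and back."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Hypothetical ids: two result lists with no id in common.
fused = rrf_baseline([[10, 11, 12], [20, 21, 22]], k=1)
for doc_id, score in fused:
    print(doc_id, score, as_float32(score))
```

With `k=1`, an id ranked first in exactly one list scores 1/2 = 0.5, and an id ranked second in exactly one list scores 1/3, which reads 0.3333333432674408 once rounded to float32. Those are the two values in the failed assertion, so the returned score matches the next rank's term rather than an arbitrary number. The sketch only shows that the values are consistent with neighbouring ranks; it does not establish why the ranks differ.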

Expected Behavior

The test passes: hybrid search scores match the recomputed RRF baseline within the 0.01 tolerance.

Steps To Reproduce

No response

Milvus Log

link: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI/detail/master/723/pipeline/179
log: artifacts-milvus-distributed-kafka-nightly-723-pymilvus-e2e-logs.tar.gz
failed time: [2024-04-25T20:44:06.750Z] [gw4] [ 92%] FAILED testcases/test_search.py::TestCollectionHybridSearchValid::test_hybrid_search_RRFRanker_different_k[32-False-False-True-1-1]
collection: search_collection_a8TXXaUa

Anything else?

No response

Apr 26 '24 08:04 NicoYuan1986
