[Bug]: Range search returns no results when used with expr
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: v2.3.3
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.3.3.post1.dev4
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
Range search returns no results when used with expr. On the same collection:

param = {"metric_type": "COSINE", "params": {"search_list": 10000}}

expr: 0 <= id < 10000, returns 10000 results
expr: 10000 <= id < 20000, returns 10000 results
expr: 20000 <= id < 30000, returns 10000 results
expr: 30000 <= id < 40000, returns 10000 results
expr: 40000 <= id < 50000, returns 10000 results
...

The search returns 10000000 results in total.

param = {"metric_type": "COSINE", "params": {"search_list": 10000, "radius": -1, "range_filter": 1}}

expr: 0 <= id < 10000, returns 0 results
expr: 10000 <= id < 20000, returns 0 results
expr: 20000 <= id < 30000, returns 0 results
expr: 30000 <= id < 40000, returns 0 results
expr: 40000 <= id < 50000, returns 0 results
expr: 50000 <= id < 60000, returns 0 results
expr: 60000 <= id < 70000, returns 0 results
expr: 70000 <= id < 80000, returns 0 results
expr: 80000 <= id < 90000, returns 0 results
expr: 90000 <= id < 100000, returns 0 results
...

The range search returns 0 results in total.

If I don't use expr, the range search result is not empty.
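For context, a local sketch with numpy (not Milvus itself): every valid COSINE distance lies in [-1.0, 1.0], so the window radius=-1, range_filter=1 should admit every hit that the plain search returns, with or without expr.

```python
import numpy as np

rng = np.random.default_rng(42)
base = rng.random((1000, 64))   # stand-in for the collection's vectors
query = rng.random(64)

# COSINE distances of all vectors to the query; each lies in [-1.0, 1.0].
dists = (base @ query) / (np.linalg.norm(base, axis=1) * np.linalg.norm(query))

# Range search keeps hits with radius < d <= range_filter; the bounds used
# in the report, (-1, 1], cover the whole valid COSINE range.
radius, range_filter = -1.0, 1.0
in_range = (dists > radius) & (dists <= range_filter)
assert in_range.all()
```

So the empty result set cannot be explained by the range bounds themselves.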
Expected Behavior
Range search should work correctly when combined with expr.
Steps To Reproduce
No response
Milvus Log
No response
Anything else?
No response
Reproduced on the latest 2.3.x version: 2.3-20231129-a3aceb97.
from pymilvus import CollectionSchema, FieldSchema, Collection, connections, DataType

import numpy as np
import random

nb = 20000
nq = 1
dim = 64

# build the collection
int64_field = FieldSchema(name="int64", dtype=DataType.INT64, is_primary=True)
float_field = FieldSchema(name="float", dtype=DataType.FLOAT)
float_vector = FieldSchema(name="float_vector", dtype=DataType.FLOAT_VECTOR, dim=dim)
schema = CollectionSchema(fields=[int64_field, float_field, float_vector])
connections.connect(host="", port="19530")
collection = Collection("test_diskann", schema=schema)

vectors = [[random.random() for _ in range(dim)] for _ in range(nb)]
collection.insert([[i for i in range(nb)], [np.float32(i) for i in range(nb)], vectors])
collection.flush()

index = {"index_type": "DISKANN", "metric_type": "COSINE", "params": {}}
collection.create_index("float_vector", index)
collection.load()

limit = 1000

# plain search with expr: returns results
search_params = {"metric_type": "COSINE", "params": {"search_list": 1000}}
expr = "0 <= int64 < 1000"
res = collection.search(vectors[:nq], "float_vector", search_params, limit, expr)
print(len(res[0]))

# range search with the same expr: returns no results
search_params = {"metric_type": "COSINE", "params": {"search_list": 1000, "radius": -1, "range_filter": 1}}
res = collection.search(vectors[:nq], "float_vector", search_params, limit, expr)
print(len(res[0]))
The same happens with the SCANN index:
>>>
>>> limit = 1000
>>> search_params = {"metric_type": "COSINE", "params": {"nprobe": 1000, "reorder_k": 1000}}
>>> expr = "0 <= int64 < 1000"
>>> res = collection.search(vectors[:nq], "float_vector", search_params, limit, expr)
>>> print(len(res[0]))
1000
>>>
>>> search_params = {"metric_type": "COSINE", "params": {"nprobe": 1000, "reorder_k": 1000, "radius": -1, "range_filter": 1}}
>>> res = collection.search(vectors[:nq], "float_vector", search_params, limit, expr)
>>> print(len(res[0]))
4
/assign @liliu-z
It seems to reproduce on HNSW, DISKANN and SCANN; please help to check. Similar to #28821.
/unassign
/assign @cydrain
may have something to do with #28810
The COSINE metric type should produce distances in the range [-1.0, 1.0], but DISKANN with COSINE returns a distance of 2.0; need @hhy3's help to check.
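As a sanity check (a minimal numpy sketch, independent of Milvus): cosine similarity is mathematically bounded to [-1.0, 1.0], so a reported distance near 2.0 cannot be a true cosine value; it looks like an unconverted inner-product or L2-style distance.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity of two vectors; always within [-1.0, 1.0]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
a, b = rng.random(64), rng.random(64)

sim = cosine_similarity(a, b)
assert -1.0 <= sim <= 1.0                          # bound holds for any pair
assert abs(cosine_similarity(a, a) - 1.0) < 1e-9   # identical vectors -> 1.0
```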
reproduce script:
from pymilvus import CollectionSchema, FieldSchema, Collection, connections, DataType

import numpy as np
import random

nb = 20000
nq = 1
dim = 64
INDEX_TYPE_ = "DISKANN"
METRIC_TYPE_ = "COSINE"

def main():
    int64_field = FieldSchema(name="int64", dtype=DataType.INT64, is_primary=True)
    float_field = FieldSchema(name="float", dtype=DataType.FLOAT)
    float_vector = FieldSchema(name="float_vector", dtype=DataType.FLOAT_VECTOR, dim=dim)
    schema = CollectionSchema(fields=[int64_field, float_field, float_vector])
    connections.connect(host="", port="19530")
    collection = Collection("test_diskann", schema=schema)

    vectors = [[random.random() for _ in range(dim)] for _ in range(nb)]
    collection.insert([[i for i in range(nb)], [np.float32(i) for i in range(nb)], vectors])
    collection.flush()

    index = {"index_type": INDEX_TYPE_, "metric_type": METRIC_TYPE_, "params": {"efConstruction": 360, "M": 30}}
    collection.create_index("float_vector", index)
    collection.load()

    limit = 1000

    # plain search with expr
    search_params = {"metric_type": METRIC_TYPE_, "params": {"search_list": 1000, "ef": 1000}}
    expr = "0 <= int64 < 1000"
    res = collection.search(vectors[:nq], "float_vector", search_params, limit, expr)
    print(len(res[0]))
    print(res[0][0])
    print(res[0][-1])

    # range search with the same expr
    search_params = {"metric_type": METRIC_TYPE_, "params": {"search_list": 1000, "radius": -1, "range_filter": 1}}
    res = collection.search(vectors[:nq], "float_vector", search_params, limit, expr)
    print(len(res[0]))

    collection.drop()

if __name__ == '__main__':
    main()
We can see a COSINE distance of ~1.9999998:

id: 0, distance: 1.9999998807907104, entity: {}
id: 17, distance: 1.5845264196395874, entity: {}
/assign @hhy3
Hi @NicoYuan1986, please check this issue again
Too slow. On master-20231213-fe1eeae2, search needs 8s but range search needs 1107s (18.45 min). Is that expected? @cydrain @liliu-z
Because of this, the search iterator will take a huge amount of time: every page will take 18 min.
For the search iterator, the risk of timeout increases.
Hi @NicoYuan1986, it does not make sense to set "radius" to -1; with this param, a DISKANN range search will return almost all data, like brute force, so the performance is bad. Please set the radius to a meaningful value, then compare its performance with search.
For range search: change the value of "radius" and the performance gets better.
Search: 2.9s; range search: 0.98s.
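A sketch of what a meaningful radius might look like for COSINE, where a larger distance means more similar (the 0.5 below is only illustrative; the right value depends on the data distribution):

```python
# Hypothetical range-search params for COSINE: results satisfy
# radius < distance <= range_filter, i.e. only reasonably similar hits.
search_params = {
    "metric_type": "COSINE",
    "params": {
        "search_list": 1000,
        "radius": 0.5,        # exclusive lower bound on similarity
        "range_filter": 1.0,  # inclusive upper bound (max possible COSINE)
    },
}
assert search_params["params"]["radius"] < search_params["params"]["range_filter"]
```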
However, for the search iterator it is still not good.
The average time for the first 10 pages is 437s. That is unreasonable. @cydrain @yanliang567
that doesn't make sense to me?
/assign MrPresent-Han
page 1, 10000 results, cost 0.0047s
page 2, 10000 results, cost 1553.5283s
page 3, 10000 results, cost 15.8129s
page 4, 10000 results, cost 0.0063s
page 5, 10000 results, cost 10.1714s
page 6, 10000 results, cost 15.517s
page 7, 10000 results, cost 0.0067s
page 8, 10000 results, cost 10.5434s
page 9, 10000 results, cost 22.8054s
page 10, 10000 results, cost 24.5598s
page 11, 10000 results, cost 0.0055s
page 12, 10000 results, cost 24.049s
page 13, 10000 results, cost 23.2669s
page 14, 10000 results, cost 20.3911s
page 15, 10000 results, cost 27.6964s
page 16, 10000 results, cost 30.8896s
page 17, 10000 results, cost 30.6418s
page 18, 10000 results, cost 28.0768s
page 19, 10000 results, cost 65.4702s
page 20, 10000 results, cost 57.9992s
page 21, 10000 results, cost 62.9765s
page 22, 10000 results, cost 0.0053s
page 23, 10000 results, cost 63.2601s
page 24, 10000 results, cost 60.7237s
page 25, 10000 results, cost 62.3825s
page 26, 10000 results, cost 61.3414s
page 27, 10000 results, cost 63.2385s
page 28, 10000 results, cost 62.8683s
page 29, 10000 results, cost 64.5355s
page 30, 10000 results, cost 63.026s
page 31, 10000 results, cost 66.8551s
page 32, 10000 results, cost 86.7241s
page 33, 10000 results, cost 76.507s
page 34, 10000 results, cost 55.3079s
page 35, 10000 results, cost 66.416s
page 36, 10000 results, cost 66.4382s
page 37, 10000 results, cost 66.787s
page 38, 10000 results, cost 66.503s
page 39, 10000 results, cost 66.2221s
page 40, 10000 results, cost 67.2244s
page 41, 10000 results, cost 68.0742s
page 42, 10000 results, cost 68.237s
page 43, 10000 results, cost 68.6832s
page 44, 10000 results, cost 68.9287s
page 45, 10000 results, cost 71.0643s
page 46, 10000 results, cost 70.8987s
page 47, 10000 results, cost 69.6963s
page 48, 10000 results, cost 71.19s
page 49, 10000 results, cost 70.5984s
page 50, 10000 results, cost 71.1868s
@MrPresent-Han Seems much better than before. version: pymilvus==2.4.0rc7
@cydrain
could it be much faster if we change it to cache the iterator
/assign @cydrain
Is this ready for verification?
/unassign @MrPresent-Han
@cydrain
could it be much faster if we change it to cache the iterator
Hi @xiaofan-luan, this iterator is implemented in pymilvus; it uses the range search result of the previous run to update the radius and range_filter for the current run, so we cannot use a cache here.
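A simplified local model of that paging scheme (illustrative only, not the actual pymilvus implementation): each page re-runs the range search with range_filter moved down to the last distance of the previous page, which is why results cannot come from a cache.

```python
def iterate_pages(all_distances, radius, page_size):
    """Model radius/range_filter paging for a COSINE-like metric
    (larger distance = more similar). 'all_distances' stands in for
    distances a real range search would compute server-side; ties are
    ignored here (the real iterator deduplicates by ID instead)."""
    candidates = sorted(all_distances, reverse=True)  # best hits first
    range_filter = None  # first page: no upper bound on distance
    while True:
        page = [d for d in candidates
                if d > radius and (range_filter is None or d < range_filter)]
        page = page[:page_size]
        if not page:
            return
        yield page
        range_filter = page[-1]  # next run resumes strictly below last hit

pages = list(iterate_pages([0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
                           radius=0.45, page_size=2))
assert pages == [[0.9, 0.8], [0.7, 0.6], [0.5]]
```

Each page is a fresh, full range search on the server, which is consistent with the per-page latencies reported above.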
/assign @MrPresent-Han
Any ideas?
/unassign @cydrain
no more comments
@cydrain @xiaofan-luan @MrPresent-Han So the current behavior above is the best you can offer for now?
@cydrain @MrPresent-Han
Why does it take that long for only 10000 rows? "page 2, 10000 results, cost 1553.5283s"
I'd think 1s or a few hundred milliseconds would be reasonable. Is this the right benchmark result?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
keep it