
[Bug]: Range search returns no results when used with expr

Open NicoYuan1986 opened this issue 1 year ago • 25 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: v2.3.3
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):     pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.3.3.post1.dev4
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Range search returns no results when used with expr. All searches below run against the same collection.

param = {"metric_type": "COSINE", "params": {"search_list": 10000}}

expr: 0 <= id < 10000, returns 10000 results
expr: 10000 <= id < 20000, returns 10000 results
expr: 20000 <= id < 30000, returns 10000 results
expr: 30000 <= id < 40000, returns 10000 results
expr: 40000 <= id < 50000, returns 10000 results
...

In total, search returns 10000000 results.

param = {"metric_type": "COSINE", "params": {"search_list": 10000, "radius": -1, "range_filter": 1}}

expr: 0 <= id < 10000, returns 0 results
expr: 10000 <= id < 20000, returns 0 results
expr: 20000 <= id < 30000, returns 0 results
expr: 30000 <= id < 40000, returns 0 results
expr: 40000 <= id < 50000, returns 0 results
expr: 50000 <= id < 60000, returns 0 results
expr: 60000 <= id < 70000, returns 0 results
expr: 70000 <= id < 80000, returns 0 results
expr: 80000 <= id < 90000, returns 0 results
expr: 90000 <= id < 100000, returns 0 results
...

In total, range search returns 0 results.

If I don't use expr, the result will not be empty for range search.
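
For reference, here is a minimal sketch of how the range bounds are assumed to apply for COSINE. It assumes the convention that, for similarity metrics, a hit must satisfy radius < similarity <= range_filter; the helper name `in_cosine_range` is hypothetical and not part of pymilvus. Under that assumption, radius=-1 with range_filter=1 admits essentially every valid cosine similarity, so an empty result is unexpected.

```python
def in_cosine_range(similarity, radius, range_filter):
    """Assumed COSINE range-search rule: radius < similarity <= range_filter."""
    return radius < similarity <= range_filter


# Illustrative similarity values, not taken from a real collection
samples = [-1.0, -0.5, 0.0, 0.75, 1.0]

# Bounds used in this issue: radius=-1, range_filter=1
passed = [s for s in samples if in_cosine_range(s, radius=-1, range_filter=1)]
print(passed)
```

Only a similarity of exactly -1.0 is excluded by these bounds, which is why the filter alone should not empty the result set.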

Expected Behavior

Resolve conflicts with expr

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

NicoYuan1986 avatar Nov 29 '23 07:11 NicoYuan1986

Reproduced on the latest 2.3.x build: 2.3-20231129-a3aceb97

from pymilvus import CollectionSchema, FieldSchema, Collection, connections, DataType, Partition, utility
import numpy as np
import random

nb = 20000
nq = 1
dim = 64
int64_field = FieldSchema(name="int64", dtype=DataType.INT64, is_primary=True)
float_field = FieldSchema(name="float", dtype=DataType.FLOAT)
float_vector = FieldSchema(name="float_vector", dtype=DataType.FLOAT_VECTOR, dim=dim)
schema = CollectionSchema(fields=[int64_field, float_field, float_vector])

connections.connect(host="", port="19530")
collection = Collection("test_diskann", schema=schema)

vectors = [[random.random() for _ in range(dim)] for _ in range(nb)]
collection.insert([[i for i in range(nb)], [np.float32(i) for i in range(nb)], vectors])
collection.flush()
index = {"index_type": "DISKANN", "metric_type": "COSINE", "params": {}}
collection.create_index("float_vector", index)
collection.load()

limit = 1000
# Baseline: plain top-k search restricted by the filter expression
search_params = {"metric_type": "COSINE", "params": {"search_list": 1000}}
expr = "0 <= int64 < 1000"
res = collection.search(vectors[:nq], "float_vector", search_params, limit, expr)
print(len(res[0]))

# Same query as a range search: radius/range_filter span the full COSINE range
search_params = {"metric_type": "COSINE", "params": {"search_list": 1000, "radius": -1, "range_filter": 1}}
res = collection.search(vectors[:nq], "float_vector", search_params, limit, expr)
print(len(res[0]))

NicoYuan1986 avatar Nov 29 '23 07:11 NicoYuan1986

Similar situation with the SCANN index: the range search returns far fewer results (4 instead of 1000).

>>> 
>>> limit = 1000
>>> search_params = {"metric_type": "COSINE", "params": {"nprobe": 1000, "reorder_k": 1000}}
>>> expr = "0 <= int64 < 1000"
>>> res = collection.search(vectors[:nq], "float_vector", search_params, limit, expr)
>>> print(len(res[0]))
1000
>>> 
>>> search_params = {"metric_type": "COSINE", "params": {"nprobe": 1000, "reorder_k": 1000, "radius": -1, "range_filter": 1}}
>>> res = collection.search(vectors[:nq], "float_vector", search_params, limit, expr)
>>> print(len(res[0]))
4

NicoYuan1986 avatar Nov 29 '23 08:11 NicoYuan1986

/assign @liliu-z
It seems that this reproduces on HNSW, DISKANN and SCANN; please help check. Similar to #28821.
/unassign

yanliang567 avatar Nov 29 '23 08:11 yanliang567

/assign @cydrain

liliu-z avatar Nov 30 '23 01:11 liliu-z

May be related to #28810.

NicoYuan1986 avatar Dec 04 '23 02:12 NicoYuan1986

The COSINE metric type should produce distances in the range [-1.0, 1.0], but DISKANN with COSINE returns a distance of 2.0. Need @hhy3 to help check.

reproduce script:

from pymilvus import CollectionSchema, FieldSchema, Collection, connections, DataType, Partition, utility
import numpy as np
import random

nb = 20000
nq = 1
dim = 64
INDEX_TYPE_ = "DISKANN"
METRIC_TYPE_ = "COSINE"

def main():
    int64_field = FieldSchema(name="int64", dtype=DataType.INT64, is_primary=True)
    float_field = FieldSchema(name="float", dtype=DataType.FLOAT)
    float_vector = FieldSchema(name="float_vector", dtype=DataType.FLOAT_VECTOR, dim=dim)
    schema = CollectionSchema(fields=[int64_field, float_field, float_vector])

    connections.connect(host="", port="19530")
    collection = Collection("test_diskann", schema=schema)

    vectors = [[random.random() for _ in range(dim)] for _ in range(nb)]
    collection.insert([[i for i in range(nb)], [np.float32(i) for i in range(nb)], vectors])
    collection.flush()
    index = {"index_type": INDEX_TYPE_, "metric_type": METRIC_TYPE_, "params": {"efConstruction": 360, "M": 30}}
    collection.create_index("float_vector", index)
    collection.load()

    limit = 1000
    search_params = {"metric_type": METRIC_TYPE_, "params": {"search_list": 1000, "ef": 1000}}
    expr = "0 <= int64 < 1000"
    res = collection.search(vectors[:nq], "float_vector", search_params, limit, expr)
    print(len(res[0]))
    print(res[0][0])
    print(res[0][-1])

    search_params = {"metric_type": METRIC_TYPE_, "params": {"search_list": 1000, "radius": -1, "range_filter": 1}}
    res = collection.search(vectors[:nq], "float_vector", search_params, limit, expr)
    print(len(res[0]))

    collection.drop()

if __name__ == '__main__':
    main()

We can see a COSINE distance of 1.99999:

id: 0, distance: 1.9999998807907104, entity: {}
id: 17, distance: 1.5845264196395874, entity: {}
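
As a sanity check on the expected bounds, the sketch below computes cosine similarity in plain Python and flags values outside [-1.0, 1.0]. The helpers `cosine_similarity` and `out_of_bounds` are illustrative names, not Milvus APIs. If the range filter is applied to the reported values, anything above 1.0 would fall outside range_filter=1, which is one plausible reading of why the range search comes back empty.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def out_of_bounds(distances, lo=-1.0, hi=1.0, eps=1e-6):
    """Return the values that cannot be valid cosine similarities."""
    return [d for d in distances if d < lo - eps or d > hi + eps]


random.seed(0)
dim = 64
query = [random.random() for _ in range(dim)]
other = [random.random() for _ in range(dim)]

self_sim = cosine_similarity(query, query)   # a vector compared with itself
cross_sim = cosine_similarity(query, other)  # two unrelated random vectors

# The two distances reported in this comment
reported = [1.9999998807907104, 1.5845264196395874]
suspicious = out_of_bounds(reported)
print(self_sim, cross_sim, suspicious)
```

Both reported values are flagged, while the locally computed similarities stay within bounds.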

cydrain avatar Dec 04 '23 10:12 cydrain

/assign @hhy3

cydrain avatar Dec 04 '23 10:12 cydrain

Hi @NicoYuan1986, please check this issue again

cydrain avatar Dec 12 '23 02:12 cydrain

Too slow on master-20231213-fe1eeae2: search takes 8s, while range search takes 1107s (18.45 min). Is that expected? @cydrain @liliu-z

Because of this, the search iterator takes a huge amount of time: every page takes about 18 min.

image

For the search iterator, this increases the risk of timeouts.

NicoYuan1986 avatar Dec 14 '23 07:12 NicoYuan1986

Hi @NicoYuan1986, it does not make sense to set "radius" to -1. With this parameter, DISKANN range search returns almost all of the data, like a brute-force scan, which is why the performance is bad. Please set the radius to a meaningful value, then compare its performance with search.
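
To illustrate the point about radius selectivity, here is a small brute-force sketch in plain Python. It is not how DISKANN executes a range search; it only counts how many random vectors (generated the same way as in the reproduce script, but with a smaller `nb`) pass a given radius. The threshold 0.85 is an arbitrary example value and `count_in_range` is a hypothetical helper.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def count_in_range(sims, radius, range_filter=1.0):
    """Count similarities with radius < s <= range_filter (small float tolerance on top)."""
    return sum(1 for s in sims if radius < s <= range_filter + 1e-9)


random.seed(42)
dim, nb = 64, 2000
vectors = [[random.random() for _ in range(dim)] for _ in range(nb)]
query = vectors[0]
sims = [cosine_similarity(query, v) for v in vectors]

all_hits = count_in_range(sims, radius=-1.0)    # radius=-1 admits every vector
tight_hits = count_in_range(sims, radius=0.85)  # a tighter radius admits far fewer
print(all_hits, tight_hits)
```

With components drawn from [0, 1), every similarity is positive, so radius=-1 filters nothing and the search degenerates into scanning the whole filtered set.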

cydrain avatar Dec 15 '23 08:12 cydrain

For range search, changing the value of 'radius' improves the performance.

image

Search: 2.9s; range search: 0.98s

However, for the search iterator, it is still not good.

image

The average time for the first 10 pages is 437s, which is unreasonable. @cydrain @yanliang567

NicoYuan1986 avatar Dec 16 '23 02:12 NicoYuan1986

that doesn't make sense to me?

xiaofan-luan avatar Dec 17 '23 13:12 xiaofan-luan

/assign MrPresent-Han

MrPresent-Han avatar Dec 18 '23 03:12 MrPresent-Han

/assign NicoYuan1986

Can you help verify this modification?

MrPresent-Han avatar Dec 22 '23 02:12 MrPresent-Han

page 1, 10000 results, cost 0.0047s
page 2, 10000 results, cost 1553.5283s
page 3, 10000 results, cost 15.8129s
page 4, 10000 results, cost 0.0063s
page 5, 10000 results, cost 10.1714s
page 6, 10000 results, cost 15.517s
page 7, 10000 results, cost 0.0067s
page 8, 10000 results, cost 10.5434s
page 9, 10000 results, cost 22.8054s
page 10, 10000 results, cost 24.5598s
page 11, 10000 results, cost 0.0055s
page 12, 10000 results, cost 24.049s
page 13, 10000 results, cost 23.2669s
page 14, 10000 results, cost 20.3911s
page 15, 10000 results, cost 27.6964s
page 16, 10000 results, cost 30.8896s
page 17, 10000 results, cost 30.6418s
page 18, 10000 results, cost 28.0768s
page 19, 10000 results, cost 65.4702s
page 20, 10000 results, cost 57.9992s
page 21, 10000 results, cost 62.9765s
page 22, 10000 results, cost 0.0053s
page 23, 10000 results, cost 63.2601s
page 24, 10000 results, cost 60.7237s
page 25, 10000 results, cost 62.3825s
page 26, 10000 results, cost 61.3414s
page 27, 10000 results, cost 63.2385s
page 28, 10000 results, cost 62.8683s
page 29, 10000 results, cost 64.5355s
page 30, 10000 results, cost 63.026s
page 31, 10000 results, cost 66.8551s
page 32, 10000 results, cost 86.7241s
page 33, 10000 results, cost 76.507s
page 34, 10000 results, cost 55.3079s
page 35, 10000 results, cost 66.416s
page 36, 10000 results, cost 66.4382s
page 37, 10000 results, cost 66.787s
page 38, 10000 results, cost 66.503s
page 39, 10000 results, cost 66.2221s
page 40, 10000 results, cost 67.2244s
page 41, 10000 results, cost 68.0742s
page 42, 10000 results, cost 68.237s
page 43, 10000 results, cost 68.6832s
page 44, 10000 results, cost 68.9287s
page 45, 10000 results, cost 71.0643s
page 46, 10000 results, cost 70.8987s
page 47, 10000 results, cost 69.6963s
page 48, 10000 results, cost 71.19s
page 49, 10000 results, cost 70.5984s
page 50, 10000 results, cost 71.1868s

@MrPresent-Han Seems much better than before. version: pymilvus==2.4.0rc7

NicoYuan1986 avatar Dec 22 '23 06:12 NicoYuan1986

@cydrain

Could it be much faster if we changed it to cache the iterator?

xiaofan-luan avatar Dec 22 '23 10:12 xiaofan-luan

/assign @cydrain
Is this ready for verification?
/unassign @MrPresent-Han

yanliang567 avatar Mar 05 '24 03:03 yanliang567

> @cydrain
>
> could it be much faster if we change it to cache the iterator

Hi @xiaofan-luan, this iterator is implemented in pymilvus. It uses the range search result of the previous run to update the radius and range_filter for the current run, so we cannot use a cache here.
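
A rough sketch of that paging idea, assuming COSINE semantics (radius < similarity <= range_filter) and a brute-force stand-in for the server. This is not the pymilvus implementation; `range_search` and `iterate_pages` are hypothetical names, and `sims` is synthetic data standing in for per-entity similarities. It only shows why every page issues a fresh range search whose bounds depend on the previous page.

```python
import random


def range_search(sims, radius, range_filter, limit):
    """Stand-in for a server-side range search: best matches first."""
    hits = [(i, s) for i, s in enumerate(sims) if radius < s <= range_filter]
    hits.sort(key=lambda h: h[1], reverse=True)
    return hits[:limit]


def iterate_pages(sims, batch_size, radius=-1.0, range_filter=1.0):
    """Yield pages, tightening range_filter to the last similarity of each page."""
    boundary_ids = set()  # ids already returned whose similarity equals range_filter
    while True:
        # Over-fetch so that already-returned boundary ids can be dropped
        hits = range_search(sims, radius, range_filter, batch_size + len(boundary_ids))
        page = [(i, s) for i, s in hits if i not in boundary_ids][:batch_size]
        if not page:
            return
        yield page
        last_sim = page[-1][1]
        tied = {i for i, s in page if s == last_sim}
        # Keep accumulating ids while the bound does not move; otherwise reset
        boundary_ids = boundary_ids | tied if last_sim == range_filter else tied
        range_filter = last_sim


random.seed(7)
sims = [random.uniform(0.0, 1.0) for _ in range(1000)]  # synthetic similarities
pages = list(iterate_pages(sims, batch_size=100))
returned_ids = [i for page in pages for i, _ in page]
print(len(pages), len(returned_ids))
```

Because the bounds for page N are only known after page N-1 has been consumed, each page in this sketch costs one full range search.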

cydrain avatar Mar 11 '24 07:03 cydrain

/assign @MrPresent-Han
Any ideas?
/unassign @cydrain

yanliang567 avatar Mar 11 '24 07:03 yanliang567

no more comments

MrPresent-Han avatar Mar 11 '24 08:03 MrPresent-Han

@cydrain @xiaofan-luan @MrPresent-Han So the current behavior above is the best you can offer for now?

yanliang567 avatar Mar 11 '24 09:03 yanliang567

@cydrain @MrPresent-Han

Why does it take that long for only 10000 results?

page 2, 10000 results, cost 1553.5283s

xiaofan-luan avatar Mar 11 '24 17:03 xiaofan-luan

I think 1s or a few hundred milliseconds would be reasonable. Is this the right benchmark result?

xiaofan-luan avatar Mar 11 '24 17:03 xiaofan-luan

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Apr 13 '24 05:04 stale[bot]

keep it

yanliang567 avatar Apr 15 '24 01:04 yanliang567

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Jun 10 '24 06:06 stale[bot]