
[Bug]: Search result length is not equal to the limit(topK) value after reinstallation

Open zhuwenxing opened this issue 1 year ago • 3 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version:2.2.0-20230601-5710752f
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka): kafka   
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2023-06-01T13:05:03.722Z] 2023-06-01 13:05:03.302 | INFO     | MainThread |utils:load_and_search:206 - collection name: task_2_IVF_PQ

[2023-06-01T13:05:03.722Z] 2023-06-01 13:05:03.302 | INFO     | MainThread |utils:load_and_search:207 - load collection

[2023-06-01T13:05:03.722Z] 2023-06-01 13:05:03.309 | INFO     | MainThread |utils:load_and_search:211 - load time: 0.0070

[2023-06-01T13:05:03.722Z] 2023-06-01 13:05:03.320 | INFO     | MainThread |utils:load_and_search:225 - {'metric_type': 'L2', 'params': {'nprobe': 10}}

[2023-06-01T13:05:03.722Z] 2023-06-01 13:05:03.320 | INFO     | MainThread |utils:load_and_search:228 - 

[2023-06-01T13:05:03.722Z] Search...

[2023-06-01T13:05:03.722Z] 2023-06-01 13:05:03.327 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 930, distance: 28.98775291442871, entity: {'count': 930, 'random_value': -13.0}

[2023-06-01T13:05:03.722Z] 2023-06-01 13:05:03.327 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 2343, distance: 31.38789176940918, entity: {'count': 2343, 'random_value': -16.0}

[2023-06-01T13:05:03.722Z] 2023-06-01 13:05:03.327 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 1325, distance: 31.5164852142334, entity: {'count': 1325, 'random_value': -15.0}

[2023-06-01T13:05:03.722Z] 2023-06-01 13:05:03.327 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 2867, distance: 32.024906158447266, entity: {'count': 2867, 'random_value': -18.0}

[2023-06-01T13:05:03.722Z] Traceback (most recent call last):

[2023-06-01T13:05:03.722Z]   File "scripts/action_after_reinstall.py", line 47, in <module>

[2023-06-01T13:05:03.722Z]     task_2(data_size, host)

[2023-06-01T13:05:03.722Z]   File "scripts/action_after_reinstall.py", line 29, in task_2

[2023-06-01T13:05:03.722Z]     load_and_search(prefix)

[2023-06-01T13:05:03.722Z]   File "/home/jenkins/agent/workspace/tests/python_client/deploy/scripts/utils.py", line 241, in load_and_search

[2023-06-01T13:05:03.722Z]     assert len(ids) == topK, f"get {len(ids)} results, but topK is {topK}"

[2023-06-01T13:05:03.722Z] AssertionError: get 4 results, but topK is 5

Expected Behavior

len(ids) == topK
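
Below is a minimal pymilvus sketch of the check that fails (hedged: this is not the actual `utils.load_and_search` helper from the test repo, and the vector field name and query vector are assumptions):

```python
# Minimal sketch of the failing check, assuming pymilvus 2.x.
# The real check lives in tests/python_client/deploy/scripts/utils.py (load_and_search).
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")

topK = 5
collection = Collection("task_2_IVF_PQ")
collection.load()

res = collection.search(
    data=[[1.0] * 128],                                    # query vector (assumed all-ones, dim=128)
    anns_field="float_vector",                             # vector field name is an assumption
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=topK,
    output_fields=["count", "random_value"],
)

ids = res[0].ids
assert len(ids) == topK, f"get {len(ids)} results, but topK is {topK}"
```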

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release_cron/detail/deploy_test_kafka_for_release_cron/993/pipeline

log:

artifacts-kafka-cluster-reinstall-993-server-first-deployment-logs.tar.gz

artifacts-kafka-cluster-reinstall-993-server-second-deployment-logs.tar.gz

artifacts-kafka-cluster-reinstall-993-pytest-logs.tar.gz

Anything else?

No response

zhuwenxing avatar Jun 02 '23 03:06 zhuwenxing

/assign @jiaoew1991 /unassign

yanliang567 avatar Jun 02 '23 04:06 yanliang567

/assign @chyezh

xiaofan-luan avatar Jun 02 '23 06:06 xiaofan-luan

It seems there is no data loss after reinstallation: all data has been flushed, so the problem cannot be caused by growing segments.

The problem may arise in the computation logic with special input; I will try to reproduce it.

chyezh avatar Jun 08 '23 03:06 chyezh

version: 2.2.0-20230612-ae2fe478

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release_cron/detail/deploy_test_kafka_for_release_cron/1044/pipeline

[2023-06-12T13:05:57.358Z] 2023-06-12 13:05:57.198 | INFO     | MainThread |utils:load_and_search:206 - collection name: task_1_IVF_FLAT

[2023-06-12T13:05:57.358Z] 2023-06-12 13:05:57.198 | INFO     | MainThread |utils:load_and_search:207 - load collection

[2023-06-12T13:05:57.358Z] 2023-06-12 13:05:57.203 | INFO     | MainThread |utils:load_and_search:211 - load time: 0.0050

[2023-06-12T13:05:57.358Z] 2023-06-12 13:05:57.216 | INFO     | MainThread |utils:load_and_search:225 - {'metric_type': 'L2', 'params': {'nprobe': 10}}

[2023-06-12T13:05:57.358Z] 2023-06-12 13:05:57.216 | INFO     | MainThread |utils:load_and_search:228 - 

[2023-06-12T13:05:57.358Z] Search...

[2023-06-12T13:05:57.358Z] 2023-06-12 13:05:57.220 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 976, distance: 29.795345306396484, entity: {'count': 976, 'random_value': -15.0}

[2023-06-12T13:05:57.358Z] 2023-06-12 13:05:57.221 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 766, distance: 30.546741485595703, entity: {'count': 766, 'random_value': -11.0}

[2023-06-12T13:05:57.358Z] 2023-06-12 13:05:57.221 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 2403, distance: 31.58251953125, entity: {'count': 2403, 'random_value': -17.0}

[2023-06-12T13:05:57.358Z] 2023-06-12 13:05:57.221 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 2486, distance: 32.51908874511719, entity: {'count': 2486, 'random_value': -12.0}

[2023-06-12T13:05:57.358Z] Traceback (most recent call last):

[2023-06-12T13:05:57.358Z]   File "scripts/action_after_reinstall.py", line 46, in <module>

[2023-06-12T13:05:57.358Z]     task_1(data_size, host)

[2023-06-12T13:05:57.358Z]   File "scripts/action_after_reinstall.py", line 14, in task_1

[2023-06-12T13:05:57.358Z]     load_and_search(prefix)

[2023-06-12T13:05:57.358Z]   File "/home/jenkins/agent/workspace/tests/python_client/deploy/scripts/utils.py", line 241, in load_and_search

[2023-06-12T13:05:57.358Z]     assert len(ids) == topK, f"get {len(ids)} results, but topK is {topK}"

[2023-06-12T13:05:57.358Z] AssertionError: get 4 results, but topK is 5

log:

artifacts-kafka-standalone-reinstall-1044-pytest-logs.tar.gz

Uploading artifacts-kafka-standalone-reinstall-1044-server-first-deployment-logs.tar.gz…

artifacts-kafka-standalone-reinstall-1044-server-second-deployment-logs.tar.gz

zhuwenxing avatar Jun 13 '23 02:06 zhuwenxing

/assign @congqixia Please take a look; the search or query result is partial.

zhuwenxing avatar Jun 13 '23 03:06 zhuwenxing

It reproduced again with image tag 2.2.0-20230707-511173a0. Failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release_cron/detail/deploy_test_kafka_for_release_cron/1179/pipeline

log:

artifacts-kafka-cluster-reinstall-1179-pytest-logs.tar.gz artifacts-kafka-cluster-reinstall-1179-server-first-deployment-logs.tar.gz artifacts-kafka-cluster-reinstall-1179-server-second-deployment-logs.tar.gz

zhuwenxing avatar Jul 10 '23 02:07 zhuwenxing

Setup

  • CollectionName: task_1_IVF_FLAT
  • dim = 128, values in [0, 1]
  • {"index_type": "IVF_FLAT", "params": {"nlist": 128}, "metric_type": "L2"}
  • Average cluster size: 6000/128 = 46.875
  • Search operation:
    • search_vec: [1; 128] (a 128-dim all-ones vector), search_param: {nprobe: 10}
    • filter: count > 500 (about 11/12 of entities pass)

Debug

No segments lost here

  • After reinstalling Milvus, only one segment is searched.
  • segmentID: 442736548730561398
  • Before reinstalling Milvus, 13 segments are searched.
  • Compaction happened before the reinstall, and no segment was lost.

The difference between the two search operations: the index is used after reinstalling, but not before reinstalling.

  • Before reinstalling Milvus, only one segment (442736548729940644) uses the index; the others do not.
  • After reinstalling Milvus, the index is fully used on that one segment.

Is it possible that, with IVF_FLAT, 10 vectors were recalled from the 10 probed IVF clusters, but 6 of them were filtered out by the expression count > 500? The search vector is [1, 1, 1, 1, ...], which lies at a corner of the vector space.
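
To make that hypothesis concrete, here is a hedged pymilvus sketch of the setup described above (names beyond the listed parameters are assumptions). If the 10 probed IVF clusters, combined with the count > 500 filter, leave fewer than topK candidates, the search can return fewer than limit hits:

```python
# Hedged sketch of the described setup, assuming pymilvus 2.x; field names are illustrative.
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")

collection = Collection("task_1_IVF_FLAT")
collection.create_index(
    field_name="float_vector",
    index_params={"index_type": "IVF_FLAT", "params": {"nlist": 128}, "metric_type": "L2"},
)
collection.load()

# The all-ones query vector sits at a corner of the [0, 1]^128 data distribution,
# so the 10 probed clusters may hold few entities that also satisfy the filter.
res = collection.search(
    data=[[1.0] * 128],
    anns_field="float_vector",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=5,
    expr="count > 500",
)
print(len(res[0].ids))  # may be < 5 if probing plus filtering leaves too few candidates
```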

chyezh avatar Jul 14 '23 07:07 chyezh

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Sep 05 '23 00:09 stale[bot]

@zhuwenxing @chyezh any updates?

yanliang567 avatar Sep 05 '23 01:09 yanliang567

image: 2.3.0-20230918-dde27711-amd64

[2023-09-18T13:38:15.243Z] 2023-09-18 13:38:15.140 | INFO     | MainThread |utils:load_and_search:259 - ###########

[2023-09-18T13:38:15.243Z] 2023-09-18 13:38:15.143 | INFO     | MainThread |utils:load_and_search:206 - collection name: task_2_IVF_FLAT

[2023-09-18T13:38:15.243Z] 2023-09-18 13:38:15.143 | INFO     | MainThread |utils:load_and_search:207 - load collection

[2023-09-18T13:38:19.400Z] 2023-09-18 13:38:19.232 | INFO     | MainThread |utils:load_and_search:211 - load time: 4.0887

[2023-09-18T13:38:19.400Z] 2023-09-18 13:38:19.243 | INFO     | MainThread |utils:load_and_search:225 - {'metric_type': 'L2', 'params': {'nprobe': 10}}

[2023-09-18T13:38:19.400Z] 2023-09-18 13:38:19.243 | INFO     | MainThread |utils:load_and_search:228 - 

[2023-09-18T13:38:19.400Z] Search...

[2023-09-18T13:38:19.655Z] 2023-09-18 13:38:19.423 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 764, distance: 30.432262420654297, entity: {'count': 764, 'random_value': -18.0}

[2023-09-18T13:38:19.655Z] 2023-09-18 13:38:19.423 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 2455, distance: 31.647565841674805, entity: {'count': 2455, 'random_value': -17.0}

[2023-09-18T13:38:19.655Z] 2023-09-18 13:38:19.423 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 2424, distance: 32.878353118896484, entity: {'count': 2424, 'random_value': -17.0}

[2023-09-18T13:38:19.655Z] 2023-09-18 13:38:19.423 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 2737, distance: 33.31123733520508, entity: {'count': 2737, 'random_value': -14.0}

[2023-09-18T13:38:19.655Z] Traceback (most recent call last):

[2023-09-18T13:38:19.655Z]   File "scripts/action_after_reinstall.py", line 47, in <module>

[2023-09-18T13:38:19.655Z]     task_2(data_size, host)

[2023-09-18T13:38:19.655Z]   File "scripts/action_after_reinstall.py", line 33, in task_2

[2023-09-18T13:38:19.655Z]     load_and_search(prefix)

[2023-09-18T13:38:19.655Z]   File "/home/jenkins/agent/workspace/tests/python_client/deploy/scripts/utils.py", line 241, in load_and_search

[2023-09-18T13:38:19.655Z]     assert len(ids) == topK, f"get {len(ids)} results, but topK is {topK}"

[2023-09-18T13:38:19.655Z] AssertionError: get 4 results, but topK is 5

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release_cron/detail/deploy_test_kafka_for_release_cron/1446/pipeline

log: artifacts-kafka-standalone-reinstall-1450-pytest-logs.tar.gz artifacts-kafka-standalone-reinstall-1450-server-first-deployment-logs.tar.gz artifacts-kafka-standalone-reinstall-1450-server-second-deployment-logs.tar.gz

zhuwenxing avatar Sep 19 '23 02:09 zhuwenxing

Failed again. Failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/deploy_test_kafka_for_release_cron/detail/deploy_test_kafka_for_release_cron/1446/pipeline Log: artifacts-kafka-standalone-reinstall-1446-pytest-logs (1).tar.gz artifacts-kafka-standalone-reinstall-1446-server-first-deployment-logs (1).tar.gz artifacts-kafka-standalone-reinstall-1446-server-second-deployment-logs (1).tar.gz

[2023-09-18T13:38:15.243Z] 2023-09-18 13:38:15.140 | INFO     | MainThread |utils:load_and_search:257 - query latency: 0.0047s

[2023-09-18T13:38:15.243Z] 2023-09-18 13:38:15.140 | INFO     | MainThread |utils:load_and_search:259 - ###########

[2023-09-18T13:38:15.243Z] 2023-09-18 13:38:15.143 | INFO     | MainThread |utils:load_and_search:206 - collection name: task_2_IVF_FLAT

[2023-09-18T13:38:15.243Z] 2023-09-18 13:38:15.143 | INFO     | MainThread |utils:load_and_search:207 - load collection

[2023-09-18T13:38:19.400Z] 2023-09-18 13:38:19.232 | INFO     | MainThread |utils:load_and_search:211 - load time: 4.0887

[2023-09-18T13:38:19.400Z] 2023-09-18 13:38:19.243 | INFO     | MainThread |utils:load_and_search:225 - {'metric_type': 'L2', 'params': {'nprobe': 10}}

[2023-09-18T13:38:19.400Z] 2023-09-18 13:38:19.243 | INFO     | MainThread |utils:load_and_search:228 - 

[2023-09-18T13:38:19.400Z] Search...

[2023-09-18T13:38:19.655Z] 2023-09-18 13:38:19.423 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 764, distance: 30.432262420654297, entity: {'count': 764, 'random_value': -18.0}

[2023-09-18T13:38:19.655Z] 2023-09-18 13:38:19.423 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 2455, distance: 31.647565841674805, entity: {'count': 2455, 'random_value': -17.0}

[2023-09-18T13:38:19.655Z] 2023-09-18 13:38:19.423 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 2424, distance: 32.878353118896484, entity: {'count': 2424, 'random_value': -17.0}

[2023-09-18T13:38:19.655Z] 2023-09-18 13:38:19.423 | INFO     | MainThread |utils:load_and_search:239 - hit: id: 2737, distance: 33.31123733520508, entity: {'count': 2737, 'random_value': -14.0}

[2023-09-18T13:38:19.655Z] Traceback (most recent call last):

[2023-09-18T13:38:19.655Z]   File "scripts/action_after_reinstall.py", line 47, in <module>

[2023-09-18T13:38:19.655Z]     task_2(data_size, host)

[2023-09-18T13:38:19.655Z]   File "scripts/action_after_reinstall.py", line 33, in task_2

[2023-09-18T13:38:19.655Z]     load_and_search(prefix)

[2023-09-18T13:38:19.655Z]   File "/home/jenkins/agent/workspace/tests/python_client/deploy/scripts/utils.py", line 241, in load_and_search

[2023-09-18T13:38:19.655Z]     assert len(ids) == topK, f"get {len(ids)} results, but topK is {topK}"

[2023-09-18T13:38:19.655Z] AssertionError: get 4 results, but topK is 5

zhuwenxing avatar Sep 21 '23 03:09 zhuwenxing

I have reproduced the same problem with rocksmq in a no-chaos environment.

In this test case, 3000 new vectors are always inserted after reinstallation with the same primary keys (field count) as the existing vectors.

On searching, there is one segment. Several vectors sharing the same primary key were returned from the IVF index on that segment and were deduplicated at reduce time, which leaves fewer than topK results. This is expected behavior under the current Milvus implementation, not a bug. Please modify the test case to avoid duplicate primary keys.
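
For clarity, a hedged sketch (illustrative schema, assuming pymilvus 2.x) of the situation described: re-inserting entities whose primary keys already exist produces multiple rows per PK, and when several of them land in the topK candidate set they are deduplicated during result reduction, so the final hit list can be shorter than limit:

```python
# Hedged, illustrative sketch of how duplicate primary keys can shrink a search result.
import random
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(host="localhost", port="19530")

schema = CollectionSchema([
    FieldSchema("count", DataType.INT64, is_primary=True),        # explicit PK, no auto-id
    FieldSchema("random_value", DataType.DOUBLE),
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=128),
])
collection = Collection("dedup_demo", schema)

def make_batch(n):
    # Same primary keys (0..n-1) every time this is called.
    return [
        list(range(n)),
        [float(-random.randint(10, 20)) for _ in range(n)],
        [[random.random() for _ in range(128)] for _ in range(n)],
    ]

collection.insert(make_batch(3000))   # first deployment
collection.insert(make_batch(3000))   # after reinstall: same PKs inserted again
collection.flush()

# Each PK now has two rows. If both rows of a PK appear among the topK candidates,
# the reduce step keeps only one of them, so len(hits.ids) can drop below limit.
```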

/assign @zhuwenxing /unassign

chyezh avatar Sep 22 '23 02:09 chyezh