[Bug]: [benchmark] diskann inserted 100 million data, load failed, and reported "collection xxx has not been loaded to memory or load failed"

Open elstic opened this issue 1 year ago • 3 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: master-20230506-7f5294b1
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.0.dev12
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The memory resources are sufficient, but the load fails and "collection xxx has not been loaded to memory or load failed" is reported.

argo task: fouramf-shtt7-kg49g, resource:

    dataNode:
      replicas: 1
      resources:
        requests:
            memory: 64Gi
            cpu: 8.0
        limits:
            memory: 64Gi
            cpu: 8.0
    indexNode:
      replicas: 1
      resources:
        requests:
            memory: 64Gi
            cpu: 8.0
        limits:
            memory: 64Gi
            cpu: 8.0
            ephemeral-storage: 256Gi
    queryNode:
      replicas: 1
      resources:
        requests:
          memory: 64Gi
          cpu: 8.0
        limits:
          memory: 64Gi
          cpu: 8.0
          ephemeral-storage: 256Gi

server:

fouramf-shtt7-kg49g-97-6367-etcd-0                                1/1     Running     0               4m18s   10.104.24.172   4am-node29   <none>           <none>
fouramf-shtt7-kg49g-97-6367-etcd-1                                1/1     Running     0               4m18s   10.104.22.9     4am-node26   <none>           <none>
fouramf-shtt7-kg49g-97-6367-etcd-2                                1/1     Running     0               4m17s   10.104.6.153    4am-node13   <none>           <none>
fouramf-shtt7-kg49g-97-6367-milvus-datacoord-5bbcb78b65-kprns     1/1     Running     0               4m18s   10.104.24.167   4am-node29   <none>           <none>
fouramf-shtt7-kg49g-97-6367-milvus-datanode-55d494bd48-79tc2      1/1     Running     0               4m18s   10.104.23.92    4am-node27   <none>           <none>
fouramf-shtt7-kg49g-97-6367-milvus-indexcoord-67d56f59cd-fgmlg    1/1     Running     0               4m18s   10.104.19.67    4am-node28   <none>           <none>
fouramf-shtt7-kg49g-97-6367-milvus-indexnode-6cffd4d966-vq9jb     1/1     Running     0               4m18s   10.104.24.169   4am-node29   <none>           <none>
fouramf-shtt7-kg49g-97-6367-milvus-proxy-556c76586-qfd9h          1/1     Running     0               4m18s   10.104.24.170   4am-node29   <none>           <none>
fouramf-shtt7-kg49g-97-6367-milvus-querycoord-75497595df-7gvcl    1/1     Running     0               4m18s   10.104.24.168   4am-node29   <none>           <none>
fouramf-shtt7-kg49g-97-6367-milvus-querynode-7678576b76-xm7nz     1/1     Running     0               4m18s   10.104.19.66    4am-node28   <none>           <none>
fouramf-shtt7-kg49g-97-6367-milvus-rootcoord-746c94497c-26wrl     1/1     Running     0               4m18s   10.104.19.68    4am-node28   <none>           <none>
fouramf-shtt7-kg49g-97-6367-minio-0                               1/1     Running     0               4m17s   10.104.6.152    4am-node13   <none>           <none>
fouramf-shtt7-kg49g-97-6367-minio-1                               1/1     Running     0               4m17s   10.104.22.12    4am-node26   <none>           <none>
fouramf-shtt7-kg49g-97-6367-minio-2                               1/1     Running     0               4m17s   10.104.5.179    4am-node12   <none>           <none>
fouramf-shtt7-kg49g-97-6367-minio-3                               1/1     Running     0               4m17s   10.104.20.185   4am-node22   <none>           <none>
fouramf-shtt7-kg49g-97-6367-pulsar-bookie-0                       1/1     Running     0               4m18s   10.104.24.174   4am-node29   <none>           <none>
fouramf-shtt7-kg49g-97-6367-pulsar-bookie-1                       1/1     Running     0               4m17s   10.104.22.14    4am-node26   <none>           <none>
fouramf-shtt7-kg49g-97-6367-pulsar-bookie-2                       1/1     Running     0               4m17s   10.104.6.159    4am-node13   <none>           <none>
fouramf-shtt7-kg49g-97-6367-pulsar-bookie-init-s7gjq              0/1     Completed   0               4m18s   10.104.24.162   4am-node29   <none>           <none>
fouramf-shtt7-kg49g-97-6367-pulsar-broker-0                       1/1     Running     0               4m18s   10.104.24.160   4am-node29   <none>           <none>
fouramf-shtt7-kg49g-97-6367-pulsar-proxy-0                        1/1     Running     0               4m18s   10.104.19.69    4am-node28   <none>           <none>
fouramf-shtt7-kg49g-97-6367-pulsar-pulsar-init-6ztch              0/1     Completed   0               4m18s   10.104.24.161   4am-node29   <none>           <none>
fouramf-shtt7-kg49g-97-6367-pulsar-recovery-0                     1/1     Running     0               4m18s   10.104.22.7     4am-node26   <none>           <none>
fouramf-shtt7-kg49g-97-6367-pulsar-zookeeper-0                    1/1     Running     0               4m18s   10.104.24.173   4am-node29   <none>           <none>
fouramf-shtt7-kg49g-97-6367-pulsar-zookeeper-1                    1/1     Running     0               3m32s   10.104.5.181    4am-node12   <none>           <none>
fouramf-shtt7-kg49g-97-6367-pulsar-zookeeper-2                    1/1     Running     0               2m57s   10.104.22.16    4am-node26   <none>           <none> 

client log:

[2023-05-08 05:24:05,050 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_xBvJKZYG): 99800000 (base.py:318)
[2023-05-08 05:24:05,239 -  INFO - fouram]: [Base] Start inserting, ids: 99850000 - 99899999, data size: 100,000,000 (base.py:164)
[2023-05-08 05:24:07,001 -  INFO - fouram]: [Time] Collection.insert run in 1.7616s (api_request.py:41)
[2023-05-08 05:24:07,004 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_xBvJKZYG): 99800000 (base.py:318)
[2023-05-08 05:24:07,942 -  INFO - fouram]: [Base] Start inserting, ids: 99900000 - 99949999, data size: 100,000,000 (base.py:164)
[2023-05-08 05:24:09,767 -  INFO - fouram]: [Time] Collection.insert run in 1.8244s (api_request.py:41)
[2023-05-08 05:24:09,770 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_xBvJKZYG): 99900000 (base.py:318)
[2023-05-08 05:24:09,937 -  INFO - fouram]: [Base] Start inserting, ids: 99950000 - 99999999, data size: 100,000,000 (base.py:164)
[2023-05-08 05:24:11,600 -  INFO - fouram]: [Time] Collection.insert run in 1.6626s (api_request.py:41)
[2023-05-08 05:24:11,604 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_xBvJKZYG): 99900000 (base.py:318)
[2023-05-08 05:24:11,691 -  INFO - fouram]: [Base] Total time of insert: 3135.3359s, average number of vector bars inserted per second: 31894.5093, average time to insert 50000 vectors per time: 1.5677s (base.py:235)
[2023-05-08 05:24:11,692 -  INFO - fouram]: [Base] Start flush collection fouram_xBvJKZYG (base.py:133)
[2023-05-08 05:24:14,714 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_xBvJKZYG): 100000000 (base.py:318)
[2023-05-08 05:24:14,719 -  INFO - fouram]: [Base] Params of index: {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}} (base.py:296)
[2023-05-08 05:24:14,719 -  INFO - fouram]: [Base] Start build index of DISKANN for collection fouram_xBvJKZYG, params:{'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}} (base.py:283)
[2023-05-08 23:00:39,196 -  INFO - fouram]: [Time] Index run in 63384.4759s (api_request.py:41)
[2023-05-08 23:00:39,197 -  INFO - fouram]: [CommonCases] RT of build index DISKANN: 63384.4759s (common_cases.py:87)
[2023-05-08 23:00:39,200 -  INFO - fouram]: [Base] Params of index: {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}} (base.py:296)
[2023-05-08 23:00:39,200 -  INFO - fouram]: [CommonCases] Prepare index DISKANN done. (common_cases.py:90)
[2023-05-08 23:00:39,200 -  INFO - fouram]: [CommonCases] No scalars need to be indexed. (common_cases.py:95)
[2023-05-08 23:00:39,200 -  INFO - fouram]: [Base] Start load collection fouram_xBvJKZYG,replica_number:1,kwargs:{} (base.py:139)
[2023-05-08 23:10:41,078 - ERROR - fouram]: RPC error: [get_loading_progress], <MilvusException: (code=1, message=collection 441324366066876535 has not been loaded to memory or load failed)>, <Time:{'RPC start': '2023-05-08 23:10:41.076732', 'RPC error': '2023-05-08 23:10:41.078363'}> (decorators.py:108)
[2023-05-08 23:10:41,080 - ERROR - fouram]: RPC error: [wait_for_loading_collection], <MilvusException: (code=1, message=collection 441324366066876535 has not been loaded to memory or load failed)>, <Time:{'RPC start': '2023-05-08 23:00:39.217361', 'RPC error': '2023-05-08 23:10:41.080019'}> (decorators.py:108)
[2023-05-08 23:10:41,080 - ERROR - fouram]: RPC error: [load_collection], <MilvusException: (code=1, message=collection 441324366066876535 has not been loaded to memory or load failed)>, <Time:{'RPC start': '2023-05-08 23:00:39.200687', 'RPC error': '2023-05-08 23:10:41.080147'}> (decorators.py:108)
[2023-05-08 23:10:41,082 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=collection 441324366066876535 has not been loaded to memory or load failed)> (api_request.py:49)
[2023-05-08 23:10:41,082 - ERROR - fouram]: [CheckFunc] load request check failed, response:<MilvusException: (code=1, message=collection 441324366066876535 has not been loaded to memory or load failed)> (func_check.py:52)
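
For reference, the time between the load request and the failure can be read off the log timestamps above. The following is a minimal sketch that only parses two of the lines quoted above; the helper names are made up for illustration and are not part of the fouram client. The result, roughly ten minutes, only shows how long the client waited before the error surfaced; the log does not say whether that corresponds to a configured timeout.

```python
from datetime import datetime

# Timestamp layout used by the client log lines above: "YYYY-MM-DD HH:MM:SS,mmm".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

# Two lines copied verbatim from the client log above.
LOAD_START = (
    "[2023-05-08 23:00:39,200 -  INFO - fouram]: [Base] Start load collection "
    "fouram_xBvJKZYG,replica_number:1,kwargs:{} (base.py:139)"
)
LOAD_ERROR = (
    "[2023-05-08 23:10:41,082 - ERROR - fouram]: (api_response) : <MilvusException: "
    "(code=1, message=collection 441324366066876535 has not been loaded to memory "
    "or load failed)> (api_request.py:49)"
)


def log_time(line: str) -> datetime:
    """Extract the timestamp between the leading '[' and the first ' - '."""
    stamp = line[1:line.index(" - ")]
    return datetime.strptime(stamp, LOG_TIME_FORMAT)


def elapsed_seconds(start_line: str, end_line: str) -> float:
    """Seconds between two log lines (hypothetical helper for this note)."""
    return (log_time(end_line) - log_time(start_line)).total_seconds()


load_wait = elapsed_seconds(LOAD_START, LOAD_ERROR)
print(f"load failed after {load_wait:.3f}s")
```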

memory usage: image

Expected Behavior

No response

Steps To Reproduce

1. create a collection or use an existing collection
2. build index on vector column
3. insert a certain number of vectors
4. flush collection
5. build index on vector column with the same parameters
6. build index on scalar columns or not
7. count the total number of rows
8. load collection  ==> failed
9. perform concurrent operations
10. clean all collections or not
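
The steps above can be sketched with pymilvus roughly as follows. This is an illustrative outline, not the fouram benchmark code: the collection name, the `id` field, the host and port, the default `dim=128`, and the random vectors are all assumptions (the issue does not state the dataset used for the 100M case). Only the field name `float_vector`, the DISKANN index parameters, the batch size of 50000, and `replica_number=1` come from the logs. It covers steps 1, 3, 4, 5 and 8 only.

```python
from typing import Iterator, Tuple

# Index parameters as printed in the client log.
DISKANN_INDEX = {"index_type": "DISKANN", "metric_type": "L2", "params": {}}


def batch_ranges(total: int, ni_per: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (first_id, last_id) ranges, e.g. 99950000 - 99999999."""
    for start in range(0, total, ni_per):
        yield start, min(start + ni_per, total) - 1


def reproduce(host: str = "localhost", port: str = "19530", dim: int = 128,
              total: int = 100_000_000, ni_per: int = 50_000) -> None:
    """Insert, flush, build a DISKANN index, then load (the failing step)."""
    # Imported lazily so the helpers above work without pymilvus installed.
    import random
    from pymilvus import (Collection, CollectionSchema, DataType,
                          FieldSchema, connections)

    connections.connect(alias="default", host=host, port=port)

    # Step 1: create a collection (name and schema are hypothetical).
    schema = CollectionSchema([
        FieldSchema("id", DataType.INT64, is_primary=True),
        FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=dim),
    ])
    collection = Collection("diskann_load_repro", schema)

    # Step 3: insert in batches of ni_per rows (random data as a stand-in;
    # generating 100M vectors this way is slow and only meant as a sketch).
    for first_id, last_id in batch_ranges(total, ni_per):
        ids = list(range(first_id, last_id + 1))
        vectors = [[random.random() for _ in range(dim)] for _ in ids]
        collection.insert([ids, vectors])

    # Step 4: flush, step 5: build the index, step 8: load.
    collection.flush()
    collection.create_index("float_vector", DISKANN_INDEX)
    collection.load(replica_number=1)
```

Calling `reproduce()` needs a running Milvus deployment; `batch_ranges` can be checked on its own.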

Milvus Log

No response

Anything else?

No response

elstic avatar May 09 '23 03:05 elstic

/assign @xige-16 /unassign

yanliang567 avatar May 09 '23 10:05 yanliang567

/assign @yah01 /unassign

jiaoew1991 avatar May 09 '23 11:05 jiaoew1991

update: image: master-20230530-b09e7aea, inserted 100m data, loaded successfully

elstic avatar Jun 01 '23 03:06 elstic

@elstic if it has not reproduced recently, please help close it.

yanliang567 avatar Jun 07 '23 01:06 yanliang567

This issue did not appear recently

elstic avatar Jun 08 '23 02:06 elstic

The issue has occurred again. image: master-20230504-e172f3e8

server:

fouram-93-8048-etcd-0                                             1/1     Running     0               9m21s   10.104.1.34     4am-node10   <none>           <none>
fouram-93-8048-etcd-1                                             1/1     Running     0               9m21s   10.104.16.91    4am-node21   <none>           <none>
fouram-93-8048-etcd-2                                             1/1     Running     0               9m21s   10.104.21.178   4am-node24   <none>           <none>
fouram-93-8048-milvus-datacoord-7595dbc5c4-kp887                  1/1     Running     1 (5m21s ago)   9m21s   10.104.5.96     4am-node12   <none>           <none>
fouram-93-8048-milvus-datanode-74b6bff58b-wxn5p                   1/1     Running     2 (110s ago)    9m21s   10.104.16.71    4am-node21   <none>           <none>
fouram-93-8048-milvus-indexcoord-6d5c55c6b5-zgf2p                 1/1     Running     0               9m21s   10.104.21.170   4am-node24   <none>           <none>
fouram-93-8048-milvus-indexnode-dc9dc4b9d-tmrjj                   1/1     Running     1 (5m21s ago)   9m21s   10.104.5.95     4am-node12   <none>           <none>
fouram-93-8048-milvus-proxy-ffcdfffd9-8nw9l                       1/1     Running     1 (5m21s ago)   9m21s   10.104.20.128   4am-node22   <none>           <none>
fouram-93-8048-milvus-querycoord-55d96dfd6-drpnb                  1/1     Running     1 (5m21s ago)   9m21s   10.104.6.173    4am-node13   <none>           <none>
fouram-93-8048-milvus-querynode-78fb7c9876-n9gnt                  1/1     Running     1 (5m21s ago)   9m21s   10.104.24.32    4am-node29   <none>           <none>
fouram-93-8048-milvus-rootcoord-58bd4c8778-q625p                  1/1     Running     2 (110s ago)    9m21s   10.104.1.11     4am-node10   <none>           <none>
fouram-93-8048-minio-0                                            1/1     Running     0               9m21s   10.104.1.32     4am-node10   <none>           <none>
fouram-93-8048-minio-1                                            1/1     Running     0               9m21s   10.104.16.89    4am-node21   <none>           <none>
fouram-93-8048-minio-2                                            1/1     Running     0               9m21s   10.104.20.143   4am-node22   <none>           <none>
fouram-93-8048-minio-3                                            1/1     Running     0               9m20s   10.104.9.213    4am-node14   <none>           <none>
fouram-93-8048-pulsar-bookie-0                                    1/1     Running     0               9m21s   10.104.6.196    4am-node13   <none>           <none>
fouram-93-8048-pulsar-bookie-1                                    1/1     Running     0               9m20s   10.104.20.144   4am-node22   <none>           <none>
fouram-93-8048-pulsar-bookie-2                                    1/1     Running     0               9m20s   10.104.21.181   4am-node24   <none>           <none>
fouram-93-8048-pulsar-bookie-init-9t8nt                           0/1     Completed   0               9m21s   10.104.16.70    4am-node21   <none>           <none>
fouram-93-8048-pulsar-broker-0                                    1/1     Running     0               9m21s   10.104.1.12     4am-node10   <none>           <none>
fouram-93-8048-pulsar-proxy-0                                     1/1     Running     0               9m21s   10.104.23.91    4am-node27   <none>           <none>
fouram-93-8048-pulsar-pulsar-init-lb7dz                           0/1     Completed   0               9m21s   10.104.16.69    4am-node21   <none>           <none>
fouram-93-8048-pulsar-recovery-0                                  1/1     Running     0               9m21s   10.104.6.174    4am-node13   <none>           <none>
fouram-93-8048-pulsar-zookeeper-0                                 1/1     Running     0               9m21s   10.104.20.139   4am-node22   <none>           <none>
fouram-93-8048-pulsar-zookeeper-1                                 1/1     Running     0               5m20s   10.104.15.120   4am-node20   <none>           <none>
fouram-93-8048-pulsar-zookeeper-2                                 1/1     Running     0               3m51s   10.104.21.183   4am-node24   <none>           <none>

server (after):

fouram-93-8048-etcd-0                                             1/1     Running     0               18h     10.104.1.34     4am-node10   <none>           <none>
fouram-93-8048-etcd-1                                             1/1     Running     0               18h     10.104.16.91    4am-node21   <none>           <none>
fouram-93-8048-etcd-2                                             1/1     Running     0               18h     10.104.21.178   4am-node24   <none>           <none>
fouram-93-8048-milvus-datacoord-7595dbc5c4-kp887                  1/1     Running     1 (18h ago)     18h     10.104.5.96     4am-node12   <none>           <none>
fouram-93-8048-milvus-datanode-74b6bff58b-wxn5p                   1/1     Running     2 (18h ago)     18h     10.104.16.71    4am-node21   <none>           <none>
fouram-93-8048-milvus-indexcoord-6d5c55c6b5-zgf2p                 1/1     Running     0               18h     10.104.21.170   4am-node24   <none>           <none>
fouram-93-8048-milvus-indexnode-dc9dc4b9d-tmrjj                   1/1     Running     1 (18h ago)     18h     10.104.5.95     4am-node12   <none>           <none>
fouram-93-8048-milvus-proxy-ffcdfffd9-8nw9l                       1/1     Running     1 (18h ago)     18h     10.104.20.128   4am-node22   <none>           <none>
fouram-93-8048-milvus-querycoord-55d96dfd6-drpnb                  1/1     Running     1 (18h ago)     18h     10.104.6.173    4am-node13   <none>           <none>
fouram-93-8048-milvus-querynode-78fb7c9876-ggjl6                  1/1     Running     0               5m16s   10.104.15.13    4am-node20   <none>           <none>
fouram-93-8048-milvus-querynode-78fb7c9876-n9gnt                  0/1     Error       1               18h     10.104.24.32    4am-node29   <none>           <none>
fouram-93-8048-milvus-rootcoord-58bd4c8778-q625p                  1/1     Running     2 (18h ago)     18h     10.104.1.11     4am-node10   <none>           <none>
fouram-93-8048-minio-0                                            1/1     Running     0               18h     10.104.1.32     4am-node10   <none>           <none>
fouram-93-8048-minio-1                                            1/1     Running     0               18h     10.104.16.89    4am-node21   <none>           <none>
fouram-93-8048-minio-2                                            1/1     Running     0               18h     10.104.20.143   4am-node22   <none>           <none>
fouram-93-8048-minio-3                                            1/1     Running     0               18h     10.104.9.213    4am-node14   <none>           <none>
fouram-93-8048-pulsar-bookie-0                                    1/1     Running     0               18h     10.104.6.196    4am-node13   <none>           <none>
fouram-93-8048-pulsar-bookie-1                                    1/1     Running     0               18h     10.104.20.144   4am-node22   <none>           <none>
fouram-93-8048-pulsar-bookie-2                                    1/1     Running     0               18h     10.104.21.181   4am-node24   <none>           <none>
fouram-93-8048-pulsar-bookie-init-9t8nt                           0/1     Completed   0               18h     10.104.16.70    4am-node21   <none>           <none>
fouram-93-8048-pulsar-broker-0                                    1/1     Running     0               18h     10.104.1.12     4am-node10   <none>           <none>
fouram-93-8048-pulsar-proxy-0                                     1/1     Running     0               18h     10.104.23.91    4am-node27   <none>           <none>
fouram-93-8048-pulsar-pulsar-init-lb7dz                           0/1     Completed   0               18h     10.104.16.69    4am-node21   <none>           <none>
fouram-93-8048-pulsar-recovery-0                                  1/1     Running     0               18h     10.104.6.174    4am-node13   <none>           <none>
fouram-93-8048-pulsar-zookeeper-0                                 1/1     Running     0               18h     10.104.20.139   4am-node22   <none>           <none>
fouram-93-8048-pulsar-zookeeper-1                                 1/1     Running     0               18h     10.104.15.120   4am-node20   <none>           <none>
fouram-93-8048-pulsar-zookeeper-2                                 1/1     Running     0               18h     10.104.21.183   4am-node24   <none>           <none> 
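
In the listing above, the original querynode pod is in `Error` and a replacement querynode pod is only a few minutes old. A small script can flag such pods in pasted `kubectl get pods -o wide` output. This is a sketch with made-up helper names; the sample rows are copied from the listing above with column padding shortened, and the regular expression assumes the column order shown here (NAME, READY, STATUS, RESTARTS).

```python
import re
from typing import List, NamedTuple


class Pod(NamedTuple):
    name: str
    ready: int      # containers ready
    total: int      # containers expected
    status: str
    restarts: int


# NAME  READY(n/m)  STATUS  RESTARTS; the rest of the row is ignored.
# The RESTARTS column may look like "2 (18h ago)", so only the count is taken.
ROW = re.compile(r"^\s*(\S+)\s+(\d+)/(\d+)\s+(\S+)\s+(\d+)")


def parse_pods(text: str) -> List[Pod]:
    """Parse pod rows; header and unrelated lines are skipped."""
    pods = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            name, ready, total, status, restarts = match.groups()
            pods.append(Pod(name, int(ready), int(total), status, int(restarts)))
    return pods


def unhealthy(pods: List[Pod]) -> List[Pod]:
    """Pods whose status is neither Running nor Completed."""
    return [pod for pod in pods if pod.status not in ("Running", "Completed")]


SAMPLE = """\
fouram-93-8048-milvus-querynode-78fb7c9876-ggjl6  1/1  Running  0  5m16s  10.104.15.13  4am-node20  <none>  <none>
fouram-93-8048-milvus-querynode-78fb7c9876-n9gnt  0/1  Error  1  18h  10.104.24.32  4am-node29  <none>  <none>
fouram-93-8048-milvus-rootcoord-58bd4c8778-q625p  1/1  Running  2 (18h ago)  18h  10.104.1.11  4am-node10  <none>  <none>
"""

pods = parse_pods(SAMPLE)
for pod in unhealthy(pods):
    print(pod.name, pod.status, f"restarts={pod.restarts}")
```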

client error log:

[2023-06-13 09:52:05,789 -  INFO - fouram]: [Base] Start flush collection fouram_hWyIQzMw (base.py:277)
[2023-06-13 09:52:08,326 -  INFO - fouram]: [Base] Params of index: [{'float_vector': {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}}] (base.py:441)
[2023-06-13 09:52:08,326 -  INFO - fouram]: [Base] Start release collection fouram_hWyIQzMw (base.py:288)
[2023-06-13 09:52:08,328 -  INFO - fouram]: [Base] Start build index of DISKANN for collection fouram_hWyIQzMw, params:{'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}} (base.py:427)
[2023-06-14 02:53:51,385 -  INFO - fouram]: [Time] Index run in 61303.0546s (api_request.py:45)
[2023-06-14 02:53:51,385 -  INFO - fouram]: [CommonCases] RT of build index DISKANN: 61303.0546s (common_cases.py:96)
[2023-06-14 02:53:51,388 -  INFO - fouram]: [Base] Params of index: [{'float_vector': {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}}] (base.py:441)
[2023-06-14 02:53:51,388 -  INFO - fouram]: [CommonCases] Prepare index DISKANN done. (common_cases.py:99)
[2023-06-14 02:53:51,388 -  INFO - fouram]: [CommonCases] No scalars need to be indexed. (common_cases.py:107)
[2023-06-14 02:53:51,389 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_hWyIQzMw): 100000000 (base.py:468)
[2023-06-14 02:53:51,390 -  INFO - fouram]: [Base] Start load collection fouram_hWyIQzMw,replica_number:1,kwargs:{} (base.py:283)
[2023-06-14 03:03:54,674 - ERROR - fouram]: RPC error: [get_loading_progress], <MilvusException: (code=1, message=collection 442144025703350773 has not been loaded to memory or load failed)>, <Time:{'RPC start': '2023-06-14 03:03:54.672870', 'RPC error': '2023-06-14 03:03:54.674862'}> (decorators.py:108)
[2023-06-14 03:03:54,676 - ERROR - fouram]: RPC error: [wait_for_loading_collection], <MilvusException: (code=1, message=collection 442144025703350773 has not been loaded to memory or load failed)>, <Time:{'RPC start': '2023-06-14 02:53:51.410122', 'RPC error': '2023-06-14 03:03:54.676319'}> (decorators.py:108)
[2023-06-14 03:03:54,676 - ERROR - fouram]: RPC error: [load_collection], <MilvusException: (code=1, message=collection 442144025703350773 has not been loaded to memory or load failed)>, <Time:{'RPC start': '2023-06-14 02:53:51.390216', 'RPC error': '2023-06-14 03:03:54.676433'}> (decorators.py:108)
[2023-06-14 03:03:54,677 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=collection 442144025703350773 has not been loaded to memory or load failed)> (api_request.py:53)
[2023-06-14 03:03:54,678 - ERROR - fouram]: [CheckFunc] load request check failed, response:<MilvusException: (code=1, message=collection 442144025703350773 has not been loaded to memory or load failed)> (func_check.py:52)

elstic avatar Jun 14 '23 04:06 elstic

image:master-20230614-35cb0b5b

[2023-06-14 08:44:14,237 - INFO - fouram]: [check_params] scene_concurrent_locust required params: {'dataset_params': {'metric_type': 'L2', 'dim': 128, 'dataset_name': 'sift', 'dataset_size': '1m', 'ni_per': 50000}, 'collection_params': {'other_fields': []}, 'load_params': {}, 'query_params': {}, 'search_params': {}, 'index_params': {'index_type': 'DISKANN', 'index_param': {}}, 'concurrent_params': {'concurrent_number': [1, 20], 'during_time': 3600, 'interval': 20}, 'concurrent_tasks': [{'type': 'search', 'weight': 1, 'params': {'nq': 1, 'top_k': 1, 'search_param': {'search_list': 30}, 'random_data': True}}]} (params_check.py:31)

The reported error is different; please check whether it has the same cause.

server:


 I0614 08:59:37.102417     455 request.go:665] Waited for 1.167015255s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/argoproj.io/v1alpha1?timeout=32s
NAME                                                              READY   STATUS      RESTARTS      AGE     IP              NODE         NOMINATED NODE   READINESS GATES
perf-single-16831400-4-39-6861-etcd-0                             1/1     Running     0             27m     10.104.20.36    4am-node22   <none>           <none>
perf-single-16831400-4-39-6861-milvus-standalone-654b9bf55q8s78   1/1     Running     0             27m     10.104.23.219   4am-node27   <none>           <none>
perf-single-16831400-4-39-6861-minio-5f5cf8c85d-d2f8r             1/1     Running     0             27m     10.104.23.218   4am-node27   <none>           <none> (cli_client.py:131)

log


[2023-06-14 08:49:34,413 -  INFO - fouram]: [CommonCases] RT of build index DISKANN: 274.4136s (common_cases.py:96)
[2023-06-14 08:49:34,414 -  INFO - fouram]: [Base] Params of index: [{'float_vector': {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}}] (base.py:441)
[2023-06-14 08:49:34,414 -  INFO - fouram]: [CommonCases] Prepare index DISKANN done. (common_cases.py:99)
[2023-06-14 08:49:34,415 -  INFO - fouram]: [CommonCases] No scalars need to be indexed. (common_cases.py:107)
[2023-06-14 08:49:34,416 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_F9X74bJw): 1000000 (base.py:468)
[2023-06-14 08:49:34,416 -  INFO - fouram]: [Base] Start load collection fouram_F9X74bJw,replica_number:1,kwargs:{} (base.py:283)
[2023-06-14 08:59:35,630 -  INFO - fouram]: [Time] Collection.load run in 601.2136s (api_request.py:45)
[2023-06-14 08:59:35,635 -  INFO - fouram]: [Base] Describe resource group:__default_resource_group, ResourceGroupInfo:<name:__default_resource_group>,<capacity:1000000>,<num_available_node:1>,<num_loaded_replica:{'fouram_F9X74bJw':1}>,<num_outgoing_node:{}>,<num_incoming_node:{}> (base.py:642)
[2023-06-14 08:59:35,639 - ERROR - fouram]: RPC error: [get_replicas], <MilvusException: (code=15, message=failed to get replica info, err=failed to get channels, collection not loaded: collection=442166710995255856: collection not found)>, <Time:{'RPC start': '2023-06-14 08:59:35.635739', 'RPC error': '2023-06-14 08:59:35.639028'}> (decorators.py:108)
[2023-06-14 08:59:35,640 - ERROR - fouram]: (api_response) : <MilvusException: (code=15, message=failed to get replica info, err=failed to get channels, collection not loaded: collection=442166710995255856: collection not found)> (api_request.py:53)
[2023-06-14 08:59:35,640 - ERROR - fouram]: [CheckFunc] get_replicas request check failed, response:<MilvusException: (code=15, message=failed to get replica info, err=failed to get channels, collection not loaded: collection=442166710995255856: collection not found)> 

jingkl avatar Jun 14 '23 10:06 jingkl

release_name_prefix: perf-single-1686731400
deploy_config: fouramf-server-standalone-8c16m-disk
case_params: fouramf-client-gist1m-concurrent-diskann

image:master-20230614-35cb0b5b

[2023-06-14 08:40:24,911 - INFO - fouram]: [check_params] scene_concurrent_locust required params: {'dataset_params': {'metric_type': 'L2', 'dim': 768, 'dataset_name': 'gist', 'dataset_size': 1000000, 'ni_per': 1000}, 'collection_params': {'other_fields': []}, 'load_params': {}, 'query_params': {}, 'search_params': {}, 'index_params': {'index_type': 'DISKANN', 'index_param': {}}, 'concurrent_params': {'concurrent_number': [1, 20], 'during_time': 3600, 'interval': 20}, 'concurrent_tasks': [{'type': 'search', 'weight': 1, 'params': {'nq': 1, 'top_k': 1, 'search_param': {'search_list': 30}, 'random_data': True}}]} (params_check.py:31)

server:


NAME                                                              READY   STATUS      RESTARTS      AGE     IP              NODE         NOMINATED NODE   READINESS GATES
perf-single-16831400-3-83-2286-etcd-0                             1/1     Running     0             63m     10.104.16.166   4am-node21   <none>           <none>
perf-single-16831400-3-83-2286-milvus-standalone-68fb4b9c6b9mtg   1/1     Running     0             63m     10.104.24.153   4am-node29   <none>           <none>
perf-single-16831400-3-83-2286-minio-56fb848f49-lgq7c             1/1     Running     0             63m     10.104.4.253    4am-node11   <none>           <none>

log:

[2023-06-14 08:45:23,667 -  INFO - fouram]: [Base] Total time of insert: 224.2344s, average number of vector bars inserted per second: 4459.619, average time to insert 1000 vectors per time: 0.2242s (base.py:379)
[2023-06-14 08:45:23,667 -  INFO - fouram]: [Base] Start flush collection fouram_da8NtObO (base.py:277)
[2023-06-14 08:45:27,336 -  INFO - fouram]: [Base] Params of index: [{'float_vector': {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}}] (base.py:441)
[2023-06-14 08:45:27,336 -  INFO - fouram]: [Base] Start release collection fouram_da8NtObO (base.py:288)
[2023-06-14 08:45:27,338 -  INFO - fouram]: [Base] Start build index of DISKANN for collection fouram_da8NtObO, params:{'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}} (base.py:427)
[2023-06-14 09:25:21,209 -  INFO - fouram]: [Time] Index run in 2393.8687s (api_request.py:45)
[2023-06-14 09:25:21,212 -  INFO - fouram]: [CommonCases] RT of build index DISKANN: 2393.8687s (common_cases.py:96)
[2023-06-14 09:25:21,215 -  INFO - fouram]: [Base] Params of index: [{'float_vector': {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}}] (base.py:441)
[2023-06-14 09:25:21,215 -  INFO - fouram]: [CommonCases] Prepare index DISKANN done. (common_cases.py:99)
[2023-06-14 09:25:21,215 -  INFO - fouram]: [CommonCases] No scalars need to be indexed. (common_cases.py:107)
[2023-06-14 09:25:21,217 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_da8NtObO): 1000000 (base.py:468)
[2023-06-14 09:25:21,217 -  INFO - fouram]: [Base] Start load collection fouram_da8NtObO,replica_number:1,kwargs:{} (base.py:283)
[2023-06-14 09:35:24,250 - ERROR - fouram]: RPC error: [get_loading_progress], <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_da8NtObO)>, <Time:{'RPC start': '2023-06-14 09:35:24.248721', 'RPC error': '2023-06-14 09:35:24.250173'}> (decorators.py:108)
[2023-06-14 09:35:24,252 - ERROR - fouram]: RPC error: [wait_for_loading_collection], <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_da8NtObO)>, <Time:{'RPC start': '2023-06-14 09:25:21.227046', 'RPC error': '2023-06-14 09:35:24.251987'}> (decorators.py:108)
[2023-06-14 09:35:24,252 - ERROR - fouram]: RPC error: [load_collection], <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_da8NtObO)>, <Time:{'RPC start': '2023-06-14 09:25:21.217910', 'RPC error': '2023-06-14 09:35:24.252142'}> (decorators.py:108)
[2023-06-14 09:35:24,254 - ERROR - fouram]: (api_response) : <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_da8NtObO)> (api_request.py:53)
[2023-06-14 09:35:24,254 - ERROR - fouram]: [CheckFunc] load request check failed, response:<MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_da8NtObO)> (func_check.py:52)
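
For context on the "insufficient memory" message, the raw size of the vectors in this case can be estimated from the case parameters above (dim 768, 1000000 rows). This is a back-of-the-envelope sketch that assumes 4-byte float32 components, based on the field name `float_vector`; it says nothing about how much memory Milvus or DiskANN actually needs to load the index, which the thread does not describe.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32


def raw_vector_bytes(num_vectors: int, dim: int) -> int:
    """Raw size of the vector data alone, without index or metadata overhead."""
    return num_vectors * dim * BYTES_PER_FLOAT32


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB."""
    return num_bytes / 2**30


# Values taken from the case parameters quoted in this comment and the previous one.
gist_1m = raw_vector_bytes(1_000_000, 768)   # gist, dim 768
sift_1m = raw_vector_bytes(1_000_000, 128)   # sift, dim 128

print(f"gist 1m raw vectors: {to_gib(gist_1m):.2f} GiB")
print(f"sift 1m raw vectors: {to_gib(sift_1m):.2f} GiB")
```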

jingkl avatar Jun 14 '23 10:06 jingkl

The issue arises again image: master-20230504-e172f3e8

server:

fouram-93-8048-etcd-0                                             1/1     Running     0               9m21s   10.104.1.34     4am-node10   <none>           <none>
fouram-93-8048-etcd-1                                             1/1     Running     0               9m21s   10.104.16.91    4am-node21   <none>           <none>
fouram-93-8048-etcd-2                                             1/1     Running     0               9m21s   10.104.21.178   4am-node24   <none>           <none>
fouram-93-8048-milvus-datacoord-7595dbc5c4-kp887                  1/1     Running     1 (5m21s ago)   9m21s   10.104.5.96     4am-node12   <none>           <none>
fouram-93-8048-milvus-datanode-74b6bff58b-wxn5p                   1/1     Running     2 (110s ago)    9m21s   10.104.16.71    4am-node21   <none>           <none>
fouram-93-8048-milvus-indexcoord-6d5c55c6b5-zgf2p                 1/1     Running     0               9m21s   10.104.21.170   4am-node24   <none>           <none>
fouram-93-8048-milvus-indexnode-dc9dc4b9d-tmrjj                   1/1     Running     1 (5m21s ago)   9m21s   10.104.5.95     4am-node12   <none>           <none>
fouram-93-8048-milvus-proxy-ffcdfffd9-8nw9l                       1/1     Running     1 (5m21s ago)   9m21s   10.104.20.128   4am-node22   <none>           <none>
fouram-93-8048-milvus-querycoord-55d96dfd6-drpnb                  1/1     Running     1 (5m21s ago)   9m21s   10.104.6.173    4am-node13   <none>           <none>
fouram-93-8048-milvus-querynode-78fb7c9876-n9gnt                  1/1     Running     1 (5m21s ago)   9m21s   10.104.24.32    4am-node29   <none>           <none>
fouram-93-8048-milvus-rootcoord-58bd4c8778-q625p                  1/1     Running     2 (110s ago)    9m21s   10.104.1.11     4am-node10   <none>           <none>
fouram-93-8048-minio-0                                            1/1     Running     0               9m21s   10.104.1.32     4am-node10   <none>           <none>
fouram-93-8048-minio-1                                            1/1     Running     0               9m21s   10.104.16.89    4am-node21   <none>           <none>
fouram-93-8048-minio-2                                            1/1     Running     0               9m21s   10.104.20.143   4am-node22   <none>           <none>
fouram-93-8048-minio-3                                            1/1     Running     0               9m20s   10.104.9.213    4am-node14   <none>           <none>
fouram-93-8048-pulsar-bookie-0                                    1/1     Running     0               9m21s   10.104.6.196    4am-node13   <none>           <none>
fouram-93-8048-pulsar-bookie-1                                    1/1     Running     0               9m20s   10.104.20.144   4am-node22   <none>           <none>
fouram-93-8048-pulsar-bookie-2                                    1/1     Running     0               9m20s   10.104.21.181   4am-node24   <none>           <none>
fouram-93-8048-pulsar-bookie-init-9t8nt                           0/1     Completed   0               9m21s   10.104.16.70    4am-node21   <none>           <none>
fouram-93-8048-pulsar-broker-0                                    1/1     Running     0               9m21s   10.104.1.12     4am-node10   <none>           <none>
fouram-93-8048-pulsar-proxy-0                                     1/1     Running     0               9m21s   10.104.23.91    4am-node27   <none>           <none>
fouram-93-8048-pulsar-pulsar-init-lb7dz                           0/1     Completed   0               9m21s   10.104.16.69    4am-node21   <none>           <none>
fouram-93-8048-pulsar-recovery-0                                  1/1     Running     0               9m21s   10.104.6.174    4am-node13   <none>           <none>
fouram-93-8048-pulsar-zookeeper-0                                 1/1     Running     0               9m21s   10.104.20.139   4am-node22   <none>           <none>
fouram-93-8048-pulsar-zookeeper-1                                 1/1     Running     0               5m20s   10.104.15.120   4am-node20   <none>           <none>
fouram-93-8048-pulsar-zookeeper-2                                 1/1     Running     0               3m51s   10.104.21.183   4am-node24   <none>           <none>

server (after):

fouram-93-8048-etcd-0                                             1/1     Running     0               18h     10.104.1.34     4am-node10   <none>           <none>
fouram-93-8048-etcd-1                                             1/1     Running     0               18h     10.104.16.91    4am-node21   <none>           <none>
fouram-93-8048-etcd-2                                             1/1     Running     0               18h     10.104.21.178   4am-node24   <none>           <none>
fouram-93-8048-milvus-datacoord-7595dbc5c4-kp887                  1/1     Running     1 (18h ago)     18h     10.104.5.96     4am-node12   <none>           <none>
fouram-93-8048-milvus-datanode-74b6bff58b-wxn5p                   1/1     Running     2 (18h ago)     18h     10.104.16.71    4am-node21   <none>           <none>
fouram-93-8048-milvus-indexcoord-6d5c55c6b5-zgf2p                 1/1     Running     0               18h     10.104.21.170   4am-node24   <none>           <none>
fouram-93-8048-milvus-indexnode-dc9dc4b9d-tmrjj                   1/1     Running     1 (18h ago)     18h     10.104.5.95     4am-node12   <none>           <none>
fouram-93-8048-milvus-proxy-ffcdfffd9-8nw9l                       1/1     Running     1 (18h ago)     18h     10.104.20.128   4am-node22   <none>           <none>
fouram-93-8048-milvus-querycoord-55d96dfd6-drpnb                  1/1     Running     1 (18h ago)     18h     10.104.6.173    4am-node13   <none>           <none>
fouram-93-8048-milvus-querynode-78fb7c9876-ggjl6                  1/1     Running     0               5m16s   10.104.15.13    4am-node20   <none>           <none>
fouram-93-8048-milvus-querynode-78fb7c9876-n9gnt                  0/1     Error       1               18h     10.104.24.32    4am-node29   <none>           <none>
fouram-93-8048-milvus-rootcoord-58bd4c8778-q625p                  1/1     Running     2 (18h ago)     18h     10.104.1.11     4am-node10   <none>           <none>
fouram-93-8048-minio-0                                            1/1     Running     0               18h     10.104.1.32     4am-node10   <none>           <none>
fouram-93-8048-minio-1                                            1/1     Running     0               18h     10.104.16.89    4am-node21   <none>           <none>
fouram-93-8048-minio-2                                            1/1     Running     0               18h     10.104.20.143   4am-node22   <none>           <none>
fouram-93-8048-minio-3                                            1/1     Running     0               18h     10.104.9.213    4am-node14   <none>           <none>
fouram-93-8048-pulsar-bookie-0                                    1/1     Running     0               18h     10.104.6.196    4am-node13   <none>           <none>
fouram-93-8048-pulsar-bookie-1                                    1/1     Running     0               18h     10.104.20.144   4am-node22   <none>           <none>
fouram-93-8048-pulsar-bookie-2                                    1/1     Running     0               18h     10.104.21.181   4am-node24   <none>           <none>
fouram-93-8048-pulsar-bookie-init-9t8nt                           0/1     Completed   0               18h     10.104.16.70    4am-node21   <none>           <none>
fouram-93-8048-pulsar-broker-0                                    1/1     Running     0               18h     10.104.1.12     4am-node10   <none>           <none>
fouram-93-8048-pulsar-proxy-0                                     1/1     Running     0               18h     10.104.23.91    4am-node27   <none>           <none>
fouram-93-8048-pulsar-pulsar-init-lb7dz                           0/1     Completed   0               18h     10.104.16.69    4am-node21   <none>           <none>
fouram-93-8048-pulsar-recovery-0                                  1/1     Running     0               18h     10.104.6.174    4am-node13   <none>           <none>
fouram-93-8048-pulsar-zookeeper-0                                 1/1     Running     0               18h     10.104.20.139   4am-node22   <none>           <none>
fouram-93-8048-pulsar-zookeeper-1                                 1/1     Running     0               18h     10.104.15.120   4am-node20   <none>           <none>
fouram-93-8048-pulsar-zookeeper-2                                 1/1     Running     0               18h     10.104.21.183   4am-node24   <none>           <none> 

client error log:

[2023-06-13 09:52:05,789 -  INFO - fouram]: [Base] Start flush collection fouram_hWyIQzMw (base.py:277)
[2023-06-13 09:52:08,326 -  INFO - fouram]: [Base] Params of index: [{'float_vector': {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}}] (base.py:441)
[2023-06-13 09:52:08,326 -  INFO - fouram]: [Base] Start release collection fouram_hWyIQzMw (base.py:288)
[2023-06-13 09:52:08,328 -  INFO - fouram]: [Base] Start build index of DISKANN for collection fouram_hWyIQzMw, params:{'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}} (base.py:427)
[2023-06-14 02:53:51,385 -  INFO - fouram]: [Time] Index run in 61303.0546s (api_request.py:45)
[2023-06-14 02:53:51,385 -  INFO - fouram]: [CommonCases] RT of build index DISKANN: 61303.0546s (common_cases.py:96)
[2023-06-14 02:53:51,388 -  INFO - fouram]: [Base] Params of index: [{'float_vector': {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}}] (base.py:441)
[2023-06-14 02:53:51,388 -  INFO - fouram]: [CommonCases] Prepare index DISKANN done. (common_cases.py:99)
[2023-06-14 02:53:51,388 -  INFO - fouram]: [CommonCases] No scalars need to be indexed. (common_cases.py:107)
[2023-06-14 02:53:51,389 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_hWyIQzMw): 100000000 (base.py:468)
[2023-06-14 02:53:51,390 -  INFO - fouram]: [Base] Start load collection fouram_hWyIQzMw,replica_number:1,kwargs:{} (base.py:283)
[2023-06-14 03:03:54,674 - ERROR - fouram]: RPC error: [get_loading_progress], <MilvusException: (code=1, message=collection 442144025703350773 has not been loaded to memory or load failed)>, <Time:{'RPC start': '2023-06-14 03:03:54.672870', 'RPC error': '2023-06-14 03:03:54.674862'}> (decorators.py:108)
[2023-06-14 03:03:54,676 - ERROR - fouram]: RPC error: [wait_for_loading_collection], <MilvusException: (code=1, message=collection 442144025703350773 has not been loaded to memory or load failed)>, <Time:{'RPC start': '2023-06-14 02:53:51.410122', 'RPC error': '2023-06-14 03:03:54.676319'}> (decorators.py:108)
[2023-06-14 03:03:54,676 - ERROR - fouram]: RPC error: [load_collection], <MilvusException: (code=1, message=collection 442144025703350773 has not been loaded to memory or load failed)>, <Time:{'RPC start': '2023-06-14 02:53:51.390216', 'RPC error': '2023-06-14 03:03:54.676433'}> (decorators.py:108)
[2023-06-14 03:03:54,677 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=collection 442144025703350773 has not been loaded to memory or load failed)> (api_request.py:53)
[2023-06-14 03:03:54,678 - ERROR - fouram]: [CheckFunc] load request check failed, response:<MilvusException: (code=1, message=collection 442144025703350773 has not been loaded to memory or load failed)> (func_check.py:52)

No panic; the cluster rebooted because it lost its connection to Pulsar.

yah01 avatar Jun 15 '23 03:06 yah01
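For anyone triaging similar runs, the restart evidence in the pod listings can be pulled out mechanically. The following is only a sketch and not part of the fouram tooling: it assumes the listing is plain text in the column layout shown above (which appears to be `kubectl get pods -o wide` output), and `suspicious_pods` is a hypothetical helper name. It flags pods whose RESTARTS count is non-zero or whose STATUS is neither Running nor Completed.

```python
import re

# NAME, READY (n/m), STATUS, RESTARTS with an optional "(… ago)" suffix, AGE.
POD_LINE = re.compile(
    r"^(?P<name>\S+)\s+(?P<ready>\d+)/(?P<total>\d+)\s+(?P<status>\S+)\s+"
    r"(?P<restarts>\d+)(?:\s+\([^)]*\))?\s+(?P<age>\S+)"
)

# Three rows copied from the "server (after)" listing in this thread.
SAMPLE_LISTING = """\
fouram-93-8048-milvus-querynode-78fb7c9876-ggjl6                  1/1     Running     0               5m16s   10.104.15.13    4am-node20   <none>           <none>
fouram-93-8048-milvus-querynode-78fb7c9876-n9gnt                  0/1     Error       1               18h     10.104.24.32    4am-node29   <none>           <none>
fouram-93-8048-milvus-rootcoord-58bd4c8778-q625p                  1/1     Running     2 (18h ago)     18h     10.104.1.11     4am-node10   <none>           <none>
"""


def suspicious_pods(listing):
    """Return (name, status, restarts) for pods that restarted or are unhealthy."""
    flagged = []
    for line in listing.splitlines():
        match = POD_LINE.match(line.strip())
        if match is None:
            continue  # header row, blank line, or unrecognised format
        restarts = int(match["restarts"])
        status = match["status"]
        if restarts > 0 or status not in ("Running", "Completed"):
            flagged.append((match["name"], status, restarts))
    return flagged


if __name__ == "__main__":
    for name, status, restarts in suspicious_pods(SAMPLE_LISTING):
        print(f"{name}: status={status}, restarts={restarts}")
```

On the sample rows this reports the querynode in `Error` and the restarted rootcoord, while skipping the freshly scheduled querynode.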

release_name_prefix: perf-single-1686731400, deploy_config: fouramf-server-standalone-8c16m-disk, case_params: fouramf-client-gist1m-concurrent-diskann

image: master-20230614-35cb0b5b

[2023-06-14 08:40:24,911 - INFO - fouram]: [check_params] scene_concurrent_locust required params: {'dataset_params': {'metric_type': 'L2', 'dim': 768, 'dataset_name': 'gist', 'dataset_size': 1000000, 'ni_per': 1000}, 'collection_params': {'other_fields': []}, 'load_params': {}, 'query_params': {}, 'search_params': {}, 'index_params': {'index_type': 'DISKANN', 'index_param': {}}, 'concurrent_params': {'concurrent_number': [1, 20], 'during_time': 3600, 'interval': 20}, 'concurrent_tasks': [{'type': 'search', 'weight': 1, 'params': {'nq': 1, 'top_k': 1, 'search_param': {'search_list': 30}, 'random_data': True}}]} (params_check.py:31)

server:


NAME                                                              READY   STATUS      RESTARTS      AGE     IP              NODE         NOMINATED NODE   READINESS GATES
perf-single-16831400-3-83-2286-etcd-0                             1/1     Running     0             63m     10.104.16.166   4am-node21   <none>           <none>
perf-single-16831400-3-83-2286-milvus-standalone-68fb4b9c6b9mtg   1/1     Running     0             63m     10.104.24.153   4am-node29   <none>           <none>
perf-single-16831400-3-83-2286-minio-56fb848f49-lgq7c             1/1     Running     0             63m     10.104.4.253    4am-node11   <none>           <none>

log:

[2023-06-14 08:45:23,667 -  INFO - fouram]: [Base] Total time of insert: 224.2344s, average number of vector bars inserted per second: 4459.619, average time to insert 1000 vectors per time: 0.2242s (base.py:379)
[2023-06-14 08:45:23,667 -  INFO - fouram]: [Base] Start flush collection fouram_da8NtObO (base.py:277)
[2023-06-14 08:45:27,336 -  INFO - fouram]: [Base] Params of index: [{'float_vector': {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}}] (base.py:441)
[2023-06-14 08:45:27,336 -  INFO - fouram]: [Base] Start release collection fouram_da8NtObO (base.py:288)
[2023-06-14 08:45:27,338 -  INFO - fouram]: [Base] Start build index of DISKANN for collection fouram_da8NtObO, params:{'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}} (base.py:427)
[2023-06-14 09:25:21,209 -  INFO - fouram]: [Time] Index run in 2393.8687s (api_request.py:45)
[2023-06-14 09:25:21,212 -  INFO - fouram]: [CommonCases] RT of build index DISKANN: 2393.8687s (common_cases.py:96)
[2023-06-14 09:25:21,215 -  INFO - fouram]: [Base] Params of index: [{'float_vector': {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}}] (base.py:441)
[2023-06-14 09:25:21,215 -  INFO - fouram]: [CommonCases] Prepare index DISKANN done. (common_cases.py:99)
[2023-06-14 09:25:21,215 -  INFO - fouram]: [CommonCases] No scalars need to be indexed. (common_cases.py:107)
[2023-06-14 09:25:21,217 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_da8NtObO): 1000000 (base.py:468)
[2023-06-14 09:25:21,217 -  INFO - fouram]: [Base] Start load collection fouram_da8NtObO,replica_number:1,kwargs:{} (base.py:283)
[2023-06-14 09:35:24,250 - ERROR - fouram]: RPC error: [get_loading_progress], <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_da8NtObO)>, <Time:{'RPC start': '2023-06-14 09:35:24.248721', 'RPC error': '2023-06-14 09:35:24.250173'}> (decorators.py:108)
[2023-06-14 09:35:24,252 - ERROR - fouram]: RPC error: [wait_for_loading_collection], <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_da8NtObO)>, <Time:{'RPC start': '2023-06-14 09:25:21.227046', 'RPC error': '2023-06-14 09:35:24.251987'}> (decorators.py:108)
[2023-06-14 09:35:24,252 - ERROR - fouram]: RPC error: [load_collection], <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_da8NtObO)>, <Time:{'RPC start': '2023-06-14 09:25:21.217910', 'RPC error': '2023-06-14 09:35:24.252142'}> (decorators.py:108)
[2023-06-14 09:35:24,254 - ERROR - fouram]: (api_response) : <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_da8NtObO)> (api_request.py:53)
[2023-06-14 09:35:24,254 - ERROR - fouram]: [CheckFunc] load request check failed, response:<MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_da8NtObO)> (func_check.py:52)

This is not the same issue; it ran out of memory. Maybe #24374 would help.

yah01 avatar Jun 15 '23 03:06 yah01
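The two failures quoted in this thread are distinguishable in the client log by the exception code: the original report shows `code=1` ("has not been loaded to memory or load failed"), while the gist1m run shows `code=52` ("deny to load, insufficient memory"). As a rough sketch for sorting such logs, the snippet below extracts the code and message from each `MilvusException` entry. The helper name `failure_codes` is hypothetical, and the regular expression only assumes the log format seen in this thread.

```python
import re

# Matches the exception rendering used in the fouram client logs above.
EXC = re.compile(
    r"<MilvusException: \(code=(?P<code>\d+), message=(?P<message>.*?)\)>"
)

# Two lines copied from the client logs in this thread.
SAMPLE_LOG = (
    "[2023-06-14 03:03:54,677 - ERROR - fouram]: (api_response) : "
    "<MilvusException: (code=1, message=collection 442144025703350773 "
    "has not been loaded to memory or load failed)> (api_request.py:53)\n"
    "[2023-06-14 09:35:24,254 - ERROR - fouram]: (api_response) : "
    "<MilvusException: (code=52, message=deny to load, insufficient memory, "
    "please allocate more resources, collectionName: fouram_da8NtObO)> "
    "(api_request.py:53)\n"
)


def failure_codes(log_text):
    """Map each exception code found in the log to its first message."""
    seen = {}
    for match in EXC.finditer(log_text):
        seen.setdefault(int(match["code"]), match["message"])
    return seen


if __name__ == "__main__":
    for code, message in sorted(failure_codes(SAMPLE_LOG).items()):
        print(code, "->", message)
```

Grouping by code makes it easier to tell whether a new failure belongs to this issue or to the separate out-of-memory case.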

[UnexpectedError] Assert "row_count > 0" errors reported

yah01 avatar Jun 15 '23 03:06 yah01

/assign @cqy123456 Knowhere's behavior changed; @cqy123456 is fixing this.

yah01 avatar Jun 15 '23 03:06 yah01

https://github.com/milvus-io/milvus/pull/24898

yah01 avatar Jun 15 '23 04:06 yah01

/assign @elstic fixed with #24898

yah01 avatar Jun 19 '23 03:06 yah01

Loading a collection with 100 million rows still fails.

image: master-20230620-247f1170 case : test_concurrent_locust_100m_diskann_ddl_dql_filter_cluster

server:

fouramf-x8wsv-92-7100-etcd-0                                      1/1     Running            0               18h    10.104.23.251   4am-node27   <none>           <none>
fouramf-x8wsv-92-7100-etcd-1                                      1/1     Running            0               18h    10.104.17.29    4am-node23   <none>           <none>
fouramf-x8wsv-92-7100-etcd-2                                      1/1     Running            0               18h    10.104.21.134   4am-node24   <none>           <none>
fouramf-x8wsv-92-7100-milvus-datacoord-8f5f9bcf5-bvhj7            1/1     Running            0               18h    10.104.21.127   4am-node24   <none>           <none>
fouramf-x8wsv-92-7100-milvus-datanode-5c758bb86c-7qq5v            1/1     Running            0               18h    10.104.23.243   4am-node27   <none>           <none>
fouramf-x8wsv-92-7100-milvus-indexcoord-8656d87f6b-8ndmc          1/1     Running            0               18h    10.104.15.131   4am-node20   <none>           <none>
fouramf-x8wsv-92-7100-milvus-indexnode-b58f5dd77-m7f7j            1/1     Running            0               18h    10.104.15.132   4am-node20   <none>           <none>
fouramf-x8wsv-92-7100-milvus-proxy-7868b75c64-7btb2               1/1     Running            0               18h    10.104.20.67    4am-node22   <none>           <none>
fouramf-x8wsv-92-7100-milvus-querycoord-8b4796845-rch85           1/1     Running            0               18h    10.104.21.126   4am-node24   <none>           <none>
fouramf-x8wsv-92-7100-milvus-querynode-86b56d4999-nwjqb           1/1     Running            0               10m    10.104.1.96     4am-node10   <none>           <none>
fouramf-x8wsv-92-7100-milvus-querynode-86b56d4999-swbvs           0/1     Completed          0               18h    10.104.20.68    4am-node22   <none>           <none>
fouramf-x8wsv-92-7100-milvus-rootcoord-85ffcddfd7-qnnzd           1/1     Running            0               18h    10.104.20.66    4am-node22   <none>           <none>
fouramf-x8wsv-92-7100-minio-0                                     1/1     Running            0               18h    10.104.23.247   4am-node27   <none>           <none>
fouramf-x8wsv-92-7100-minio-1                                     1/1     Running            0               18h    10.104.17.30    4am-node23   <none>           <none>
fouramf-x8wsv-92-7100-minio-2                                     1/1     Running            0               18h    10.104.21.133   4am-node24   <none>           <none>
fouramf-x8wsv-92-7100-minio-3                                     1/1     Running            0               18h    10.104.20.70    4am-node22   <none>           <none>
fouramf-x8wsv-92-7100-pulsar-bookie-0                             1/1     Running            0               18h    10.104.21.129   4am-node24   <none>           <none>
fouramf-x8wsv-92-7100-pulsar-bookie-1                             1/1     Running            0               18h    10.104.23.252   4am-node27   <none>           <none>
fouramf-x8wsv-92-7100-pulsar-bookie-2                             1/1     Running            0               18h    10.104.17.31    4am-node23   <none>           <none>
fouramf-x8wsv-92-7100-pulsar-bookie-init-94m89                    0/1     Completed          0               18h    10.104.21.125   4am-node24   <none>           <none>
fouramf-x8wsv-92-7100-pulsar-broker-0                             1/1     Running            0               18h    10.104.23.244   4am-node27   <none>           <none>
fouramf-x8wsv-92-7100-pulsar-proxy-0                              1/1     Running            0               18h    10.104.15.133   4am-node20   <none>           <none>
fouramf-x8wsv-92-7100-pulsar-pulsar-init-5vbmm                    0/1     Completed          0               18h    10.104.23.245   4am-node27   <none>           <none>
fouramf-x8wsv-92-7100-pulsar-recovery-0                           1/1     Running            0               18h    10.104.15.134   4am-node20   <none>           <none>
fouramf-x8wsv-92-7100-pulsar-zookeeper-0                          1/1     Running            0               18h    10.104.21.130   4am-node24   <none>           <none>
fouramf-x8wsv-92-7100-pulsar-zookeeper-1                          1/1     Running            0               18h    10.104.16.108   4am-node21   <none>           <none>
fouramf-x8wsv-92-7100-pulsar-zookeeper-2                          1/1     Running            0               18h    10.104.23.254   4am-node27   <none>           <none> 

client error log:

[2023-06-20 07:43:06,497 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_6QRP8asl): 99900000 (base.py:468)
[2023-06-20 07:43:06,539 -  INFO - fouram]: [Base] Total time of insert: 2554.1683s, average number of vector bars inserted per second: 39151.6879, average time to insert 50000 vectors per time: 1.2771s (base.py:379)
[2023-06-20 07:43:06,540 -  INFO - fouram]: [Base] Start flush collection fouram_6QRP8asl (base.py:277)
[2023-06-20 07:43:09,565 -  INFO - fouram]: [Base] Params of index: [{'float_vector': {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}}] (base.py:441)
[2023-06-20 07:43:09,565 -  INFO - fouram]: [Base] Start release collection fouram_6QRP8asl (base.py:288)
[2023-06-20 07:43:09,567 -  INFO - fouram]: [Base] Start build index of DISKANN for collection fouram_6QRP8asl, params:{'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}} (base.py:427)
[2023-06-21 00:53:02,185 -  INFO - fouram]: [Time] Index run in 61792.6171s (api_request.py:45)
[2023-06-21 00:53:02,186 -  INFO - fouram]: [CommonCases] RT of build index DISKANN: 61792.6171s (common_cases.py:96)
[2023-06-21 00:53:02,188 -  INFO - fouram]: [Base] Params of index: [{'float_vector': {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}}] (base.py:441)
[2023-06-21 00:53:02,189 -  INFO - fouram]: [CommonCases] Prepare index DISKANN done. (common_cases.py:99)
[2023-06-21 00:53:02,189 -  INFO - fouram]: [CommonCases] No scalars need to be indexed. (common_cases.py:107)
[2023-06-21 00:53:02,190 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_6QRP8asl): 100000000 (base.py:468)
[2023-06-21 00:53:02,190 -  INFO - fouram]: [Base] Start load collection fouram_6QRP8asl,replica_number:1,kwargs:{} (base.py:283)
[2023-06-21 01:07:47,581 - ERROR - fouram]: RPC error: [get_loading_progress], <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_6QRP8asl)>, <Time:{'RPC start': '2023-06-21 01:07:47.578558', 'RPC error': '2023-06-21 01:07:47.581678'}> (decorators.py:108)
[2023-06-21 01:07:47,583 - ERROR - fouram]: RPC error: [wait_for_loading_collection], <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_6QRP8asl)>, <Time:{'RPC start': '2023-06-21 00:53:02.215572', 'RPC error': '2023-06-21 01:07:47.583758'}> (decorators.py:108)
[2023-06-21 01:07:47,583 - ERROR - fouram]: RPC error: [load_collection], <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_6QRP8asl)>, <Time:{'RPC start': '2023-06-21 00:53:02.191111', 'RPC error': '2023-06-21 01:07:47.583882'}> (decorators.py:108)
[2023-06-21 01:07:47,585 - ERROR - fouram]: (api_response) : <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_6QRP8asl)> (api_request.py:53)
[2023-06-21 01:07:47,585 - ERROR - fouram]: [CheckFunc] load request check failed, response:<MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_6QRP8asl)> (func_check.py:52)

elstic avatar Jun 21 '23 01:06 elstic

/assign @elstic Please try with #25469

yah01 avatar Jul 11 '23 06:07 yah01

@yah01

With DiskANN, load failed after inserting 100k rows. case: test_concurrent_locust_diskann_compaction_standalone image: master-20230711-70c4ddc6

client log:

[2023-07-11 20:07:16,394 -  INFO - fouram]: [Base] Start inserting, ids: 50000 - 99999, data size: 100,000 (base.py:309)
[2023-07-11 20:07:17,990 -  INFO - fouram]: [Time] Collection.insert run in 1.5948s (api_request.py:45)
[2023-07-11 20:07:17,992 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_5RfI3j1P): 0 (base.py:469)
[2023-07-11 20:07:18,083 -  INFO - fouram]: [Base] Total time of insert: 3.2848s, average number of vector bars inserted per second: 30443.2538, average time to insert 50000 vectors per time: 1.6424s (base.py:380)
[2023-07-11 20:07:18,085 -  INFO - fouram]: [Base] Start flush collection fouram_5RfI3j1P (base.py:278)
[2023-07-11 20:07:20,604 -  INFO - fouram]: [Base] Params of index: [{'float_vector': {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}}] (base.py:442)
[2023-07-11 20:07:20,604 -  INFO - fouram]: [Base] Start release collection fouram_5RfI3j1P (base.py:289)
[2023-07-11 20:07:20,606 -  INFO - fouram]: [Base] Start build index of DISKANN for collection fouram_5RfI3j1P, params:{'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}} (base.py:428)
[2023-07-11 20:07:41,965 -  INFO - fouram]: [Time] Index run in 21.358s (api_request.py:45)
[2023-07-11 20:07:41,965 -  INFO - fouram]: [CommonCases] RT of build index DISKANN: 21.358s (common_cases.py:96)
[2023-07-11 20:07:41,967 -  INFO - fouram]: [Base] Params of index: [{'float_vector': {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}}] (base.py:442)
[2023-07-11 20:07:41,967 -  INFO - fouram]: [CommonCases] Prepare index DISKANN done. (common_cases.py:99)
[2023-07-11 20:07:41,967 -  INFO - fouram]: [CommonCases] No scalars need to be indexed. (common_cases.py:107)
[2023-07-11 20:07:41,968 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_5RfI3j1P): 100000 (base.py:469)
[2023-07-11 20:07:41,968 -  INFO - fouram]: [Base] Start load collection fouram_5RfI3j1P,replica_number:1,kwargs:{} (base.py:284)
[2023-07-11 20:17:44,976 - ERROR - fouram]: RPC error: [get_loading_progress], <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_5RfI3j1P)>, <Time:{'RPC start': '2023-07-11 20:17:44.974639', 'RPC error': '2023-07-11 20:17:44.976785'}> (decorators.py:108)
[2023-07-11 20:17:44,978 - ERROR - fouram]: RPC error: [wait_for_loading_collection], <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_5RfI3j1P)>, <Time:{'RPC start': '2023-07-11 20:07:41.980907', 'RPC error': '2023-07-11 20:17:44.978523'}> (decorators.py:108)
[2023-07-11 20:17:44,978 - ERROR - fouram]: RPC error: [load_collection], <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_5RfI3j1P)>, <Time:{'RPC start': '2023-07-11 20:07:41.969109', 'RPC error': '2023-07-11 20:17:44.978801'}> (decorators.py:108)
[2023-07-11 20:17:44,980 - ERROR - fouram]: (api_response) : <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_5RfI3j1P)> (api_request.py:53)
[2023-07-11 20:17:44,980 - ERROR - fouram]: [CheckFunc] load request check failed, response:<MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_5RfI3j1P)> (func_check.py:52)

server:

fouramf-stable-05600-4-1-6061-etcd-0                              1/1     Running     0               16m     10.104.6.8      4am-node13   <none>           <none>
fouramf-stable-05600-4-1-6061-milvus-standalone-79d74db499n8scp   1/1     Running     0               16m     10.104.19.216   4am-node28   <none>           <none>
fouramf-stable-05600-4-1-6061-minio-9f8bb4794-nwqdz               1/1     Running     0               16m     10.104.4.21     4am-node11   <none>           <none>

elstic avatar Jul 12 '23 02:07 elstic
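In several of the logs quoted in this thread, the failed `load_collection` call returns roughly ten minutes after it starts (here 20:07:41 to 20:17:44). Whether that reflects a timeout is not confirmed anywhere in the thread. The sketch below computes the elapsed time from the `RPC start` and `RPC error` timestamps that the client prints. `rpc_durations` is a hypothetical helper, and the timestamp format is assumed from the log lines above.

```python
import re
from datetime import datetime

# Pulls the RPC name and both timestamps out of an "RPC error" log line.
RPC = re.compile(
    r"RPC error: \[(?P<rpc>\w+)\].*?"
    r"'RPC start': '(?P<start>[^']+)', 'RPC error': '(?P<end>[^']+)'"
)
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

# One line copied from the client log in this thread.
SAMPLE_LINE = (
    "[2023-07-11 20:17:44,978 - ERROR - fouram]: RPC error: [load_collection], "
    "<MilvusException: (code=52, message=deny to load, insufficient memory, "
    "please allocate more resources, collectionName: fouram_5RfI3j1P)>, "
    "<Time:{'RPC start': '2023-07-11 20:07:41.969109', "
    "'RPC error': '2023-07-11 20:17:44.978801'}> (decorators.py:108)"
)


def rpc_durations(log_text):
    """Return {rpc_name: seconds between 'RPC start' and 'RPC error'}."""
    durations = {}
    for match in RPC.finditer(log_text):
        start = datetime.strptime(match["start"], TIME_FORMAT)
        end = datetime.strptime(match["end"], TIME_FORMAT)
        durations[match["rpc"]] = (end - start).total_seconds()
    return durations


if __name__ == "__main__":
    for rpc, seconds in rpc_durations(SAMPLE_LINE).items():
        print(f"{rpc}: {seconds:.3f}s")
```

Running the same helper over the other logs gives a quick way to compare how long each load attempt ran before it was rejected.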

related https://github.com/milvus-io/knowhere/pull/991

yah01 avatar Jul 13 '23 10:07 yah01

Verified: after inserting 100 million rows, the collection loads successfully. Verified image: master-20230719-e418ab2f

elstic avatar Jul 21 '23 02:07 elstic

Load failed again after inserting 100 million rows with the DiskANN index. image: master-20230728-c2693ea2

argo task : fouramf-concurrent-jhgfh, id : 1 case: test_concurrent_locust_100m_diskann_ddl_dql_filter_cluster

server:

fouram-15-6355-etcd-0                                             1/1     Running            0               7h17m   10.104.14.212   4am-node18   <none>           <none>
fouram-15-6355-etcd-1                                             1/1     Running            0               7h17m   10.104.23.187   4am-node27   <none>           <none>
fouram-15-6355-etcd-2                                             1/1     Running            0               7h17m   10.104.21.117   4am-node24   <none>           <none>
fouram-15-6355-milvus-datacoord-748cfb8b56-68prg                  1/1     Running            0               7h17m   10.104.19.175   4am-node28   <none>           <none>
fouram-15-6355-milvus-datanode-766b8767cf-sczqk                   1/1     Running            0               7h17m   10.104.13.130   4am-node16   <none>           <none>
fouram-15-6355-milvus-indexcoord-5d7f6bf49b-bfbcg                 1/1     Running            0               7h17m   10.104.19.179   4am-node28   <none>           <none>
fouram-15-6355-milvus-indexnode-778cb9b76c-cgmpx                  1/1     Running            0               7h17m   10.104.19.180   4am-node28   <none>           <none>
fouram-15-6355-milvus-proxy-688ddb867d-9vhvk                      1/1     Running            0               7h17m   10.104.14.203   4am-node18   <none>           <none>
fouram-15-6355-milvus-querycoord-7fbc57d7cf-76gqz                 1/1     Running            0               7h17m   10.104.14.204   4am-node18   <none>           <none>
fouram-15-6355-milvus-querynode-579f7bb7fc-xb6vd                  1/1     Running            0               7h17m   10.104.14.205   4am-node18   <none>           <none>
fouram-15-6355-milvus-rootcoord-6b9d4bdb8c-57tvc                  1/1     Running            0               7h17m   10.104.19.176   4am-node28   <none>           <none>
fouram-15-6355-minio-0                                            1/1     Running            0               7h17m   10.104.14.210   4am-node18   <none>           <none>
fouram-15-6355-minio-1                                            1/1     Running            0               7h17m   10.104.23.180   4am-node27   <none>           <none>
fouram-15-6355-minio-2                                            1/1     Running            0               7h17m   10.104.12.206   4am-node17   <none>           <none>
fouram-15-6355-minio-3                                            1/1     Running            0               7h17m   10.104.18.204   4am-node25   <none>           <none>
fouram-15-6355-pulsar-bookie-0                                    1/1     Running            0               7h17m   10.104.14.214   4am-node18   <none>           <none>
fouram-15-6355-pulsar-bookie-1                                    1/1     Running            0               7h17m   10.104.23.188   4am-node27   <none>           <none>
fouram-15-6355-pulsar-bookie-2                                    1/1     Running            0               7h17m   10.104.13.136   4am-node16   <none>           <none>
fouram-15-6355-pulsar-bookie-init-d5jhs                           0/1     Completed          0               7h17m   10.104.19.178   4am-node28   <none>           <none>
fouram-15-6355-pulsar-broker-0                                    1/1     Running            0               7h17m   10.104.13.131   4am-node16   <none>           <none>
fouram-15-6355-pulsar-proxy-0                                     1/1     Running            0               7h17m   10.104.13.132   4am-node16   <none>           <none>
fouram-15-6355-pulsar-pulsar-init-crzst                           0/1     Completed          0               7h17m   10.104.19.177   4am-node28   <none>           <none>
fouram-15-6355-pulsar-recovery-0                                  1/1     Running            0               7h17m   10.104.19.181   4am-node28   <none>           <none>
fouram-15-6355-pulsar-zookeeper-0                                 1/1     Running            0               7h17m   10.104.12.204   4am-node17   <none>           <none>
fouram-15-6355-pulsar-zookeeper-1                                 1/1     Running            0               7h16m   10.104.21.119   4am-node24   <none>           <none>
fouram-15-6355-pulsar-zookeeper-2                                 1/1     Running            0               7h15m   10.104.18.210   4am-node25   <none>           <none>

client log:

[2023-07-28 12:41:47,080 -  INFO - fouram]: [Base] Start inserting, ids: 99950000 - 99999999, data size: 100,000,000 (base.py:323)
[2023-07-28 12:41:49,792 -  INFO - fouram]: [Time] Collection.insert run in 2.7113s (api_request.py:45)
[2023-07-28 12:41:49,795 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_MlEaez5h): 99900000 (base.py:483)
[2023-07-28 12:41:49,867 -  INFO - fouram]: [Base] Total time of insert: 4070.0102s, average number of vector bars inserted per second: 24569.963, average time to insert 50000 vectors per time: 2.035s (base.py:394)
[2023-07-28 12:41:49,867 -  INFO - fouram]: [Base] Start flush collection fouram_MlEaez5h (base.py:292)
[2023-07-28 12:41:51,389 -  INFO - fouram]: [Base] Params of index: [{'float_vector': {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}}] (base.py:456)
[2023-07-28 12:41:51,390 -  INFO - fouram]: [Base] Start release collection fouram_MlEaez5h (base.py:303)
[2023-07-28 12:41:51,392 -  INFO - fouram]: [Base] Start build index of DISKANN for collection fouram_MlEaez5h, params:{'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}} (base.py:442)
[2023-07-28 18:15:03,023 -  INFO - fouram]: [Time] Index run in 19991.6305s (api_request.py:45)
[2023-07-28 18:15:03,023 -  INFO - fouram]: [CommonCases] RT of build index DISKANN: 19991.6305s (common_cases.py:96)
[2023-07-28 18:15:03,025 -  INFO - fouram]: [Base] Params of index: [{'float_vector': {'index_type': 'DISKANN', 'metric_type': 'L2', 'params': {}}}] (base.py:456)
[2023-07-28 18:15:03,026 -  INFO - fouram]: [CommonCases] Prepare index DISKANN done. (common_cases.py:99)
[2023-07-28 18:15:03,026 -  INFO - fouram]: [CommonCases] No scalars need to be indexed. (common_cases.py:107)
[2023-07-28 18:15:03,027 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_MlEaez5h): 100000000 (base.py:483)
[2023-07-28 18:15:03,027 -  INFO - fouram]: [Base] Start load collection fouram_MlEaez5h,replica_number:1,kwargs:{} (base.py:298)
[2023-07-28 18:25:03,577 - ERROR - fouram]: RPC error: [get_loading_progress], <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_MlEaez5h)>, <Time:{'RPC start': '2023-07-28 18:25:03.371498', 'RPC error': '2023-07-28 18:25:03.576887'}> (decorators.py:126)
[2023-07-28 18:25:03,578 - ERROR - fouram]: RPC error: [wait_for_loading_collection], <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_MlEaez5h)>, <Time:{'RPC start': '2023-07-28 18:15:03.125622', 'RPC error': '2023-07-28 18:25:03.578006'}> (decorators.py:126)
[2023-07-28 18:25:03,578 - ERROR - fouram]: RPC error: [load_collection], <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_MlEaez5h)>, <Time:{'RPC start': '2023-07-28 18:15:03.028096', 'RPC error': '2023-07-28 18:25:03.578186'}> (decorators.py:126)
[2023-07-28 18:25:03,579 - ERROR - fouram]: (api_response) : <MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_MlEaez5h)> (api_request.py:53)
[2023-07-28 18:25:03,579 - ERROR - fouram]: [CheckFunc] load request check failed, response:<MilvusException: (code=52, message=deny to load, insufficient memory, please allocate more resources, collectionName: fouram_MlEaez5h)> 
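
The timings in the client log above can be cross-checked with a little arithmetic. This is a minimal sketch that uses only values copied from the log; the variable names are illustrative and not part of the benchmark tooling.

```python
from datetime import datetime

# Values copied from the client log above.
total_vectors = 100_000_000
insert_seconds = 4070.0102
batch_size = 50_000
index_build_seconds = 19991.6305

# Average insert throughput and average time per 50,000-vector batch.
vectors_per_second = total_vectors / insert_seconds
seconds_per_batch = insert_seconds / (total_vectors / batch_size)

# Time between the load_collection RPC start and the RPC error.
fmt = "%Y-%m-%d %H:%M:%S.%f"
load_start = datetime.strptime("2023-07-28 18:15:03.028096", fmt)
load_error = datetime.strptime("2023-07-28 18:25:03.578186", fmt)
load_wait_seconds = (load_error - load_start).total_seconds()

# Index build time expressed in hours.
index_build_hours = index_build_seconds / 3600

print(f"{vectors_per_second:.3f} vectors/s")
print(f"{seconds_per_batch:.3f} s per batch")
print(f"{load_wait_seconds:.2f} s between load start and error")
print(f"{index_build_hours:.2f} h to build the index")
```

The figures match the log (about 24,570 vectors/s and 2.035 s per batch). The load error came back roughly ten minutes after the load request started, not immediately.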

memory usage: image

querynode error log: image

The querynode appears to have hit OOM: image

elstic avatar Jul 31 '23 02:07 elstic

image There are many small segments; this may also be related to #25928 (performance degradation).

yah01 avatar Jul 31 '23 03:07 yah01

@elstic tested this with 64 GiB of memory, and we found some problems:

After step9, the predicted memory usage dropped and came close to the actual usage: image

During step9, many small segments were being loaded, and the predicted DiskANN memory usage was higher than the actual usage; the prediction was about 32 GiB: image

We need a way to control the concurrency level in QueryNode: a load request that has already been admitted gets stuck when it cannot acquire a slot in the IO pool, yet its memory usage still counts toward the memory usage prediction.
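
To illustrate the idea of limiting load concurrency, here is a minimal Python sketch. It is not Milvus code (QueryNode is written in Go) and it is not the change made in the fix; `LoadGate`, `max_concurrent`, `memory_limit`, and the GiB figures are all hypothetical. The point it shows is that a request reserves predicted memory only after it has obtained a worker slot, so requests that are still waiting do not inflate the prediction.

```python
import threading


class LoadGate:
    """Hypothetical admission gate for segment loads (illustration only)."""

    def __init__(self, max_concurrent, memory_limit):
        self._slots = threading.Semaphore(max_concurrent)
        self._lock = threading.Lock()
        self._memory_limit = memory_limit
        self.reserved = 0  # predicted usage of loads that are actually running

    def load(self, predicted, do_load):
        # Wait for a slot first; a waiting request reserves no memory.
        with self._slots:
            with self._lock:
                if self.reserved + predicted > self._memory_limit:
                    raise MemoryError("deny to load, insufficient memory")
                self.reserved += predicted
            try:
                return do_load()
            finally:
                # Release the reservation once the load has finished.
                with self._lock:
                    self.reserved -= predicted


# Assumed numbers: 8 segments predicted at 20 GiB each, 64 GiB limit.
gate = LoadGate(max_concurrent=2, memory_limit=64)
observed = []


def fake_load():
    # Record the reserved total seen while this load is running.
    observed.append(gate.reserved)
    return "ok"


threads = [threading.Thread(target=gate.load, args=(20, fake_load)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(observed), max(observed), gate.reserved)
```

With two slots, at most 40 GiB is reserved at any moment, so all eight loads complete. If all eight requests reserved their predicted memory up front (160 GiB in total), they could not all fit under the 64 GiB limit.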

yah01 avatar Aug 01 '23 02:08 yah01

Working on this

yah01 avatar Aug 01 '23 02:08 yah01

/assign @elstic fixed by #26045

yah01 avatar Aug 01 '23 14:08 yah01

Issue fixed. Verified on image: master-20230802-830f0678

elstic avatar Aug 03 '23 02:08 elstic