
[Bug]: [benchmark][query] milvus concurrently query and returns vector, querynode restarts, query fails and reports: "fail to query on all shard leaders"

Open elstic opened this issue 1 year ago • 1 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: master-20230606-ea629228
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The client only runs concurrent queries.

argo task : fouramf-concurrent-h9xqz

06:11:00 server(before):

fouram-40-4340-etcd-0                                             1/1     Running     0               6m      10.104.17.239   4am-node23   <none>           <none>
fouram-40-4340-etcd-1                                             1/1     Running     0               5m59s   10.104.21.153   4am-node24   <none>           <none>
fouram-40-4340-etcd-2                                             1/1     Running     0               5m59s   10.104.22.82    4am-node26   <none>           <none>
fouram-40-4340-milvus-datacoord-6fc7cd4bf6-s6qx4                  1/1     Running     1 (119s ago)    6m1s    10.104.21.133   4am-node24   <none>           <none>
fouram-40-4340-milvus-datanode-5895fdf7f6-nzzvg                   1/1     Running     1 (2m1s ago)    6m1s    10.104.19.46    4am-node28   <none>           <none>
fouram-40-4340-milvus-indexcoord-667bf785c6-vz6mb                 1/1     Running     0               6m1s    10.104.18.125   4am-node25   <none>           <none>
fouram-40-4340-milvus-indexnode-799d598f59-7j52g                  1/1     Running     0               6m1s    10.104.18.126   4am-node25   <none>           <none>
fouram-40-4340-milvus-proxy-766bc877b8-vh4gf                      1/1     Running     1 (2m ago)      6m1s    10.104.21.129   4am-node24   <none>           <none>
fouram-40-4340-milvus-querycoord-577dc7f59f-t7bc7                 1/1     Running     1 (2m1s ago)    6m1s    10.104.19.45    4am-node28   <none>           <none>
fouram-40-4340-milvus-querynode-7b99dbdc45-vgqnq                  1/1     Running     0               6m1s    10.104.23.86    4am-node27   <none>           <none>
fouram-40-4340-milvus-rootcoord-5559667bb9-stpxs                  1/1     Running     1 (2m1s ago)    6m1s    10.104.23.85    4am-node27   <none>           <none>
fouram-40-4340-minio-0                                            1/1     Running     0               6m1s    10.104.21.150   4am-node24   <none>           <none>
fouram-40-4340-minio-1                                            1/1     Running     0               6m1s    10.104.15.130   4am-node20   <none>           <none>
fouram-40-4340-minio-2                                            1/1     Running     0               6m      10.104.17.240   4am-node23   <none>           <none>
fouram-40-4340-minio-3                                            1/1     Running     0               5m59s   10.104.22.80    4am-node26   <none>           <none>
fouram-40-4340-pulsar-bookie-0                                    1/1     Running     0               6m1s    10.104.21.151   4am-node24   <none>           <none>
fouram-40-4340-pulsar-bookie-1                                    1/1     Running     0               6m      10.104.17.241   4am-node23   <none>           <none>
fouram-40-4340-pulsar-bookie-2                                    1/1     Running     0               5m59s   10.104.20.235   4am-node22   <none>           <none>
fouram-40-4340-pulsar-bookie-init-r6nff                           0/1     Completed   0               6m1s    10.104.21.131   4am-node24   <none>           <none>
fouram-40-4340-pulsar-broker-0                                    1/1     Running     0               6m      10.104.15.117   4am-node20   <none>           <none>
fouram-40-4340-pulsar-proxy-0                                     1/1     Running     0               6m1s    10.104.15.116   4am-node20   <none>           <none>
fouram-40-4340-pulsar-pulsar-init-s7cv8                           0/1     Completed   0               6m1s    10.104.21.130   4am-node24   <none>           <none>
fouram-40-4340-pulsar-recovery-0                                  1/1     Running     0               6m1s    10.104.21.132   4am-node24   <none>           <none>
fouram-40-4340-pulsar-zookeeper-0                                 1/1     Running     0               6m1s    10.104.21.149   4am-node24   <none>           <none>
fouram-40-4340-pulsar-zookeeper-1                                 1/1     Running     0               4m32s   10.104.17.250   4am-node23   <none>           <none>
fouram-40-4340-pulsar-zookeeper-2                                 1/1     Running     0               3m35s   10.104.5.30     4am-node12   <none>           <none>

07:12:53 server (after):

fouram-40-4340-etcd-0                                             1/1     Running            0                67m     10.104.17.239   4am-node23   <none>           <none>
fouram-40-4340-etcd-1                                             1/1     Running            0                67m     10.104.21.153   4am-node24   <none>           <none>
fouram-40-4340-etcd-2                                             1/1     Running            0                67m     10.104.22.82    4am-node26   <none>           <none>
fouram-40-4340-milvus-datacoord-6fc7cd4bf6-s6qx4                  1/1     Running            1 (63m ago)      67m     10.104.21.133   4am-node24   <none>           <none>
fouram-40-4340-milvus-datanode-5895fdf7f6-nzzvg                   1/1     Running            1 (63m ago)      67m     10.104.19.46    4am-node28   <none>           <none>
fouram-40-4340-milvus-indexcoord-667bf785c6-vz6mb                 1/1     Running            0                67m     10.104.18.125   4am-node25   <none>           <none>
fouram-40-4340-milvus-indexnode-799d598f59-7j52g                  1/1     Running            0                67m     10.104.18.126   4am-node25   <none>           <none>
fouram-40-4340-milvus-proxy-766bc877b8-vh4gf                      1/1     Running            1 (63m ago)      67m     10.104.21.129   4am-node24   <none>           <none>
fouram-40-4340-milvus-querycoord-577dc7f59f-t7bc7                 1/1     Running            1 (63m ago)      67m     10.104.19.45    4am-node28   <none>           <none>
fouram-40-4340-milvus-querynode-7b99dbdc45-vgqnq                  0/1     Running            16 (5m14s ago)   67m     10.104.23.86    4am-node27   <none>           <none>
fouram-40-4340-milvus-rootcoord-5559667bb9-stpxs                  1/1     Running            1 (63m ago)      67m     10.104.23.85    4am-node27   <none>           <none>
fouram-40-4340-minio-0                                            1/1     Running            0                67m     10.104.21.150   4am-node24   <none>           <none>
fouram-40-4340-minio-1                                            1/1     Running            0                67m     10.104.15.130   4am-node20   <none>           <none>
fouram-40-4340-minio-2                                            1/1     Running            0                67m     10.104.17.240   4am-node23   <none>           <none>
fouram-40-4340-minio-3                                            1/1     Running            0                67m     10.104.22.80    4am-node26   <none>           <none>
fouram-40-4340-pulsar-bookie-0                                    1/1     Running            0                67m     10.104.21.151   4am-node24   <none>           <none>
fouram-40-4340-pulsar-bookie-1                                    1/1     Running            0                67m     10.104.17.241   4am-node23   <none>           <none>
fouram-40-4340-pulsar-bookie-2                                    1/1     Running            0                67m     10.104.20.235   4am-node22   <none>           <none>
fouram-40-4340-pulsar-bookie-init-r6nff                           0/1     Completed          0                67m     10.104.21.131   4am-node24   <none>           <none>
fouram-40-4340-pulsar-broker-0                                    1/1     Running            0                67m     10.104.15.117   4am-node20   <none>           <none>
fouram-40-4340-pulsar-proxy-0                                     1/1     Running            0                67m     10.104.15.116   4am-node20   <none>           <none>
fouram-40-4340-pulsar-pulsar-init-s7cv8                           0/1     Completed          0                67m     10.104.21.130   4am-node24   <none>           <none>
fouram-40-4340-pulsar-recovery-0                                  1/1     Running            0                67m     10.104.21.132   4am-node24   <none>           <none>
fouram-40-4340-pulsar-zookeeper-0                                 1/1     Running            0                67m     10.104.21.149   4am-node24   <none>           <none>
fouram-40-4340-pulsar-zookeeper-1                                 1/1     Running            0                66m     10.104.17.250   4am-node23   <none>           <none>
fouram-40-4340-pulsar-zookeeper-2                                 1/1     Running            0                65m     10.104.5.30     4am-node12   <none>           <none>
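
The two listings above can be compared mechanically to see which pods restarted during the run. The following is a minimal sketch, assuming the listings are `kubectl get pods -o wide` output with the column order shown (name, ready, status, restarts, ...); the helper names `parse_pods` and `new_restarts` are illustrative, and only two rows from each listing are reproduced to keep it short.

```python
def parse_pods(listing: str) -> dict:
    """Map pod name -> (ready, status, restarts) from pod listing text."""
    pods = {}
    for line in listing.strip().splitlines():
        parts = line.split()
        # Skip blank lines and a header row, if one is present.
        if len(parts) < 4 or parts[0] == "NAME":
            continue
        # The restart count is the 4th column; "(5m14s ago)" follows as extra tokens.
        pods[parts[0]] = (parts[1], parts[2], int(parts[3]))
    return pods


def new_restarts(before: str, after: str) -> dict:
    """Return pod name -> restarts added between the two listings."""
    old, new = parse_pods(before), parse_pods(after)
    deltas = {}
    for name, (_, _, restarts) in new.items():
        previous = old[name][2] if name in old else 0
        if restarts > previous:
            deltas[name] = restarts - previous
    return deltas


# Two rows copied from the "before" and "after" listings in this issue.
BEFORE = """
fouram-40-4340-milvus-proxy-766bc877b8-vh4gf      1/1  Running  1 (2m ago)      6m1s  10.104.21.129  4am-node24  <none>  <none>
fouram-40-4340-milvus-querynode-7b99dbdc45-vgqnq  1/1  Running  0               6m1s  10.104.23.86   4am-node27  <none>  <none>
"""
AFTER = """
fouram-40-4340-milvus-proxy-766bc877b8-vh4gf      1/1  Running  1 (63m ago)     67m   10.104.21.129  4am-node24  <none>  <none>
fouram-40-4340-milvus-querynode-7b99dbdc45-vgqnq  0/1  Running  16 (5m14s ago)  67m   10.104.23.86   4am-node27  <none>  <none>
"""

print(new_restarts(BEFORE, AFTER))
```

With these rows, only the querynode pod is reported: it went from 0 to 16 restarts and is `0/1` ready in the second listing.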

client pod: fouramf-concurrent-h9xqz-997902271

client error log: (image)

param:

{
     "dataset_params": {
          "metric_type": "L2",
          "dim": 128,
          "dataset_name": "sift",
          "dataset_size": 1000000,
          "ni_per": 50000
     },
     "collection_params": {
          "other_fields": [
               "float_1"
          ],
          "shards_num": 2
     },
     "load_params": {},
     "query_params": {},
     "search_params": {},
     "resource_groups_params": {
          "reset": false
     },
     "index_params": {
          "index_type": "HNSW",
          "index_param": {
               "M": 8,
               "efConstruction": 200
          }
     },
     "concurrent_params": {
          "concurrent_number": 20,
          "during_time": "1h",
          "interval": 20,
          "spawn_rate": null
     },
     "concurrent_tasks": [
          {
               "type": "query",
               "weight": 10,
               "params": {
                    "expr": {
                         "float_1": {
                              "GT": -1,
                              "LT": 500000
                         }
                    },
                    "output_fields": [
                         "float_vector"
                    ],
                    "ignore_growing": false,
                    "timeout": 60
               }
          }
     ]
}
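
For readers unfamiliar with the benchmark's parameter format, the `expr` dictionary above presumably ends up as a Milvus boolean filter expression. The sketch below shows one plausible translation; the `OPERATORS` table and `to_expr` helper are assumptions for illustration and are not taken from the benchmark framework. Only `GT` and `LT` appear in this issue.

```python
# Hypothetical mapping from the benchmark's operator keys to expression operators.
OPERATORS = {"GT": ">", "GE": ">=", "LT": "<", "LE": "<=", "EQ": "==", "NE": "!="}


def to_expr(expr_params: dict) -> str:
    """Join every {field: {op: value}} condition into one '&&' expression."""
    clauses = []
    for field, conditions in expr_params.items():
        for op, value in conditions.items():
            clauses.append(f"{field} {OPERATORS[op]} {value}")
    return " && ".join(clauses)


# The "expr" block from the concurrent query task in this issue.
print(to_expr({"float_1": {"GT": -1, "LT": 500000}}))
# prints: float_1 > -1 && float_1 < 500000
```

Under that assumption, each query filters on a range of `float_1` and asks for the `float_vector` field of every matching row, which is the part that makes the result set large.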

querynode metrics were not collected: (image)

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

Concurrent queries on DiskANN indexes have the same problem.

elstic avatar Jun 08 '23 07:06 elstic

/assign @jiaoew1991 /unassign

yanliang567 avatar Jun 08 '23 11:06 yanliang567

Can this issue be reproduced?

jiaoew1991 avatar Jun 14 '23 10:06 jiaoew1991

Can this issue be reproduced?

@jiaoew1991

Yes, it reproduces consistently; verified again using the image master-20230614-35cb0b5b

elstic avatar Jun 14 '23 11:06 elstic

@jiaoew1991 This log message is confusing but does not cause any problems, and congqi has fixed it in #24741. The root cause still needs to be investigated, please help with it.

chasingegg avatar Jun 15 '23 02:06 chasingegg

/assign @sunby /unassign

jiaoew1991 avatar Jun 15 '23 02:06 jiaoew1991

The querynode pod was OOM-killed because of the concurrent query requests.
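
A rough back-of-envelope estimate shows why returning vectors under concurrency can exhaust memory. This is a sketch under stated assumptions: `dim` and the concurrency come from the test parameters in this issue, but the number of rows matched per query is hypothetical (the distribution of `float_1` is not given here), and float32 storage is assumed. The figure counts raw vector payload only, not copies made during reduction or serialization.

```python
def result_bytes(matched_rows: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of the vectors returned by one query (float32 assumed)."""
    return matched_rows * dim * bytes_per_value


def concurrent_bytes(matched_rows: int, dim: int, concurrency: int) -> int:
    """Raw vector payload held at once if every concurrent query is in flight."""
    return result_bytes(matched_rows, dim) * concurrency


DIM = 128               # from dataset_params
CONCURRENCY = 20        # from concurrent_params.concurrent_number
MATCHED_ROWS = 500_000  # ASSUMPTION: hypothetical match count per query

per_query = result_bytes(MATCHED_ROWS, DIM)
total = concurrent_bytes(MATCHED_ROWS, DIM, CONCURRENCY)
print(f"per query: {per_query / 2**20:.0f} MiB, in flight: {total / 2**30:.2f} GiB")
```

The point is the scaling rather than the exact figure: payload grows linearly with both the match count and the number of concurrent clients, so the querynode memory limit (not stated in this thread) is the number to compare against.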

sunby avatar Jun 26 '23 09:06 sunby

/assign @elstic /unassign

sunby avatar Jul 14 '23 03:07 sunby

@elstic how much memory was allocated for this test?

yanliang567 avatar Jul 22 '23 07:07 yanliang567

No more OOM, but concurrent queries that return vectors still fail with an error.

Keep an eye on this issue: https://github.com/milvus-io/milvus/issues/25996

elstic avatar Aug 15 '23 02:08 elstic