
[Bug]: [benchmark][cluster]High initial query latency in Milvus multi-replica

Open · jingkl opened this issue 2 years ago · 6 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: master-20220601-63a31ccb
- Deployment mode (standalone or cluster): cluster
- SDK version (e.g. pymilvus v2.0.0rc2): pymilvus-2.1.0.dev67
- OS (Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo: benchmark-backup-psdpt-1 server-configmap server-cluster-8c64m-querynode5 client-configmap client-random-locust-search-filter-100m-ddl-r8-w2-replica5-2h

server:

NAME                                                          READY   STATUS             RESTARTS   AGE     IP             NODE                      NOMINATED NODE   READINESS GATES
benchmark-backup-psdpt-1-etcd-0                               1/1     Running            0          6m3s    10.97.16.147   qa-node013.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-etcd-1                               1/1     Running            0          6m2s    10.97.17.139   qa-node014.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-etcd-2                               1/1     Running            0          6m1s    10.97.16.149   qa-node013.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-milvus-datacoord-56bdcf467b-nxk5k    1/1     Running            1          6m3s    10.97.5.249    qa-node003.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-milvus-datanode-55b8d7c849-k22sb     1/1     Running            1          6m3s    10.97.16.144   qa-node013.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-milvus-indexcoord-68d78bcccf-f2mtm   1/1     Running            1          6m3s    10.97.5.246    qa-node003.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-milvus-indexnode-6cd54dbb9d-p55pz    1/1     Running            0          6m3s    10.97.17.125   qa-node014.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-milvus-proxy-67979b957f-qtbkl        1/1     Running            1          6m3s    10.97.5.244    qa-node003.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-milvus-querycoord-7999dccb44-89rf9   1/1     Running            1          6m3s    10.97.5.245    qa-node003.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-milvus-querynode-846468b77b-d7w6l    1/1     Running            0          6m3s    10.97.17.133   qa-node014.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-milvus-querynode-846468b77b-g4szm    1/1     Running            0          6m3s    10.97.17.134   qa-node014.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-milvus-querynode-846468b77b-hpvxq    1/1     Running            0          6m3s    10.97.11.237   qa-node009.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-milvus-querynode-846468b77b-t594w    1/1     Running            0          6m3s    10.97.17.136   qa-node014.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-milvus-querynode-846468b77b-z8j9v    1/1     Running            0          6m3s    10.97.17.137   qa-node014.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-milvus-rootcoord-84cf758b76-lxqtn    1/1     Running            1          6m3s    10.97.5.250    qa-node003.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-minio-0                              1/1     Running            0          6m3s    10.97.19.222   qa-node016.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-minio-1                              1/1     Running            0          6m3s    10.97.19.224   qa-node016.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-minio-2                              1/1     Running            0          6m3s    10.97.19.238   qa-node016.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-minio-3                              1/1     Running            0          6m2s    10.97.19.239   qa-node016.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-pulsar-bookie-0                      1/1     Running            0          6m3s    10.97.5.254    qa-node003.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-pulsar-bookie-1                      1/1     Running            0          6m3s    10.97.19.237   qa-node016.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-pulsar-bookie-2                      1/1     Running            0          6m1s    10.97.5.5      qa-node003.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-pulsar-bookie-init-5tmmd             0/1     Completed          0          6m3s    10.97.5.243    qa-node003.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-pulsar-broker-0                      1/1     Running            0          6m3s    10.97.19.219   qa-node016.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-pulsar-proxy-0                       1/1     Running            0          6m3s    10.97.19.220   qa-node016.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-pulsar-pulsar-init-2vnj7             0/1     Completed          0          6m3s    10.97.5.247    qa-node003.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-pulsar-recovery-0                    1/1     Running            0          6m3s    10.97.5.248    qa-node003.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-pulsar-zookeeper-0                   1/1     Running            0          6m3s    10.97.5.2      qa-node003.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-pulsar-zookeeper-1                   1/1     Running            0          5m26s   10.97.9.5      qa-node007.zilliz.local   <none>           <none>
benchmark-backup-psdpt-1-pulsar-zookeeper-2                   1/1     Running            0          4m27s   10.97.3.209    qa-node001.zilliz.local   <none>           <none>

argo2: server-instance benchmark-backup-lcjsc-1 server-configmap server-cluster-8c64m-querynode2 client-configmap client-random-locust-search-filter-100m-ddl-r8-w2-replica2-2h

NAME                                                          READY   STATUS        RESTARTS   AGE     IP             NODE                      NOMINATED NODE   READINESS GATES
benchmark-backup-lcjsc-1-etcd-0                               1/1     Running       0          6m17s   10.97.17.142   qa-node014.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-etcd-1                               1/1     Running       0          6m17s   10.97.16.151   qa-node013.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-etcd-2                               1/1     Running       0          6m17s   10.97.17.144   qa-node014.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-milvus-datacoord-55f5dcbf89-8w8gw    1/1     Running       1          6m17s   10.97.4.182    qa-node002.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-milvus-datanode-6bb69cf4b-dqkd9      1/1     Running       1          6m17s   10.97.19.232   qa-node016.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-milvus-indexcoord-54f4898f56-x8vlh   1/1     Running       1          6m17s   10.97.4.184    qa-node002.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-milvus-indexnode-fcc4d8bc9-6bvbj     1/1     Running       0          6m17s   10.97.20.8     qa-node018.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-milvus-proxy-5c6f48644-h6q68         1/1     Running       1          6m17s   10.97.4.185    qa-node002.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-milvus-querycoord-5f45d9456c-8lmw5   1/1     Running       1          6m17s   10.97.4.186    qa-node002.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-milvus-querynode-748c994485-r27lm    1/1     Running       0          6m17s   10.97.16.145   qa-node013.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-milvus-querynode-748c994485-v2n4k    1/1     Running       0          6m17s   10.97.12.77    qa-node015.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-milvus-rootcoord-5c8f8b85-5xg6b      1/1     Running       1          6m17s   10.97.4.183    qa-node002.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-minio-0                              1/1     Running       0          6m17s   10.97.19.242   qa-node016.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-minio-1                              1/1     Running       0          6m17s   10.97.19.243   qa-node016.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-minio-2                              1/1     Running       0          6m17s   10.97.19.246   qa-node016.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-minio-3                              1/1     Running       0          6m17s   10.97.19.247   qa-node016.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-pulsar-bookie-0                      1/1     Running       0          6m17s   10.97.3.205    qa-node001.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-pulsar-bookie-1                      1/1     Running       0          6m17s   10.97.5.8      qa-node003.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-pulsar-bookie-2                      1/1     Running       0          6m17s   10.97.18.201   qa-node017.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-pulsar-bookie-init-q7w66             0/1     Completed     0          6m17s   10.97.3.199    qa-node001.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-pulsar-broker-0                      1/1     Running       0          6m17s   10.97.9.3      qa-node007.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-pulsar-proxy-0                       1/1     Running       0          6m17s   10.97.18.196   qa-node017.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-pulsar-pulsar-init-jjsjw             0/1     Completed     0          6m17s   10.97.9.2      qa-node007.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-pulsar-recovery-0                    1/1     Running       0          6m17s   10.97.19.226   qa-node016.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-pulsar-zookeeper-0                   1/1     Running       0          6m17s   10.97.3.201    qa-node001.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-pulsar-zookeeper-1                   1/1     Running       0          5m24s   10.97.3.207    qa-node001.zilliz.local   <none>           <none>
benchmark-backup-lcjsc-1-pulsar-zookeeper-2                   1/1     Running       0          4m42s   10.97.9.7      qa-node007.zilliz.local   <none>           <none>
[Screenshot: 2022-06-02 16:48:36]

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

client-random-locust-search-filter-100m-ddl-r8-w2-replica5-2h:

config.yaml: |
  locust_random_performance:
    collections:
      -
        collection_name: sift_100m_128_l2
        # collection_name: sift_10w_128_l2
        other_fields: float1
        ni_per: 50000
        build_index: true
        index_type: ivf_sq8
        index_param:
          nlist: 2048
        load_param:
          replica_number: 5
        task:
          types:
            -
              type: query
              weight: 20
              params:
                top_k: 10
                nq: 10
                search_param:
                  nprobe: 16
                filters:
                  -
                    range: "{'range': {'float1': {'GT': -1.0, 'LT': collection_size * 0.5}}}"
            -
              type: load
              weight: 1
              params:
                replica_number: 5
            -
              type: get
              weight: 10
              params:
                ids_length: 10
            -
              type: scene_test
              weight: 2
          connection_num: 1
          clients_num: 20
          spawn_rate: 2
          during_time: 2h
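
For reference, a rough pymilvus equivalent of what each Locust client in this config drives is sketched below. This is not the benchmark code itself; the endpoint, the vector field name float_vector, and the constant 50000000.0 standing in for collection_size * 0.5 on the 100M collection are assumptions.

    from pymilvus import connections, Collection
    import random

    # Hypothetical sketch of one benchmark client, not the benchmark code itself.
    connections.connect(host="127.0.0.1", port="19530")   # endpoint is an assumption

    collection = Collection("sift_100m_128_l2")

    # build_index: ivf_sq8 with nlist 2048, as in the config above
    collection.create_index(
        field_name="float_vector",   # vector field name is an assumption
        index_params={"index_type": "IVF_SQ8", "metric_type": "L2",
                      "params": {"nlist": 2048}},
    )

    # load_param: replica_number 5
    collection.load(replica_number=5)

    # one "query" task: nq=10, top_k=10, nprobe=16, plus the float1 range filter
    # (50000000.0 stands in for collection_size * 0.5 on the 100M collection)
    vectors = [[random.random() for _ in range(128)] for _ in range(10)]
    results = collection.search(
        data=vectors,
        anns_field="float_vector",
        param={"metric_type": "L2", "params": {"nprobe": 16}},
        limit=10,
        expr="float1 > -1.0 && float1 < 50000000.0",
    )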

jingkl · Jun 02 '22 08:06

@jingkl you won't get a performance boost if the collection is large and the data can be separated evenly across 5 nodes. Not sure this is an issue, because originally the data was already spread evenly across the 5 nodes.

xiaofan-luan · Jun 02 '22 09:06

Multiple replicas only help in the small-dataset case, for example when you ingest 1M entities into a huge cluster and want to double your performance.

xiaofan-luan · Jun 02 '22 09:06

you won't get a performance boost if the collection is large and the data can be separated evenly across 5 nodes

Why? Shouldn't the query time be faster with 5 replicas of the same-size dataset? @xiaofan-luan

jingkl · Jun 02 '22 09:06

you won't get a performance boost if the collection is large and the data can be separated evenly across 5 nodes

Why? Shouldn't the query time be faster with 5 replicas of the same-size dataset? @xiaofan-luan

Because even if you have only 1 replica, all 5 query nodes load data. Say you have 10 segments: each query node loads 2 segments. Once you change to 5 replicas, each node loads all 10 segments. The case where multiple replicas help is when you have only 1 segment but 5 query nodes: only one query node can load the data, while the rest of the query nodes have no data to serve.
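
A back-of-the-envelope sketch of this segment-distribution argument, using illustrative numbers (10 segments, 5 query nodes) rather than figures measured from this benchmark:

    # Illustrative numbers only; not measured from this benchmark.
    segments = 10      # sealed segments in the collection
    query_nodes = 5    # query nodes in the cluster

    for replicas in (1, 5):
        # every replica holds the full set of segments, spread over the
        # query nodes assigned to that replica group
        nodes_per_group = query_nodes // replicas
        segments_per_node = segments * replicas // query_nodes
        print(f"replicas={replicas}: ~{segments_per_node} segments per query node "
              f"({nodes_per_group} node(s) per replica group)")

    # replicas=1: ~2 segments per query node (5 node(s) per replica group)
    # replicas=5: ~10 segments per query node (1 node(s) per replica group)

In other words, with 5 replicas each node scans roughly 5x as many segments per request, so an individual search does not get faster.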

xiaofan-luan · Jun 02 '22 11:06

Because even if you have only 1 replica, all 5 query nodes load data. Say you have 10 segments: each query node loads 2 segments. Once you change to 5 replicas, each node loads all 10 segments. The case where multiple replicas help is when you have only 1 segment but 5 query nodes: only one query node can load the data, while the rest of the query nodes have no data to serve.

I understand. But the query latency is very high at first; what is the reason for that?

jingkl · Jun 02 '22 12:06

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] · Jul 22 '22 01:07

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] · Sep 10 '22 08:09