
[Bug]: [benchmark][cluster] Milvus reinstall with same pvc, search, query raise an error "<_MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.DEADLINE_EXCEEDED"

Open jingkl opened this issue 2 years ago • 7 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version:2.2.0-20221118-b494b564
- Deployment mode(standalone or cluster):cluster
- SDK version(e.g. pymilvus v2.0.0rc2):2.2.0dev72
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

After restart: server-instance fouram-tag-no-clean-p4l24-1, server-configmap server-cluster-8c64m-querynode2, client-configmap client-random-locust-100m-hnsw-ddl-r8-w2-12h-con, image: 2.2.0-20221117-c65306bc, pymilvus: 2.2.0dev70

Reinstalled with image 2.2.0-20221118-b494b564: server-instance fouram-tag-no-clean-z2mmt-1, server-configmap server-cluster-8c64m-querynode2-kafka, client-configmap client-random-locust-100m-hnsw-ddl-r8-w2-60h-con

server:

fouram-tag-no-clean-z2mmt-1-etcd-0                                1/1     Running     0               2d13h   10.104.5.118   4am-node12   <none>           <none>
fouram-tag-no-clean-z2mmt-1-etcd-1                                1/1     Running     0               2d13h   10.104.9.5     4am-node14   <none>           <none>
fouram-tag-no-clean-z2mmt-1-etcd-2                                1/1     Running     0               2d13h   10.104.1.8     4am-node10   <none>           <none>
fouram-tag-no-clean-z2mmt-1-kafka-0                               2/2     Running     0               2d13h   10.104.5.120   4am-node12   <none>           <none>
fouram-tag-no-clean-z2mmt-1-kafka-1                               2/2     Running     0               2d13h   10.104.9.7     4am-node14   <none>           <none>
fouram-tag-no-clean-z2mmt-1-kafka-2                               2/2     Running     0               2d13h   10.104.6.47    4am-node13   <none>           <none>
fouram-tag-no-clean-z2mmt-1-kafka-exporter-85d6df8b68-gqj7t       1/1     Running     4 (2d13h ago)   2d13h   10.104.6.42    4am-node13   <none>           <none>
fouram-tag-no-clean-z2mmt-1-milvus-datacoord-78f6c94c78-nq2dz     1/1     Running     0               2d13h   10.104.4.232   4am-node11   <none>           <none>
fouram-tag-no-clean-z2mmt-1-milvus-datanode-79dc5fd57c-drnbm      1/1     Running     0               2d13h   10.104.4.233   4am-node11   <none>           <none>
fouram-tag-no-clean-z2mmt-1-milvus-indexcoord-75587c7897-5qwxn    1/1     Running     0               2d13h   10.104.6.43    4am-node13   <none>           <none>
fouram-tag-no-clean-z2mmt-1-milvus-indexnode-86b944456b-kmwlp     1/1     Running     0               2d13h   10.104.6.45    4am-node13   <none>           <none>
fouram-tag-no-clean-z2mmt-1-milvus-proxy-8649b79c89-7tmbv         1/1     Running     0               2d13h   10.104.6.39    4am-node13   <none>           <none>
fouram-tag-no-clean-z2mmt-1-milvus-querycoord-77d97b478-cw7rz     1/1     Running     0               2d13h   10.104.6.41    4am-node13   <none>           <none>
fouram-tag-no-clean-z2mmt-1-milvus-querynode-678545b976-9pk7n     1/1     Running     0               2d13h   10.104.4.230   4am-node11   <none>           <none>
fouram-tag-no-clean-z2mmt-1-milvus-querynode-678545b976-h8djj     1/1     Running     0               2d13h   10.104.6.40    4am-node13   <none>           <none>
fouram-tag-no-clean-z2mmt-1-milvus-rootcoord-56bb644b5b-n2k4c     1/1     Running     0               2d13h   10.104.4.231   4am-node11   <none>           <none>
fouram-tag-no-clean-z2mmt-1-minio-0                               1/1     Running     0               2d13h   10.104.5.117   4am-node12   <none>           <none>
fouram-tag-no-clean-z2mmt-1-minio-1                               1/1     Running     0               2d13h   10.104.9.6     4am-node14   <none>           <none>
fouram-tag-no-clean-z2mmt-1-minio-2                               1/1     Running     0               2d13h   10.104.1.9     4am-node10   <none>           <none>
fouram-tag-no-clean-z2mmt-1-minio-3                               1/1     Running     0               2d13h   10.104.6.44    4am-node13   <none>           <none>
fouram-tag-no-clean-z2mmt-1-zookeeper-0                           1/1     Running     0               2d13h   10.104.6.46    4am-node13   <none>           <none>
fouram-tag-no-clean-z2mmt-1-zookeeper-1                           1/1     Running     0               2d13h   10.104.4.234   4am-node11   <none>           <none>
fouram-tag-no-clean-z2mmt-1-zookeeper-2                           1/1     Running     0               2d13h   10.104.5.119   4am-node12   <none>           <none>

client log: see attached screenshot (Screenshot 2022-11-21 10 30 48)

Expected Behavior

No response

Steps To Reproduce

1. create a collection
2. build hnsw index
3. insert 100m data
4. build index again
5. load collection
6. search, load, query, scene_test all run normally
7. uninstall Milvus, reinstall Milvus with 2.2.0-20221118-b494b564, stop for 1 hour
8. search, query raise the error (a minimal pymilvus sketch of these steps follows)
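
For reference, a minimal pymilvus sketch of the reproduction steps above (the endpoint, field names, and the tiny insert batch are placeholders, not the actual fouram benchmark code; the real run inserts 100M vectors in batches of ni_per=50000):

```python
# Hypothetical repro sketch; host/port, field names and batch size are assumptions.
import numpy as np
from pymilvus import (connections, Collection, CollectionSchema,
                      FieldSchema, DataType)

connections.connect(host="127.0.0.1", port="19530")        # assumed Milvus endpoint

# 1. create a collection
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="float_vector", dtype=DataType.FLOAT_VECTOR, dim=128),
]
collection = Collection("sift_100m_128_l2", CollectionSchema(fields))

# 2. / 4. build the HNSW index with the parameters from the client configmap
collection.create_index("float_vector", {
    "index_type": "HNSW", "metric_type": "L2",
    "params": {"M": 8, "efConstruction": 200},
})

# 3. insert data (one small batch shown; the benchmark inserts 100M rows)
vectors = np.random.random((1000, 128)).tolist()
collection.insert([list(range(1000)), vectors])

# 5. load the collection
collection.load()

# 6. / 8. search and query; after the reinstall these calls timed out with
#         StatusCode.DEADLINE_EXCEEDED
collection.search(vectors[:10], "float_vector",
                  {"metric_type": "L2", "params": {"ef": 16}}, limit=10)
collection.query(expr="id in [0, 1, 2]")
```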

Milvus Log

No response

Anything else?

data:
  config.yaml: |
    locust_random_concurrent_performance:
      collections:
        -
          collection_name: sift_100m_128_l2
          ni_per: 50000
          build_index: true
          index_type: hnsw
          index_param:
            M: 8
            efConstruction: 200
          task:
            types:
              -
                type: query
                weight: 8
                params:
                  top_k: 10
                  nq: 10
                  search_param:
                    ef: 16
              -
                type: load
                weight: 1
              -
                type: get
                weight: 8
                params:
                  ids_length: 10
              -
                type: scene_test
                weight: 2
            connection_num: 1
            clients_num: 20
            spawn_rate: 2
            during_time: 60h
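
The config above drives a weighted, concurrent Locust-style workload. A rough sketch of how those weights could map onto Locust tasks is below; the class, endpoint, and task bodies are illustrative assumptions, not the actual fouram client:

```python
from locust import User, constant, task
from pymilvus import Collection, connections

class MilvusBenchUser(User):
    """Illustrative user mirroring the task weights in config.yaml."""
    wait_time = constant(0)

    def on_start(self):
        connections.connect(host="127.0.0.1", port="19530")   # assumed endpoint
        self.collection = Collection("sift_100m_128_l2")       # existing collection

    @task(8)        # type: query, weight: 8 (top_k=10, nq=10, ef=16)
    def search(self):
        self.collection.search([[0.0] * 128] * 10, "float_vector",
                               {"metric_type": "L2", "params": {"ef": 16}},
                               limit=10)

    @task(1)        # type: load, weight: 1
    def load(self):
        self.collection.load()

    @task(8)        # type: get, weight: 8 (ids_length=10)
    def get(self):
        self.collection.query(expr="id in [0,1,2,3,4,5,6,7,8,9]")

    @task(2)        # type: scene_test, weight: 2 (stand-in operation)
    def scene_test(self):
        _ = self.collection.num_entities
```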

jingkl (Nov 21 '22 02:11)

/assign @jiaoew1991 /unassign

yanliang567 (Nov 21 '22 03:11)

/assign @weiliu1031 /unassign please look at this issue

jiaoew1991 (Nov 23 '22 04:11)

  1. The query coord side hits a minute-level timestamp lag, so when a search/query task is submitted it has to wait until the timestamp becomes serviceable.
  2. For now, we found that the consumed timestamp on the data node is already delayed (see the attached screenshot).

Next, I will check the reason for the timestamp delay.

weiliu1031 (Nov 24 '22 08:11)

For now, we found that the root coord sends tt (time tick) messages to Kafka with a higher latency than expected (200 ms); see the attached screenshot.

weiliu1031 (Nov 24 '22 09:11)

Conclusion: the data node consumes from Kafka much more slowly than the query node, so the data node forwards older delete messages to the query node, which leaves the query node delta channel's tsafe minutes behind the dml channel. When a task is submitted, it waits for the timestamp to become serviceable until the 60 s timeout is hit.

root cause: Kafka's read and write performance appears to be unstable
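
For illustration only (this is not the actual Milvus code): the failure mode described above boils down to the query node blocking until the channel tsafe reaches the request's guarantee timestamp, and giving up at the 60 s deadline when tsafe lags by minutes:

```python
import time

def wait_until_serviceable(get_tsafe, guarantee_ts, timeout_s=60.0, poll_s=0.2):
    """Block until tsafe catches up with guarantee_ts, or time out.

    get_tsafe: callable returning the current tsafe of the delta channel.
    When tsafe is minutes behind the dml channel (as in this issue), the
    deadline is reached and the client sees DEADLINE_EXCEEDED.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_tsafe() >= guarantee_ts:
            return
        time.sleep(poll_s)
    raise TimeoutError(f"tsafe not serviceable within {timeout_s:.0f}s")
```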

weiliu1031 (Nov 24 '22 10:11)

Conclusion: the data node consumes from Kafka much more slowly than the query node, so the data node forwards older delete messages to the query node, which leaves the query node delta channel's tsafe minutes behind the dml channel. When a task is submitted, it waits for the timestamp to become serviceable until the 60 s timeout is hit.

root cause: Kafka's read and write performance appears to be unstable

Question: why does the Kafka produce take so long? If this happens in a production environment, how do we catch up? Maybe we can skip some of the tts if we are catching up?

xiaofan-luan (Dec 09 '22 03:12)

For instance, if the delta tt lags behind (say, by 10 minutes), the resend logic could run every 1 min rather than every 200 ms; a sketch of this catch-up idea follows.
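
One possible reading of the catch-up suggestion, as an illustrative sketch only (the threshold and replay model are assumptions, not Milvus behavior): when the backlog of time ticks grows large, jump straight to the newest tick instead of replaying every intermediate one.

```python
def next_time_tick(pending_tts, lag_threshold):
    """Pick the next time tick to apply from a backlog.

    pending_tts: pending tick timestamps, oldest first.
    lag_threshold: if the newest tick is further than this ahead of the
    oldest, skip the intermediate ticks and apply the newest directly.
    """
    if not pending_tts:
        return None
    oldest, newest = pending_tts[0], pending_tts[-1]
    if newest - oldest > lag_threshold:
        return newest      # catching up: skip intermediate tts
    return oldest          # normal case: apply ticks in order
```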

xiaofan-luan (Dec 09 '22 03:12)

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] (Aug 02 '23 05:08)

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] (Sep 03 '23 18:09)