
[Bug]: [benchmark][cluster] Milvus datanode memory grows suddenly during inserts

Open · jingkl opened this issue 2 years ago • 14 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: master-20220617-074ec306
- Deployment mode (standalone or cluster): cluster
- SDK version (e.g. pymilvus v2.0.0rc2): 2.1.0dev78
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior

argo test-etcd-no-clean-zmhf6 server-configmap server-cluster-8c64m client-configmap client-random-locust-100m-ddl-r8-w2

[screenshot 2022-06-20 18:12:59]

server:

test-etcd-no-clean-zmhf6-1-0                                    1/1     Running     0          4m12s   10.97.17.140   qa-node014.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-1                                    1/1     Running     0          4m12s   10.97.16.218   qa-node013.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-2                                    1/1     Running     0          4m12s   10.97.17.145   qa-node014.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-milvus-datacoord-fd959869f-fsdp7     1/1     Running     0          4m12s   10.97.5.217    qa-node003.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-milvus-datanode-f58fd88c4-r82w5      1/1     Running     0          4m12s   10.97.20.196   qa-node018.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-milvus-indexcoord-585468c64-gb4wh    1/1     Running     0          4m12s   10.97.5.216    qa-node003.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-milvus-indexnode-84965bc4bf-kssm4    1/1     Running     0          4m12s   10.97.11.27    qa-node009.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-milvus-proxy-574875d9fd-h9lxp        1/1     Running     0          4m12s   10.97.5.215    qa-node003.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-milvus-querycoord-858d8c9c95-wfpdg   1/1     Running     0          4m12s   10.97.4.163    qa-node002.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-milvus-querynode-7fb886d44c-pwm4p    1/1     Running     0          4m12s   10.97.17.138   qa-node014.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-milvus-rootcoord-7cbb97f4f5-8sfnl    1/1     Running     0          4m12s   10.97.4.162    qa-node002.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-minio-0                              1/1     Running     0          4m12s   10.97.19.174   qa-node016.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-minio-1                              1/1     Running     0          4m12s   10.97.19.175   qa-node016.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-minio-2                              1/1     Running     0          4m11s   10.97.19.178   qa-node016.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-minio-3                              1/1     Running     0          4m11s   10.97.19.180   qa-node016.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-pulsar-bookie-0                      1/1     Running     0          4m12s   10.97.19.176   qa-node016.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-pulsar-bookie-1                      1/1     Running     0          4m12s   10.97.17.144   qa-node014.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-pulsar-bookie-2                      1/1     Running     0          4m11s   10.97.16.221   qa-node013.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-pulsar-bookie-init-hnmk2             0/1     Completed   0          4m12s   10.97.17.137   qa-node014.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-pulsar-broker-0                      1/1     Running     0          4m12s   10.97.10.94    qa-node008.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-pulsar-proxy-0                       1/1     Running     0          4m12s   10.97.19.169   qa-node016.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-pulsar-pulsar-init-5mbnm             0/1     Completed   0          4m12s   10.97.11.26    qa-node009.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-pulsar-recovery-0                    1/1     Running     0          4m12s   10.97.20.195   qa-node018.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-pulsar-zookeeper-0                   1/1     Running     0          4m12s   10.97.10.96    qa-node008.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-pulsar-zookeeper-1                   1/1     Running     0          3m32s   10.97.19.182   qa-node016.zilliz.local   <none>           <none>
test-etcd-no-clean-zmhf6-1-pulsar-zookeeper-2                   1/1     Running     0          3m      10.97.11.29    qa-node009.zilliz.local   <none>           <none>

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

jingkl avatar Jun 20 '22 10:06 jingkl

server-instance test-etcd-no-clean-p9vrm-1 server-configmap server-cluster-8c64m client-configmap client-random-locust-100m-ddl-r8-w2-1h

master-20220620-f123d657 2.1.0dev78

test-etcd-no-clean-p9vrm-1-0                                    1/1     Running     6          72m    10.97.17.160   qa-node014.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-1                                    1/1     Running     4          72m    10.97.16.209   qa-node013.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-2                                    1/1     Running     0          72m    10.97.17.161   qa-node014.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-milvus-datacoord-7dbdc67476-29pdk    1/1     Running     12         72m    10.97.10.161   qa-node008.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-milvus-datanode-88b998d5c-2mbrd      1/1     Running     13         72m    10.97.17.157   qa-node014.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-milvus-indexcoord-7d6985579f-9q2vk   1/1     Running     13         72m    10.97.12.57    qa-node015.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-milvus-indexnode-9496f574b-njp6t     1/1     Running     10         72m    10.97.20.220   qa-node018.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-milvus-proxy-589494fd79-b94wt        1/1     Running     13         72m    10.97.10.160   qa-node008.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-milvus-querycoord-58b755795b-mdjwc   1/1     Running     12         72m    10.97.11.220   qa-node009.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-milvus-querynode-594b594d44-6gvd9    1/1     Running     10         72m    10.97.11.221   qa-node009.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-milvus-rootcoord-69c77fcf9b-pckmk    1/1     Running     12         72m    10.97.12.58    qa-node015.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-minio-0                              1/1     Running     0          72m    10.97.19.73    qa-node016.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-minio-1                              1/1     Running     0          72m    10.97.19.78    qa-node016.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-minio-2                              1/1     Running     0          72m    10.97.19.75    qa-node016.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-minio-3                              1/1     Running     0          72m    10.97.19.97    qa-node016.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-pulsar-bookie-0                      1/1     Running     0          72m    10.97.10.176   qa-node008.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-pulsar-bookie-1                      1/1     Running     0          72m    10.97.19.79    qa-node016.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-pulsar-bookie-2                      1/1     Running     0          72m    10.97.10.181   qa-node008.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-pulsar-bookie-init-64897             0/1     Completed   0          72m    10.97.12.55    qa-node015.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-pulsar-broker-0                      1/1     Running     0          72m    10.97.12.56    qa-node015.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-pulsar-proxy-0                       1/1     Running     0          72m    10.97.19.64    qa-node016.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-pulsar-pulsar-init-nffqv             0/1     Completed   0          72m    10.97.10.162   qa-node008.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-pulsar-recovery-0                    1/1     Running     0          72m    10.97.12.54    qa-node015.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-pulsar-zookeeper-0                   1/1     Running     0          72m    10.97.10.175   qa-node008.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-pulsar-zookeeper-1                   1/1     Running     0          58m    10.97.10.183   qa-node008.zilliz.local   <none>           <none>
test-etcd-no-clean-p9vrm-1-pulsar-zookeeper-2                   1/1     Running     0          16m    10.97.5.25     qa-node003.zilliz.local   <none>           <none>
[screenshot 2022-06-20 18:18:07]

jingkl avatar Jun 20 '22 10:06 jingkl

I have discovered that the memory peaks match exactly with the compactions (a sketch for extracting the start times follows the log excerpt below):

[screenshot: memory peaks coinciding with compactions]

Compaction log:

Jun 20, 2022 @ 19:37:26.900	[2022/06/20 11:37:26.900 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434038386687475714] ["timeout in seconds"=180]

Jun 20, 2022 @ 19:36:29.081	[2022/06/20 11:36:29.081 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434038371535290370] ["timeout in seconds"=180]

Jun 20, 2022 @ 19:29:49.278	[2022/06/20 11:29:49.277 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434038266717011970] ["timeout in seconds"=180]

Jun 20, 2022 @ 19:29:45.824	[2022/06/20 11:29:45.824 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434038265812615170] ["timeout in seconds"=180]

Jun 20, 2022 @ 19:23:04.028	[2022/06/20 11:23:04.028 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434038160496263169] ["timeout in seconds"=180]

Jun 20, 2022 @ 19:23:00.650	[2022/06/20 11:23:00.650 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434038159605235714] ["timeout in seconds"=180]

Jun 20, 2022 @ 19:16:26.379	[2022/06/20 11:16:26.379 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434038056254701570] ["timeout in seconds"=180]

Jun 20, 2022 @ 19:16:23.030	[2022/06/20 11:16:23.030 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434038055376519170] ["timeout in seconds"=180]

Jun 20, 2022 @ 19:09:39.579	[2022/06/20 11:09:39.579 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434037949614522369] ["timeout in seconds"=180]

Jun 20, 2022 @ 19:09:28.634	[2022/06/20 11:09:28.634 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434037946744307714] ["timeout in seconds"=180]

Jun 20, 2022 @ 19:02:30.435	[2022/06/20 11:02:30.435 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434037837115686914] ["timeout in seconds"=180]

Jun 20, 2022 @ 19:02:23.626	[2022/06/20 11:02:23.626 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434037835319738370] ["timeout in seconds"=180]

Jun 20, 2022 @ 18:55:48.090	[2022/06/20 10:55:48.090 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434037731641786372] ["timeout in seconds"=180]

Jun 20, 2022 @ 18:55:48.089	[2022/06/20 10:55:48.089 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434037731641786370] ["timeout in seconds"=180]

Jun 20, 2022 @ 18:49:04.882	[2022/06/20 10:49:04.882 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434037625945587715] ["timeout in seconds"=180]

Jun 20, 2022 @ 18:49:04.881	[2022/06/20 10:49:04.881 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434037625945587713] ["timeout in seconds"=180]

Jun 20, 2022 @ 18:41:48.291	[2022/06/20 10:41:48.291 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434037511493255170] ["timeout in seconds"=180]

Jun 20, 2022 @ 18:41:48.291	[2022/06/20 10:41:48.291 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434037511493255172] ["timeout in seconds"=180]

Jun 20, 2022 @ 18:33:53.690	[2022/06/20 10:33:53.690 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434037387079974916] ["timeout in seconds"=180]

Jun 20, 2022 @ 18:33:53.689	[2022/06/20 10:33:53.689 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434037387079974914] ["timeout in seconds"=180]

Jun 20, 2022 @ 18:26:01.345	[2022/06/20 10:26:01.344 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434037263243149314] ["timeout in seconds"=180]

Jun 20, 2022 @ 18:26:00.226	[2022/06/20 10:26:00.226 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434037262954528774] ["timeout in seconds"=180]

Jun 20, 2022 @ 18:18:11.283	[2022/06/20 10:18:11.283 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434037140035207170] ["timeout in seconds"=180]

Jun 20, 2022 @ 18:18:11.283	[2022/06/20 10:18:11.283 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434037140035207172] ["timeout in seconds"=180]

Jun 20, 2022 @ 18:11:26.900	[2022/06/20 10:11:26.900 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434037034024435719] ["timeout in seconds"=180]

Jun 20, 2022 @ 18:10:30.628	[2022/06/20 10:10:30.628 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434037019278573569] ["timeout in seconds"=180]

Jun 20, 2022 @ 18:02:43.087	[2022/06/20 10:02:43.087 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434036896713408517] ["timeout in seconds"=180]

Jun 20, 2022 @ 18:02:43.080	[2022/06/20 10:02:43.080 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434036896713408514] ["timeout in seconds"=180]

Jun 20, 2022 @ 17:54:42.080	[2022/06/20 09:54:42.080 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434036770621882371] ["timeout in seconds"=180]

Jun 20, 2022 @ 17:54:42.078	[2022/06/20 09:54:42.078 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434036770608775170] ["timeout in seconds"=180]

Jun 20, 2022 @ 17:46:56.773	[2022/06/20 09:46:56.773 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434036648633171970] ["timeout in seconds"=180]

Jun 20, 2022 @ 17:46:55.431	[2022/06/20 09:46:55.431 +00:00] [DEBUG] [compactor.go:350] ["compaction start"] [planID=434036648292646914] ["timeout in seconds"=180]
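For anyone reproducing the correlation, the compaction start times can be pulled straight out of the datanode log and overlaid on the memory graph. A minimal sketch, assuming the deployment name derived from the pod listing above:

```
# Print the leading [date time] of every "compaction start" event in the
# datanode log; each timestamp should land on one of the memory spikes.
kubectl logs deployment/test-etcd-no-clean-zmhf6-1-milvus-datanode \
  | grep '"compaction start"' \
  | awk '{print $1, $2}'
```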



Prometheus link

log link

soothing-rain avatar Jun 20 '22 15:06 soothing-rain

434038265812615170

Good catch. We probably want to log what's inside each compaction plan. Meanwhile, we will quickly go through the compaction code path to see if there is anything we can improve to avoid memory copies.

xiaofan-luan avatar Jun 21 '22 01:06 xiaofan-luan

We might have an issue with how we calculate which segments are right to compact.

xiaofan-luan avatar Jun 21 '22 05:06 xiaofan-luan

/assign @xiaofan-luan

yanliang567 avatar Jun 21 '22 09:06 yanliang567

/assign @jingkl

xiaofan-luan avatar Jun 22 '22 13:06 xiaofan-luan

Please help with verification. All index/data node OOMs should be fixed by #17689.
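One rough way to watch datanode memory while the workload re-runs, assuming metrics-server is installed so `kubectl top` works (the grep pattern matches the pod names in the listings in this thread):

```
# Sample datanode container memory every 30 seconds during the locust run.
watch -n 30 'kubectl top pod | grep milvus-datanode'
```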

xiaofan-luan avatar Jun 22 '22 13:06 xiaofan-luan

argo test-etcd-no-clean-qrlvn-1 server-configmap server-cluster-8c64m client-configmap client-random-locust-100m-ddl-r8-w2-12h

master-20220622-6fdf88f4 pymilvus 2.1.0dev78

benchmark-tag-no-clean-56dcc-1-etcd-0                             1/1     Running     0          5m59s   10.97.17.236   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-etcd-1                             1/1     Running     0          5m59s   10.97.16.92    qa-node013.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-etcd-2                             1/1     Running     0          5m59s   10.97.16.102   qa-node013.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-milvus-datacoord-56cfc46df96cxg9   1/1     Running     0          5m59s   10.97.3.192    qa-node001.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-milvus-datanode-57d79c494b-rpklt   1/1     Running     1          5m59s   10.97.17.228   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-milvus-indexcoord-65c855f49l7v6c   1/1     Running     0          5m59s   10.97.17.219   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-milvus-indexnode-685c969fbfmmcgw   1/1     Running     0          5m59s   10.97.17.218   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-milvus-proxy-6c99589c4b-jhjrd      1/1     Running     1          5m59s   10.97.17.233   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-milvus-querycoord-67df8f96bjhbxq   1/1     Running     1          5m59s   10.97.17.226   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-milvus-querynode-6f86c549674xt4b   1/1     Running     0          5m59s   10.97.17.230   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-milvus-rootcoord-6c5446867-dh562   1/1     Running     1          5m59s   10.97.17.221   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-minio-0                            1/1     Running     0          5m59s   10.97.19.204   qa-node016.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-minio-1                            1/1     Running     0          5m59s   10.97.19.201   qa-node016.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-minio-2                            1/1     Running     0          5m59s   10.97.19.206   qa-node016.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-minio-3                            1/1     Running     0          5m58s   10.97.19.210   qa-node016.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-pulsar-bookie-0                    1/1     Running     0          5m59s   10.97.5.195    qa-node003.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-pulsar-bookie-1                    1/1     Running     0          5m58s   10.97.16.104   qa-node013.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-pulsar-bookie-2                    1/1     Running     0          5m58s   10.97.20.112   qa-node018.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-pulsar-bookie-init-bfkjp           0/1     Completed   0          5m59s   10.97.17.232   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-pulsar-broker-0                    1/1     Running     0          5m59s   10.97.17.234   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-pulsar-proxy-0                     1/1     Running     0          5m59s   10.97.17.231   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-pulsar-pulsar-init-2kt2c           0/1     Completed   0          5m59s   10.97.17.229   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-pulsar-recovery-0                  1/1     Running     0          5m59s   10.97.17.227   qa-node014.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-pulsar-zookeeper-0                 1/1     Running     0          5m59s   10.97.3.194    qa-node001.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-pulsar-zookeeper-1                 1/1     Running     0          5m20s   10.97.12.47    qa-node015.zilliz.local   <none>           <none>
benchmark-tag-no-clean-56dcc-1-pulsar-zookeeper-2                 1/1     Running     0          4m51s   10.97.9.158    qa-node007.zilliz.local   <none>           <none>

As the graph shows, datanode memory usage no longer OOMs. This is 12 hours of datanode memory usage for 100 million entities:

[screenshot 2022-06-23 14:10:54]

jingkl avatar Jun 23 '22 06:06 jingkl

argo test-etcd-no-clean-n5wpb-1 server-configmap server-cluster-8c64m-kafka client-configmap client-random-locust-100m-ddl-r8-w2

master-20220622-b4f21259 pymilvus 2.1.0dev78

test-etcd-no-clean-n5wpb-1-0                                      1/1     Running     0          3m31s   10.97.17.254   qa-node014.zilliz.local   <none>           <none>
test-etcd-no-clean-n5wpb-1-1                                      1/1     Running     0          3m31s   10.97.16.143   qa-node013.zilliz.local   <none>           <none>
test-etcd-no-clean-n5wpb-1-2                                      1/1     Running     0          3m31s   10.97.17.2     qa-node014.zilliz.local   <none>           <none>
test-etcd-no-clean-n5wpb-1-kafka-0                                1/1     Running     2          3m31s   10.97.19.211   qa-node016.zilliz.local   <none>           <none>
test-etcd-no-clean-n5wpb-1-kafka-1                                1/1     Running     2          3m31s   10.97.18.247   qa-node017.zilliz.local   <none>           <none>
test-etcd-no-clean-n5wpb-1-kafka-2                                1/1     Running     1          3m31s   10.97.4.195    qa-node002.zilliz.local   <none>           <none>
test-etcd-no-clean-n5wpb-1-milvus-datacoord-5b54ccbbfd-9cghr      1/1     Running     0          3m31s   10.97.11.61    qa-node009.zilliz.local   <none>           <none>
test-etcd-no-clean-n5wpb-1-milvus-datanode-6cc89595b9-xsprv       1/1     Running     0          3m31s   10.97.16.141   qa-node013.zilliz.local   <none>           <none>
test-etcd-no-clean-n5wpb-1-milvus-indexcoord-7954b98675-g76c7     1/1     Running     0          3m32s   10.97.3.134    qa-node001.zilliz.local   <none>           <none>
test-etcd-no-clean-n5wpb-1-milvus-indexnode-746f6b9bf6-bdstb      1/1     Running     0          3m31s   10.97.17.251   qa-node014.zilliz.local   <none>           <none>
test-etcd-no-clean-n5wpb-1-milvus-proxy-58577857d9-2qhns          1/1     Running     0          3m32s   10.97.18.245   qa-node017.zilliz.local   <none>           <none>
test-etcd-no-clean-n5wpb-1-milvus-querycoord-5dcd77bc68-mlgqz     1/1     Running     0          3m32s   10.97.12.85    qa-node015.zilliz.local   <none>           <none>
test-etcd-no-clean-n5wpb-1-milvus-querynode-675fcf996b-r92j5      1/1     Running     0          3m32s   10.97.20.216   qa-node018.zilliz.local   <none>           <none>
test-etcd-no-clean-n5wpb-1-milvus-rootcoord-59fd64679d-77p94      1/1     Running     0          3m32s   10.97.18.246   qa-node017.zilliz.local   <none>           <none>
test-etcd-no-clean-n5wpb-1-minio-0                                1/1     Running     0          3m31s   10.97.12.89    qa-node015.zilliz.local   <none>           <none>
test-etcd-no-clean-n5wpb-1-minio-1                                1/1     Running     0          3m31s   10.97.19.227   qa-node016.zilliz.local   <none>           <none>
test-etcd-no-clean-n5wpb-1-minio-2                                1/1     Running     0          3m31s   10.97.19.225   qa-node016.zilliz.local   <none>           <none>
test-etcd-no-clean-n5wpb-1-minio-3                                1/1     Running     0          3m31s   10.97.19.229   qa-node016.zilliz.local   <none>           <none>
test-etcd-no-clean-n5wpb-1-zookeeper-0                            1/1     Running     0          3m31s   10.97.3.135    qa-node001.zilliz.local   <none>           <none>
test-etcd-no-clean-n5wpb-1-zookeeper-1                            1/1     Running     0          3m31s   10.97.12.88    qa-node015.zilliz.local   <none>           <none>
test-etcd-no-clean-n5wpb-1-zookeeper-2                            1/1     Running     0          3m31s   10.97.19.213   qa-node016.zilliz.local   <none>           <none>

However, in the following scenario the datanode's memory grows gradually, up to about 2.08 GB:

[screenshot 2022-06-23 14:17:19]
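To confirm the trend is a steady climb rather than periodic spikes, the datanode's working-set series can be pulled from Prometheus directly. A sketch, where the Prometheus address, time window, and pod regex are placeholders:

```
# Fetch the datanode container's working-set memory over the test window and
# print the first and last samples for comparison.
curl -sG 'http://<prometheus-host>:9090/api/v1/query_range' \
  --data-urlencode 'query=container_memory_working_set_bytes{pod=~".*milvus-datanode.*",container!=""}' \
  --data-urlencode 'start=2022-06-23T00:00:00Z' \
  --data-urlencode 'end=2022-06-23T06:00:00Z' \
  --data-urlencode 'step=60s' \
  | jq '.data.result[0].values | first, last'
```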

jingkl avatar Jun 23 '22 06:06 jingkl

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Jul 23 '22 07:07 stale[bot]

2.1.0-20220726-1b33c731 pymilvus 2.1.0dev103

server-instance fouram-tag-no-clean-dn9tn-1 server-configmap server-cluster-8c64m-kafka client-configmap client-random-locust-100m-ddl-r8-w2-100h

fouram-tag-no-clean-dn9tn-1-etcd-0                               1/1     Running     0               2m43s   10.104.5.15    4am-node12   <none>           <none>
fouram-tag-no-clean-dn9tn-1-etcd-1                               1/1     Running     0               2m43s   10.104.4.66    4am-node11   <none>           <none>
fouram-tag-no-clean-dn9tn-1-etcd-2                               1/1     Running     0               2m43s   10.104.9.62    4am-node14   <none>           <none>
fouram-tag-no-clean-dn9tn-1-kafka-0                              1/1     Running     1 (2m30s ago)   2m43s   10.104.9.59    4am-node14   <none>           <none>
fouram-tag-no-clean-dn9tn-1-kafka-1                              1/1     Running     1 (2m30s ago)   2m43s   10.104.4.64    4am-node11   <none>           <none>
fouram-tag-no-clean-dn9tn-1-kafka-2                              1/1     Running     1 (2m31s ago)   2m43s   10.104.6.67    4am-node13   <none>           <none>
fouram-tag-no-clean-dn9tn-1-milvus-datacoord-6d864d76d5-bpmq9    1/1     Running     0               2m43s   10.104.1.4     4am-node10   <none>           <none>
fouram-tag-no-clean-dn9tn-1-milvus-datanode-5864ddd55b-hhq9l     1/1     Running     0               2m43s   10.104.6.66    4am-node13   <none>           <none>
fouram-tag-no-clean-dn9tn-1-milvus-indexcoord-66497695c6-jlsn4   1/1     Running     0               2m43s   10.104.9.56    4am-node14   <none>           <none>
fouram-tag-no-clean-dn9tn-1-milvus-indexnode-7c8bc6f69-zh9nx     1/1     Running     0               2m43s   10.104.1.5     4am-node10   <none>           <none>
fouram-tag-no-clean-dn9tn-1-milvus-proxy-6ff77f88d6-zt84x        1/1     Running     0               2m43s   10.104.1.2     4am-node10   <none>           <none>
fouram-tag-no-clean-dn9tn-1-milvus-querycoord-999d776b7-mj9b2    1/1     Running     0               2m43s   10.104.1.3     4am-node10   <none>           <none>
fouram-tag-no-clean-dn9tn-1-milvus-querynode-84589847bc-n6s9x    1/1     Running     0               2m43s   10.104.5.11    4am-node12   <none>           <none>
fouram-tag-no-clean-dn9tn-1-milvus-rootcoord-8f4ccc977-qcvtq     1/1     Running     0               2m43s   10.104.9.55    4am-node14   <none>           <none>
fouram-tag-no-clean-dn9tn-1-minio-0                              1/1     Running     0               2m43s   10.104.5.14    4am-node12   <none>           <none>
fouram-tag-no-clean-dn9tn-1-minio-1                              1/1     Running     0               2m43s   10.104.9.61    4am-node14   <none>           <none>
fouram-tag-no-clean-dn9tn-1-minio-2                              1/1     Running     0               2m43s   10.104.6.70    4am-node13   <none>           <none>
fouram-tag-no-clean-dn9tn-1-minio-3                              1/1     Running     0               2m43s   10.104.4.70    4am-node11   <none>           <none>
fouram-tag-no-clean-dn9tn-1-zookeeper-0                          1/1     Running     0               2m43s   10.104.9.58    4am-node14   <none>           <none>
fouram-tag-no-clean-dn9tn-1-zookeeper-1                          1/1     Running     0               2m43s   10.104.4.63    4am-node11   <none>           <none>
fouram-tag-no-clean-dn9tn-1-zookeeper-2                          1/1     Running     0               2m43s   10.104.6.68    4am-node13   <none>           <none>

datanode memory: [screenshot 2022-07-28 10:09:28]

The datanode's memory still keeps growing.

This issue will be kept open.

jingkl avatar Jul 28 '22 02:07 jingkl

/unassign /assign @wayblink

soothing-rain avatar Aug 03 '22 03:08 soothing-rain

@wayblink any progress on this one?

xiaofan-luan avatar Aug 10 '22 14:08 xiaofan-luan

@wayblink any progress on this one?

Still working on it. Current status: this case only occurs in the Kafka-based cluster, and the Go memory usage is actually not that large, so it is probably related to Kafka CGO allocations rather than compaction. We are reproducing the test with heaptrack and will analyze it later (a sketch of the comparison follows).
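A sketch of that comparison, assuming heaptrack is available inside the image and that the datanode exposes Go's standard net/http/pprof endpoint (the port and pod/binary names below are assumptions):

```
# 1. Go-side heap as the runtime sees it:
kubectl port-forward pod/<datanode-pod> 9091:9091 &
go tool pprof -top http://localhost:9091/debug/pprof/heap

# 2. Whole-process allocations (including CGO, e.g. librdkafka) via heaptrack,
#    attached to the running milvus process:
kubectl exec -it <datanode-pod> -- sh -c 'heaptrack -p "$(pidof milvus)"'

# 3. Copy the trace out (file name varies by heaptrack version) and list the
#    top allocators:
kubectl cp <datanode-pod>:heaptrack.milvus.<pid>.gz ./
heaptrack_print heaptrack.milvus.<pid>.gz | head -n 40
```

If the Go heap stays flat while heaptrack shows steady growth, that would point at the CGO side.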

wayblink avatar Aug 11 '22 10:08 wayblink

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar Sep 10 '22 14:09 stale[bot]