
[Bug]: Concurrent inserts trigger datanode memory protection but do not trigger sync

Open ThreadDao opened this issue 1 year ago • 9 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: bigsheeper-2303-memory-sync-debug-2eeefc3-20230328
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

  1. Deploy Milvus
  • default config enables datanode.memory.forceSyncEnable
  • datanode resources:
Limits:
     cpu:     2
     memory:  6Gi
   Requests:
     cpu:      2
     memory:   5Gi
  2. Create a collection and create 500 new partitions
  3. Concurrently insert data into random partitions (a reproduction sketch follows the pod listing below)
  4. Datanode memory increases and triggers memory protection, so inserts are denied:
deny to write, reason: memory quota exhausted, please allocate more resources, req: /milvus.proto.milvus.MilvusService/Insert)>
 (api_request.py:48)
  5. But sync is not triggered. You can check the datanode logs in the fouram cluster, qa-milvus namespace:
fouramf-yihao2-op-38-1978-etcd-0                                  1/1     Running            0                 32h     10.104.9.146    4am-node14   <none>           <none>
fouramf-yihao2-op-38-1978-etcd-1                                  1/1     Running            0                 32h     10.104.5.245    4am-node12   <none>           <none>
fouramf-yihao2-op-38-1978-etcd-2                                  1/1     Running            0                 32h     10.104.6.55     4am-node13   <none>           <none>
fouramf-yihao2-op-38-1978-milvus-datacoord-7887849cbb-hhzdc       1/1     Running            0                 32h     10.104.4.125    4am-node11   <none>           <none>
fouramf-yihao2-op-38-1978-milvus-datanode-b54544f89-4knlc         1/1     Running            0                 3h36m   10.104.9.215    4am-node14   <none>           <none>
fouramf-yihao2-op-38-1978-milvus-indexcoord-688578f67b-5f6cf      1/1     Running            0                 32h     10.104.5.253    4am-node12   <none>           <none>
fouramf-yihao2-op-38-1978-milvus-indexnode-79df85b75-rtgvt        1/1     Running            0                 32h     10.104.9.153    4am-node14   <none>           <none>
fouramf-yihao2-op-38-1978-milvus-proxy-6f64df69c-pmmxh            1/1     Running            0                 32h     10.104.5.254    4am-node12   <none>           <none>
fouramf-yihao2-op-38-1978-milvus-querycoord-78bb4765c6-hczvm      1/1     Running            0                 32h     10.104.1.83     4am-node10   <none>           <none>
fouramf-yihao2-op-38-1978-milvus-querynode-55bccdbb59-cw8gd       1/1     Running            0                 32h     10.104.4.126    4am-node11   <none>           <none>
fouramf-yihao2-op-38-1978-milvus-rootcoord-7c6d58879d-jbgbs       1/1     Running            0                 32h     10.104.5.2      4am-node12   <none>           <none>
fouramf-yihao2-op-38-1978-minio-0                                 1/1     Running            0                 32h     10.104.9.147    4am-node14   <none>           <none>
fouramf-yihao2-op-38-1978-minio-1                                 1/1     Running            0                 32h     10.104.5.248    4am-node12   <none>           <none>
fouramf-yihao2-op-38-1978-minio-2                                 1/1     Running            0                 32h     10.104.1.82     4am-node10   <none>           <none>
fouramf-yihao2-op-38-1978-minio-3                                 1/1     Running            0                 32h     10.104.4.115    4am-node11   <none>           <none>
fouramf-yihao2-op-38-1978-pulsar-bookie-0                         1/1     Running            0                 32h     10.104.9.150    4am-node14   <none>           <none>
fouramf-yihao2-op-38-1978-pulsar-bookie-1                         1/1     Running            0                 32h     10.104.5.252    4am-node12   <none>           <none>
fouramf-yihao2-op-38-1978-pulsar-bookie-2                         1/1     Running            0                 32h     10.104.4.124    4am-node11   <none>           <none>
fouramf-yihao2-op-38-1978-pulsar-bookie-init-kkrlh                0/1     Completed          0                 32h     10.104.9.140    4am-node14   <none>           <none>
fouramf-yihao2-op-38-1978-pulsar-broker-0                         1/1     Running            0                 32h     10.104.5.243    4am-node12   <none>           <none>
fouramf-yihao2-op-38-1978-pulsar-proxy-0                          1/1     Running            0                 32h     10.104.5.244    4am-node12   <none>           <none>
fouramf-yihao2-op-38-1978-pulsar-pulsar-init-q7gz9                0/1     Completed          0                 32h     10.104.9.141    4am-node14   <none>           <none>
fouramf-yihao2-op-38-1978-pulsar-recovery-0                       1/1     Running            0                 32h     10.104.9.142    4am-node14   <none>           <none>
fouramf-yihao2-op-38-1978-pulsar-zookeeper-0                      1/1     Running            0                 32h     10.104.5.249    4am-node12   <none>           <none>
fouramf-yihao2-op-38-1978-pulsar-zookeeper-1                      1/1     Running            0                 32h     10.104.6.57     4am-node13   <none>           <none>
fouramf-yihao2-op-38-1978-pulsar-zookeeper-2                      1/1     Running            0                 32h     10.104.9.152    4am-node14   <none>           <none>
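
For reference, a minimal pymilvus reproduction sketch of the steps above. The collection name, schema, dimension, batch size, and thread count here are assumptions for illustration, not the exact parameters of the fouram test run:

```python
# Hypothetical reproduction sketch: collection name, schema, dim, batch size,
# and thread count are assumptions, not the exact fouram test parameters.
import random
import threading

from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="127.0.0.1", port="19530")

# Step 2: create a collection and 500 new partitions.
fields = [
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("vec", DataType.FLOAT_VECTOR, dim=128),
]
collection = Collection("mem_protect_repro", CollectionSchema(fields))
partitions = [f"p_{i}" for i in range(500)]
for name in partitions:
    collection.create_partition(name)

# Step 3: concurrently insert into random partitions until the datanode
# reaches its memory watermark and inserts start failing with
# "memory quota exhausted".
def insert_worker(worker_id: int, batches: int = 1000, rows: int = 1000) -> None:
    for batch in range(batches):
        ids = [worker_id * batches * rows + batch * rows + i for i in range(rows)]
        vectors = [[random.random() for _ in range(128)] for _ in range(rows)]
        collection.insert([ids, vectors], partition_name=random.choice(partitions))

threads = [threading.Thread(target=insert_worker, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In the failing run, once the watermark is hit the client keeps seeing the "deny to write, reason: memory quota exhausted" error shown in step 4, whereas the expectation behind datanode.memory.forceSyncEnable is that the datanode would sync segments and release memory.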

Grafana metrics: [image]


Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

ThreadDao avatar Mar 31 '23 11:03 ThreadDao

/assign @bigsheeper

ThreadDao avatar Mar 31 '23 11:03 ThreadDao

Flowgraph & time tick blocked: [images]

bigsheeper avatar Mar 31 '23 11:03 bigsheeper

This msgstream consumes most of the memory:

[image]

bigsheeper avatar Mar 31 '23 11:03 bigsheeper

/unassign

yanliang567 avatar Mar 31 '23 11:03 yanliang567

RootCoord-produced tt latency is healthy, which means the problem is not with the network or Pulsar:

[image]

bigsheeper avatar Apr 03 '23 07:04 bigsheeper

Proxy sent the same channel ts all the time: [image]

bigsheeper avatar Apr 04 '23 08:04 bigsheeper

Here are the root causes:

  1. Proxy’s time tick depends on the DML tasks in its queue. It uses the minimum timestamp of those DML tasks as the time tick when notifying the root coordinator to update the time tick. Therefore, Proxy needs to pop DML tasks when they finish or fail (a simplified sketch of this mechanism follows the code link below).
  2. However, Proxy failed to describe the collection in popPChanStats while popping a DML task, so the task’s channel timestamp was never cleaned up and the time tick could not advance:
[2023/04/07 11:06:08.370 +08:00] [WARN] [retry/retry.go:44] ["retry func failed"] ["retry time"=0] [error="rpc error: code = Canceled desc = context canceled"]
[2023/04/07 11:06:08.370 +08:00] [WARN] [grpcclient/client.go:318] ["ClientBase ReCall grpc first call get error"] [role=rootcoord] [error="err: rpc error: code = Canceled desc = context canceled\n, /home/sheep/workspace/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/home/sheep/workspace/milvus/internal/util/grpcclient/client.go:317 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n/home/sheep/workspace/milvus/internal/distributed/rootcoord/client/client.go:233 github.com/milvus-io/milvus/internal/distributed/rootcoord/client.(*Client).DescribeCollection\n/home/sheep/workspace/milvus/internal/proxy/meta_cache.go:500 github.com/milvus-io/milvus/internal/proxy.(*MetaCache).describeCollection\n/home/sheep/workspace/milvus/internal/proxy/meta_cache.go:224 github.com/milvus-io/milvus/internal/proxy.(*MetaCache).GetCollectionID\n/home/sheep/workspace/milvus/internal/proxy/task_insert.go:86 github.com/milvus-io/milvus/internal/proxy.(*insertTask).getChannels\n/home/sheep/workspace/milvus/internal/proxy/task_scheduler.go:330 github.com/milvus-io/milvus/internal/proxy.(*dmTaskQueue).popPChanStats\n/home/sheep/workspace/milvus/internal/proxy/task_scheduler.go:271 github.com/milvus-io/milvus/internal/proxy.(*dmTaskQueue).PopActiveTask\n/home/sheep/workspace/milvus/internal/proxy/task_scheduler.go:477 github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).processTask.func1\n/home/sheep/workspace/milvus/internal/proxy/task_scheduler.go:511 github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).processTask\n"]

https://github.com/milvus-io/milvus/blob/9c463750330801057cc515a4b129eb6f40e9a1c9/internal/proxy/task_scheduler.go#L307-L334
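
To make the failure mode concrete, here is a simplified conceptual sketch in Python (not the actual Go implementation linked above) of the mechanism described in point 1: the proxy reports the minimum begin timestamp of the DML tasks still in its queue as the time tick, so a task that is never popped pins the reported timestamp and the time tick stops advancing.

```python
# Conceptual sketch, not Milvus code: the proxy-side time tick is the minimum
# timestamp of DML tasks still in the queue, so a task that is never popped
# (e.g. because popPChanStats fails while describing the collection) pins the
# reported time tick indefinitely.
class DmTaskQueue:
    def __init__(self):
        self.active = {}  # task id -> begin timestamp

    def add(self, task_id, ts):
        self.active[task_id] = ts

    def pop(self, task_id):
        # In the bug, this pop is effectively skipped when describing the
        # collection fails, so the task's timestamp keeps holding back the tick.
        self.active.pop(task_id, None)

    def min_ts(self, current_ts):
        # Reported time tick: the oldest in-flight task, or "now" if empty.
        return min(self.active.values(), default=current_ts)


queue = DmTaskQueue()
queue.add("insert-1", ts=100)
queue.add("insert-2", ts=105)
print(queue.min_ts(current_ts=110))  # 100: tick held back by insert-1

queue.pop("insert-2")                # insert-1 is never popped (the bug)
print(queue.min_ts(current_ts=200))  # still 100: time tick stops advancing
```

This matches the earlier observations in the thread: the proxy kept sending the same channel ts, which blocked the flowgraph and time tick and therefore the sync path on the datanode.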

bigsheeper avatar Apr 07 '23 03:04 bigsheeper

@ThreadDao has this been verified and fixed?

binbinlv avatar Apr 11 '23 09:04 binbinlv

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar May 11 '23 11:05 stale[bot]