
[Bug]: Concurrent inserts trigger datanode memory protection but do not trigger sync

Open ThreadDao opened this issue 1 year ago • 9 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: bigsheeper-2303-memory-sync-debug-2eeefc3-20230328
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

  1. Deploy Milvus
  • default config enables datanode.memory.forceSyncEnable
  • datanode resources:
Limits:
     cpu:     2
     memory:  6Gi
   Requests:
     cpu:      2
     memory:   5Gi
  2. Create a collection and create 500 new partitions
  3. Concurrently insert data into random partitions (a reproduction sketch follows the pod listing below)
  4. Datanode memory increases and triggers memory protection, so inserts are denied:
deny to write, reason: memory quota exhausted, please allocate more resources, req: /milvus.proto.milvus.MilvusService/Insert)>
 (api_request.py:48)
  5. But sync is not triggered. You can check the datanode logs in the fouram cluster, qa-milvus namespace:
fouramf-yihao2-op-38-1978-etcd-0                                  1/1     Running            0                 32h     10.104.9.146    4am-node14   <none>           <none>
fouramf-yihao2-op-38-1978-etcd-1                                  1/1     Running            0                 32h     10.104.5.245    4am-node12   <none>           <none>
fouramf-yihao2-op-38-1978-etcd-2                                  1/1     Running            0                 32h     10.104.6.55     4am-node13   <none>           <none>
fouramf-yihao2-op-38-1978-milvus-datacoord-7887849cbb-hhzdc       1/1     Running            0                 32h     10.104.4.125    4am-node11   <none>           <none>
fouramf-yihao2-op-38-1978-milvus-datanode-b54544f89-4knlc         1/1     Running            0                 3h36m   10.104.9.215    4am-node14   <none>           <none>
fouramf-yihao2-op-38-1978-milvus-indexcoord-688578f67b-5f6cf      1/1     Running            0                 32h     10.104.5.253    4am-node12   <none>           <none>
fouramf-yihao2-op-38-1978-milvus-indexnode-79df85b75-rtgvt        1/1     Running            0                 32h     10.104.9.153    4am-node14   <none>           <none>
fouramf-yihao2-op-38-1978-milvus-proxy-6f64df69c-pmmxh            1/1     Running            0                 32h     10.104.5.254    4am-node12   <none>           <none>
fouramf-yihao2-op-38-1978-milvus-querycoord-78bb4765c6-hczvm      1/1     Running            0                 32h     10.104.1.83     4am-node10   <none>           <none>
fouramf-yihao2-op-38-1978-milvus-querynode-55bccdbb59-cw8gd       1/1     Running            0                 32h     10.104.4.126    4am-node11   <none>           <none>
fouramf-yihao2-op-38-1978-milvus-rootcoord-7c6d58879d-jbgbs       1/1     Running            0                 32h     10.104.5.2      4am-node12   <none>           <none>
fouramf-yihao2-op-38-1978-minio-0                                 1/1     Running            0                 32h     10.104.9.147    4am-node14   <none>           <none>
fouramf-yihao2-op-38-1978-minio-1                                 1/1     Running            0                 32h     10.104.5.248    4am-node12   <none>           <none>
fouramf-yihao2-op-38-1978-minio-2                                 1/1     Running            0                 32h     10.104.1.82     4am-node10   <none>           <none>
fouramf-yihao2-op-38-1978-minio-3                                 1/1     Running            0                 32h     10.104.4.115    4am-node11   <none>           <none>
fouramf-yihao2-op-38-1978-pulsar-bookie-0                         1/1     Running            0                 32h     10.104.9.150    4am-node14   <none>           <none>
fouramf-yihao2-op-38-1978-pulsar-bookie-1                         1/1     Running            0                 32h     10.104.5.252    4am-node12   <none>           <none>
fouramf-yihao2-op-38-1978-pulsar-bookie-2                         1/1     Running            0                 32h     10.104.4.124    4am-node11   <none>           <none>
fouramf-yihao2-op-38-1978-pulsar-bookie-init-kkrlh                0/1     Completed          0                 32h     10.104.9.140    4am-node14   <none>           <none>
fouramf-yihao2-op-38-1978-pulsar-broker-0                         1/1     Running            0                 32h     10.104.5.243    4am-node12   <none>           <none>
fouramf-yihao2-op-38-1978-pulsar-proxy-0                          1/1     Running            0                 32h     10.104.5.244    4am-node12   <none>           <none>
fouramf-yihao2-op-38-1978-pulsar-pulsar-init-q7gz9                0/1     Completed          0                 32h     10.104.9.141    4am-node14   <none>           <none>
fouramf-yihao2-op-38-1978-pulsar-recovery-0                       1/1     Running            0                 32h     10.104.9.142    4am-node14   <none>           <none>
fouramf-yihao2-op-38-1978-pulsar-zookeeper-0                      1/1     Running            0                 32h     10.104.5.249    4am-node12   <none>           <none>
fouramf-yihao2-op-38-1978-pulsar-zookeeper-1                      1/1     Running            0                 32h     10.104.6.57     4am-node13   <none>           <none>
fouramf-yihao2-op-38-1978-pulsar-zookeeper-2                      1/1     Running            0                 32h     10.104.9.152    4am-node14   <none>           <none>
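
For reference, a minimal pymilvus reproduction sketch of the steps above. The collection name, schema, dimension, batch size, and thread count here are assumptions for illustration, not the exact parameters of the fouram test run:

```python
# Hypothetical reproduction sketch: collection name, schema, dim, batch size,
# and thread count are assumptions, not the exact fouram test parameters.
import random
import threading

from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="127.0.0.1", port="19530")

# Step 2: create a collection and 500 new partitions.
fields = [
    FieldSchema("id", DataType.INT64, is_primary=True),
    FieldSchema("vec", DataType.FLOAT_VECTOR, dim=128),
]
collection = Collection("mem_protect_repro", CollectionSchema(fields))
partitions = [f"p_{i}" for i in range(500)]
for name in partitions:
    collection.create_partition(name)

# Step 3: concurrently insert into random partitions until the datanode
# reaches its memory watermark and inserts start failing with
# "memory quota exhausted".
def insert_worker(worker_id: int, batches: int = 1000, rows: int = 1000) -> None:
    for batch in range(batches):
        ids = [worker_id * batches * rows + batch * rows + i for i in range(rows)]
        vectors = [[random.random() for _ in range(128)] for _ in range(rows)]
        collection.insert([ids, vectors], partition_name=random.choice(partitions))

threads = [threading.Thread(target=insert_worker, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In the failing run, once the watermark is hit the client keeps seeing the "deny to write, reason: memory quota exhausted" error shown in step 4, whereas the expectation behind datanode.memory.forceSyncEnable is that the datanode would sync segments and release memory.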

Grafana metrics: [image]


Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

ThreadDao avatar Mar 31 '23 11:03 ThreadDao

/assign @bigsheeper

ThreadDao avatar Mar 31 '23 11:03 ThreadDao

Flowgraph & time tick blocked: [images]

bigsheeper avatar Mar 31 '23 11:03 bigsheeper

This msgstream consumes most of the memory:

[image]

bigsheeper avatar Mar 31 '23 11:03 bigsheeper

/unassign

yanliang567 avatar Mar 31 '23 11:03 yanliang567

RootCoord-produced tt latency is healthy, which means the problem is not with the network or Pulsar:

[image]

bigsheeper avatar Apr 03 '23 07:04 bigsheeper

Proxy sent the same channel ts all the time: [image]

bigsheeper avatar Apr 04 '23 08:04 bigsheeper

Here are the root causes:

  1. Proxy’s time tick depends on the DML tasks in its queue. It uses the minimum timestamp of those DML tasks as the time tick when notifying the root coordinator to update the time tick. Therefore, Proxy needs to pop DML tasks when they finish or fail (a simplified sketch of this mechanism follows the code link below).
  2. However, Proxy failed to describe the collection in popPChanStats while popping a DML task, so the task’s channel timestamp was never cleaned up and the time tick could not advance:
[2023/04/07 11:06:08.370 +08:00] [WARN] [retry/retry.go:44] ["retry func failed"] ["retry time"=0] [error="rpc error: code = Canceled desc = context canceled"]
[2023/04/07 11:06:08.370 +08:00] [WARN] [grpcclient/client.go:318] ["ClientBase ReCall grpc first call get error"] [role=rootcoord] [error="err: rpc error: code = Canceled desc = context canceled\n, /home/sheep/workspace/milvus/internal/util/trace/stack_trace.go:51 github.com/milvus-io/milvus/internal/util/trace.StackTrace\n/home/sheep/workspace/milvus/internal/util/grpcclient/client.go:317 github.com/milvus-io/milvus/internal/util/grpcclient.(*ClientBase[...]).ReCall\n/home/sheep/workspace/milvus/internal/distributed/rootcoord/client/client.go:233 github.com/milvus-io/milvus/internal/distributed/rootcoord/client.(*Client).DescribeCollection\n/home/sheep/workspace/milvus/internal/proxy/meta_cache.go:500 github.com/milvus-io/milvus/internal/proxy.(*MetaCache).describeCollection\n/home/sheep/workspace/milvus/internal/proxy/meta_cache.go:224 github.com/milvus-io/milvus/internal/proxy.(*MetaCache).GetCollectionID\n/home/sheep/workspace/milvus/internal/proxy/task_insert.go:86 github.com/milvus-io/milvus/internal/proxy.(*insertTask).getChannels\n/home/sheep/workspace/milvus/internal/proxy/task_scheduler.go:330 github.com/milvus-io/milvus/internal/proxy.(*dmTaskQueue).popPChanStats\n/home/sheep/workspace/milvus/internal/proxy/task_scheduler.go:271 github.com/milvus-io/milvus/internal/proxy.(*dmTaskQueue).PopActiveTask\n/home/sheep/workspace/milvus/internal/proxy/task_scheduler.go:477 github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).processTask.func1\n/home/sheep/workspace/milvus/internal/proxy/task_scheduler.go:511 github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).processTask\n"]

https://github.com/milvus-io/milvus/blob/9c463750330801057cc515a4b129eb6f40e9a1c9/internal/proxy/task_scheduler.go#L307-L334
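
To make the failure mode concrete, here is a simplified conceptual sketch in Python (not the actual Go implementation linked above) of the mechanism described in point 1: the proxy reports the minimum begin timestamp of the DML tasks still in its queue as the time tick, so a task that is never popped pins the reported timestamp and the time tick stops advancing.

```python
# Conceptual sketch, not Milvus code: the proxy-side time tick is the minimum
# timestamp of DML tasks still in the queue, so a task that is never popped
# (e.g. because popPChanStats fails while describing the collection) pins the
# reported time tick indefinitely.
class DmTaskQueue:
    def __init__(self):
        self.active = {}  # task id -> begin timestamp

    def add(self, task_id, ts):
        self.active[task_id] = ts

    def pop(self, task_id):
        # In the bug, this pop is effectively skipped when describing the
        # collection fails, so the task's timestamp keeps holding back the tick.
        self.active.pop(task_id, None)

    def min_ts(self, current_ts):
        # Reported time tick: the oldest in-flight task, or "now" if empty.
        return min(self.active.values(), default=current_ts)


queue = DmTaskQueue()
queue.add("insert-1", ts=100)
queue.add("insert-2", ts=105)
print(queue.min_ts(current_ts=110))  # 100: tick held back by insert-1

queue.pop("insert-2")                # insert-1 is never popped (the bug)
print(queue.min_ts(current_ts=200))  # still 100: time tick stops advancing
```

This matches the earlier observations in the thread: the proxy kept sending the same channel ts, which blocked the flowgraph and time tick and therefore the sync path on the datanode.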

bigsheeper avatar Apr 07 '23 03:04 bigsheeper

@ThreadDao has this been verified and fixed?

binbinlv avatar Apr 11 '23 09:04 binbinlv

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

stale[bot] avatar May 11 '23 11:05 stale[bot]